Gene mapping in the wild with SNPs: guidelines and future directions
- Cite this article as:
- Slate, J., Gratten, J., Beraldi, D. et al. Genetica (2009) 136: 97. doi:10.1007/s10709-008-9317-z
One of the biggest challenges facing evolutionary biologists is to identify and understand loci that explain fitness variation in natural populations. This review describes how genetic (linkage) mapping with single nucleotide polymorphism (SNP) markers can lead to great progress in this area. Strategies for SNP discovery and SNP genotyping are described and an overview of how to model SNP genotype information in mapping studies is presented. Finally, the opportunity afforded by new generation sequencing and typing technologies to map fitness genes by genome-wide association studies is discussed.
Keywords: Gene discovery, QTL, Linkage, Mapping, SNP, Wild population
Longitudinal studies of wild animal populations have proven invaluable for studying selection, genetic architecture and microevolution of fitness-related traits, especially since the widespread uptake of the ‘animal model’ approach to quantitative genetic parameter estimation (Kruuk 2004; Kruuk and Hill 2008; Merilä et al. 2001). Unfortunately, identifying the actual genes responsible for variation in fitness has proven more difficult, even though the statistical framework and some appropriate study populations have been available for some time (Slate 2005). To date, there are relatively few examples of quantitative trait locus (QTL) studies being conducted in unmanipulated, wild populations (Beraldi et al. 2007a, b; Slate et al. 2002) and these studies have only identified approximate locations of a limited number of QTL. However, there is now a great opportunity to synthesise gene discovery with quantitative genetic studies of wild populations due to the increasing ease (and decreasing cost) with which genomics studies can now be conducted in non-model organisms (Ellegren 2008; Ellegren and Sheldon 2008).
To date, QTL mapping studies in the wild have all been conducted by typing a suite of microsatellite markers, originally identified in closely related model organisms. Although microsatellites have a number of well-documented properties that make them excellent for molecular ecology research (Jarne and Lagoda 1996), they are not ideal for gene mapping. The main limitation of microsatellites is that typing methods are not highly automated; it is difficult to type more than about ten loci in a single reaction. Relative to other markers they are not as abundant in the genome, and marker discovery has traditionally been time-consuming, especially when large numbers of loci are required. There is a suspicion that previous mapping studies in wild populations have approached the limits of what can be realistically achieved with microsatellites and a pedigree of several hundred individuals; i.e. low marker density genome scans that yield crude estimates of QTL location and magnitude.
Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic polymorphism in most, if not all, genomes. In recent years SNPs have attracted growing interest from researchers who have recognised their potential for addressing a number of outstanding questions in evolutionary biology and ecology (Luikart et al. 2003; Morin et al. 2004; Nielsen 2005). The advantages (and disadvantages) of SNPs relative to other types of molecular marker have been reviewed elsewhere (Morin et al. 2004), and it is not the aim of this article to duplicate that material. However, a relatively new application of SNPs is as a tool for carrying out gene mapping experiments in wild vertebrate populations. The main reasons for using SNPs over (or in addition to) microsatellites in mapping studies are that (i) they can be typed on a much larger scale and (ii) they are much more abundant, meaning that any genomic location can be analysed.
Our research groups have been using SNPs for mapping experiments for approximately 5 years, and we have witnessed a dramatic change in the ways in which SNPs can be identified, genotyped and analysed. The aim of this article is to provide an overview of some of the methods, problems and pitfalls we have encountered during this period, which we hope will act as a guide to others wishing to carry out similar projects. We are mostly interested in using SNPs to map genes underlying traits under selection in unmanipulated, pedigreed vertebrate populations and refer the reader to other reviews for the underlying rationale behind this work (Slate 2005; Ellegren and Sheldon 2008; Kruuk et al. 2008). The main areas we discuss are: methods of SNP discovery, SNP typing, and analyses of SNP data in mapping studies. We also discuss the feasibility of performing genome-wide association studies in wild populations using many thousands of markers. Examples from our own research laboratories are used to compare alternative methods and approaches, but the points we address are generally applicable to other laboratories, other taxonomic groups and other evolutionary questions. In particular, the described methods are relevant to the QTL or population genomics approaches to the identification of loci involved in population divergence, reproductive isolation and speciation (sensu Rogers and Bernatchez 2005), a topic that is addressed in another paper in this volume (Butlin 2008).
Methods for SNP discovery
There are many different methods of SNP detection available to molecular ecologists studying non-model organisms. Broadly, these can be divided into two categories: (i) sequencing of targeted individual genomic regions and (ii) random sequencing of genomic regions, followed by identification of segregating SNPs. The two strategies are complementary rather than competing, and we have used both approaches in the course of our research.
EPIC and related approaches
The EPIC (exon-primed intron-crossing) approach is now beginning to be employed specifically for gene mapping projects, both for genome scans and for studies that focus on specific genes. For genome-wide or chromosome-wide scans, genomic resources in closely related model species can be used to design primers that are approximately evenly spaced (based on their predicted locations in the related model organism). Thus, a panel of SNPs can be identified that are suitable for linkage map construction; an approach that has most notably been employed in studies of wild passerine bird populations, where the sequenced chicken genome can be used as a comparative genomics reference (Backström et al. 2006a, 2008; Hale et al. 2008).
EPIC has also been employed for designing SNPs in genes that are the focus of a candidate gene study. For example, Gratten et al. (2007) identified SNPs in five genes regarded as candidates for a coat colour polymorphism in a free-living population of Soay sheep. SNPs were initially identified in intronic regions, because it was reasoned that they would be most prevalent in introns of each candidate, and would likely be in linkage disequilibrium (LD) with causative mutations whether they were coding or regulatory. An intronic SNP in the gene Tyrosinase related protein 1 (Tyrp1) was found to be associated with coat colour, and linkage mapping of the gene confirmed it co-localised to the coat colour locus, which had been mapped to a small region of sheep chromosome 2 with a genome-wide panel of microsatellites (Beraldi et al. 2006). Subsequent sequencing of the Tyrp1 coding region identified a non-synonymous substitution in a highly conserved site, which is probably causative for the polymorphism (Gratten et al. 2007).
In summary, SNP discovery by EPIC sequencing is a reliable method that can be used to target specific gene regions. The main disadvantage of this approach is that it is a relatively laborious method as each locus has to be investigated individually.
The second main approach to SNP discovery involves sequencing of random genomic fragments in a limited number of individuals, followed by SNP discovery and validation. Some success has been achieved by capillary sequencing of random clones from genomic DNA libraries. For example, Rosenblum et al. (2007) identified 158 SNPs in a lizard species while Lin et al. (2007) used a similar approach to identify >40 SNPs in a bird species. A related approach was adopted by Adams et al. (2006) who identified SNPs in clones that were originally sequenced as part of a microsatellite library construction.
One method with the potential to rapidly generate large numbers of SNPs is to examine existing expressed sequence tag (EST) databases for putative SNPs. Provided sufficient numbers of ESTs are available for redundant sequences to be aligned, it is possible to identify SNPs in silico using a number of different computer programs such as PolyBayes (Marth et al. 1999), AutoSNP (Barker et al. 2003), SNPDetector (Zhang et al. 2005), PolyScan (Chen et al. 2007) and QualitySNP (Tang et al. 2006). This approach to SNP discovery has been used in humans (Irizarry et al. 2000), model organisms (Fahrenkrug et al. 2002; Schmid et al. 2003; Stone et al. 2002) and more recently in non-model systems such as poplar leaf rust (Feau et al. 2007) and Bicyclus butterflies (Beldade et al. 2006).
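The core idea of in silico SNP discovery from redundant ESTs can be illustrated with a minimal Python sketch. It is not the algorithm of any of the programs cited above, which additionally weigh base quality scores and screen for paralogous sequences. The function name, the thresholds (`min_depth`, `min_minor_count`) and the toy alignment are all hypothetical, and the reads are assumed to be already aligned to equal length.

```python
from collections import Counter

def putative_snps(aligned_reads, min_depth=4, min_minor_count=2):
    """Return (position, allele_counts) for alignment columns that look polymorphic.

    aligned_reads: equal-length strings over A/C/G/T, with any other
    character (e.g. '-' or 'N') treated as a gap or missing base.
    """
    snps = []
    for pos in range(len(aligned_reads[0])):
        column = [read[pos] for read in aligned_reads if read[pos] in "ACGT"]
        counts = Counter(column)
        if len(column) < min_depth or len(counts) < 2:
            continue  # too few reads, or only one base observed
        # Require the second most common base to be seen repeatedly, so that
        # a single sequencing error is not reported as a SNP.
        minor_count = counts.most_common(2)[1][1]
        if minor_count >= min_minor_count:
            snps.append((pos, dict(counts)))
    return snps

# Hypothetical cluster of five aligned ESTs from the same transcript.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTTCGTAC",
    "ACGTTCGTAC",
    "ACGTACGAAC",
]
print(putative_snps(reads))  # [(4, {'A': 3, 'T': 2})]
```

The singleton mismatch at position 7 is discarded as a likely sequencing error, whereas position 4, where the minor base is seen twice, is retained as a candidate for validation.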
Table: Relative comparison of SNP discovery methods. [Table body not recovered; surviving labels: Existing EST databases; Cost per SNP; Intronic or exonic]
Methods for SNP typing
In much the same way that SNP detection methods vary depending on the scale of the project, there are a large number of alternative SNP typing strategies, the relative suitability of which depends on the number of loci and individuals that require typing.
Methods used in our group
Table: Comparison of three SNP typing methods. [Table body not recovered; surviving entries: Sequencer, robotics, two laboratories; Beadstation, unless outsourced; In-house or outsourced; Cost per locus; Very rapid, excellent service]
For in-house SNP genotyping we have developed an allele-specific PCR-based method termed SNP-SCALE (Hinten et al. 2007) that uses locked nucleic acids (LNAs) at the 3’-SNP positions of primers to enhance allele specificity. This method does not require specialist equipment (we use the same capillary sequencer set-up for screening microsatellites, AFLPs, SNPs and DNA sequencing) and it is flexible in terms of the number of loci and individuals that can be typed. The SNP-SCALE method has recently been refined and extended such that multiplexing of 25–30 loci is now possible (Kenta et al. 2008).
Table: Effect on IBD coefficient estimation of adding SNPs to a microsatellite map. [Table body not recovered; surviving column headers: Increase in variance relative to msats (%); Proportion of theoretical maximum variance (%). Surviving row labels: 1 (224.5 cM); 2 (249.5 cM); 3 (279.5 cM)]
The third approach we have taken for SNP genotyping is to outsource genotyping to a service provider (in our case the GoldenGate platform provided by Illumina). This system, which requires specialist equipment (a ‘beadstation’), is cost-effective and rapid provided large numbers of SNPs (384 or more) are typed. GoldenGate uses allele-specific extension followed by PCR to assay SNPs. PCR products are bound to beads on a Sentrix® microarray which is then read by the beadstation. We have used this approach to type 1536 zebra finch SNPs identified in silico, of which 1298 (85%) could be typed. There was a 97% call rate among genotyped loci, 100% reproducibility and 99.8% Mendelian inheritance consistency (Stapley et al. 2008). These figures are slightly lower than the manufacturer’s advertised benchmarks (93% conversion rate and 99.9% call rate), although some DNA samples in our mapping panel came from material known to be of low quality and/or quantity. Generally, one should expect that DNA obtained from natural populations will often be of lower quality than is typically used in studies from model organisms or human subjects, because sampling may have taken place in difficult conditions, there may have been a delay between sampling and extraction, DNA may have been archived in freezers for considerable periods and small amounts of material may have been sampled. Although it would be facile to make a direct comparison between our SNPlex and GoldenGate data, because different samples and loci were compared, we believe the data obtained with GoldenGate were of higher quality.
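Quality metrics such as call rate and Mendelian inheritance consistency are straightforward to compute once genotypes and pedigree links are available. The Python sketch below is an illustration only: the function names and the genotype encoding (two-character strings, `None` for a failed call) are assumptions, and the check applies to autosomal biallelic SNPs (sex-linked loci would need separate handling).

```python
def call_rate(genotypes):
    """Fraction of genotype calls that are not missing (None)."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)

def mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one allele per parent.

    Genotypes are two-character strings such as 'AG'; allele order is ignored.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

# Hypothetical calls for one SNP across four individuals (one failed call).
print(call_rate(["AA", None, "AG", "GG"]))      # 0.75

# Hypothetical parent-offspring trios.
print(mendelian_consistent("AG", "AA", "GG"))   # True
print(mendelian_consistent("GG", "AA", "AG"))   # False: mother cannot transmit G
```

Loci showing repeated inconsistencies across trios are candidates for exclusion (or for re-examination of the pedigree) before map construction.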
There are alternative methods for typing the numbers of loci that might be required for mapping studies. We do not have experience with these alternatives, so do not discuss them further. However, popular medium-throughput methods, many of which are offered by service providers, include the Beckman SNPStream (12 or 48-plex) platform (Bell et al. 2002) and the Sequenom iPLEX (up to 40-plex) assay, performed on the MassArray platform (Buetow et al. 2001).
In summary, we find SNP-SCALE to be an excellent method when small to medium numbers of SNPs need to be typed (often in a large number of individuals), while outsourcing to providers of the GoldenGate platform works better for larger numbers of SNPs. Typically, large numbers (100s) of SNPs might be typed when performing an initial linkage scan, while more modest numbers might be typed when performing association studies on a more limited number of genomic regions; see for example Gratten et al. (2007).
Prior to the collection of SNP genotype data there are a number of analytical questions that need to be addressed. How many SNPs are required to build genetic linkage maps (given the size of a mapping panel, the size of a genome and the marker density required)? How does one use SNP data to detect QTL by linkage mapping? How does one use SNP data to examine whether variation at candidate genes explains trait variation? Can adding SNPs to a map improve the power and resolution of QTL detection? In this last section we consider these questions, using empirical examples wherever possible. However, it must be remembered that SNPs have only been applied to mapping projects in a handful of natural populations to date, and further data are required before general conclusions can be reached.
Building linkage maps with SNPs
To date most linkage maps of wild populations have been constructed using microsatellites (Beraldi et al. 2006; Hansson et al. 2005; Slate et al. 2002), as their high levels of genetic variability mean they are informative about whether recombination has (or has not) occurred between markers during meiosis. Of course, SNPs are less variable and therefore a larger number of loci are required to construct linkage maps. However, simulation studies show that SNPs at 2 cM (and probably larger) intervals are able to produce robust and accurate linkage maps in typical pedigrees of wild populations (Slate 2008). Maps built entirely from SNPs have been used to map the Z chromosome of the collared flycatcher Ficedula albicollis (Backström et al. 2006a), while maps combining microsatellites and SNPs have been used to study the homologue of chicken chromosome 7 in various passerine birds (Hale et al. 2008). Although a higher density of SNPs than microsatellites is required to map genomes, this constraint is unlikely to be a problem as high-throughput typing becomes the norm. Furthermore, newer SNP typing technologies have error rates that are considerably lower than those of microsatellites (indeed, some SNP platforms have error rates lower than the mutation rate of some microsatellites), and so map error or map inflation due to typing error are likely to be less problematic than is the case for microsatellites (Slate 2008). Resolution of map errors caused by genotyping error can be a frustrating and time-consuming process. Prior to performing SNP-based map construction it is certainly worth performing simulation studies to ensure that marker density will be sufficiently high to detect linkage between syntenic markers. For example, marker data segregating in a mapping panel can be simulated with predetermined variability and chromosomal positions, with software such as SimPed (Leal et al. 2005).
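Whether linkage between two markers is detectable depends on how many informative meioses the pedigree provides. As a rough guide, the standard two-point LOD score can be sketched in Python as follows. This is a simplification that assumes phase-known, fully informative meioses; in real pedigrees (and for SNPs in particular) many meioses are uninformative, which is why simulation with dedicated software remains advisable. The function names and example counts are hypothetical.

```python
import math

def lod_score(n_meioses, n_recombinants, theta):
    """Two-point LOD score: log10 likelihood at theta vs. free recombination (0.5).

    theta must lie strictly between 0 and 1 unless the corresponding
    count (recombinants or non-recombinants) is zero.
    """
    k, n = n_recombinants, n_meioses
    log_l = 0.0
    if k > 0:
        log_l += k * math.log10(theta)
    if n - k > 0:
        log_l += (n - k) * math.log10(1.0 - theta)
    return log_l - n * math.log10(0.5)

def max_lod(n_meioses, n_recombinants):
    """Maximum-likelihood recombination fraction (capped at 0.5) and its LOD."""
    theta_hat = min(n_recombinants / n_meioses, 0.5)
    return theta_hat, lod_score(n_meioses, n_recombinants, theta_hat)

# Hypothetical example: 2 recombinants observed among 100 informative meioses.
theta_hat, lod = max_lod(100, 2)
print(f"theta = {theta_hat:.2f}, LOD = {lod:.1f}")
```

With no recombinants, each informative meiosis contributes log10(2), or about 0.3, to the LOD score, so roughly ten informative meioses are needed to reach the conventional threshold of 3 for a tightly linked marker pair.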
Does the addition of SNPs to microsatellite maps help detect QTL?
QTL mapping studies conducted in natural populations to date (Beraldi et al. 2007a, b; Slate et al. 2002) have used evenly spaced microsatellite markers, typically at low density (e.g. 15 cM intervals), and then employed a two-step variance component approach to QTL detection (George et al. 2000; Slate 2005). The first stage in this process is to estimate the proportion of alleles that are identical-by-descent (IBD) at every genomic location that is to be tested for a QTL, e.g. at 2 cM intervals. This means that IBD coefficients are estimated from markers that may be some distance (5–10 cM) from the test location. The second step is to fit the IBD matrix as a random effect in an ‘animal model’ (a form of linear mixed model widely used in quantitative genetics). Studies to date have reported QTL of marginal genomewide significance (Beraldi et al. 2007b; Slate et al. 2002). In principle, typing additional markers in a region can improve IBD estimates at putative QTL locations. These improved estimates should enhance the power to detect (or disprove) QTL, as well as providing more accurate estimates of QTL position and magnitude.
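To see why marker distance matters for IBD estimation, it helps to convert map distance into the probability of recombination between a marker and the test location. The Python sketch below uses Haldane's mapping function, which assumes no crossover interference; the choice of mapping function is an assumption made here for illustration and is not specified by the studies cited above.

```python
import math

def haldane_recombination_fraction(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

for cm in (2, 5, 10, 15):
    r = haldane_recombination_fraction(cm)
    print(f"{cm:>2} cM -> recombination fraction {r:.3f}")
```

Under this function a marker 10 cM from the test location recombines with it in about 9% of meioses, so marker-based IBD estimates at that location carry corresponding uncertainty; denser markers reduce it.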
How to model SNPs in mapping studies?
Linkage mapping in natural populations by the two-step variance components method outlined above involves fitting the estimated IBD matrix as a random effect in a mixed effects linear model (George et al. 2000; Slate 2005). The variance component associated with the IBD matrix gives an estimate of QTL magnitude, and its statistical significance is assessed by likelihood ratio tests (by making a comparison to a model with the QTL random effect excluded). Several points are perhaps not immediately obvious until this type of analysis has been performed. First, QTL effects are only reported as a proportion of trait variation explained; mean trait values cannot be assigned to individual alleles or genotypes. Second, only additive genetic effects at the QTL are estimated; this is in contrast to least squares linear regression or maximum likelihood approaches used in F2 crosses, where additive and dominance effects can be estimated (Haley and Knott 1992; Haley et al. 1994).
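The model described above can be written out in a common textbook notation. The symbols are introduced here for illustration and are not taken from the cited papers: y is the vector of phenotypes, β the fixed effects with design matrix X, a the polygenic effects with pedigree relationship matrix A, q the QTL effects with IBD matrix Π at the test location, Z the incidence matrix linking records to individuals, and e the residuals (assumed independent).

```latex
\begin{align}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \mathbf{Z}\mathbf{q} + \mathbf{e}, \\
\operatorname{Var}(\mathbf{y}) &= \mathbf{Z}\mathbf{A}\mathbf{Z}^{\top}\sigma^2_a
  + \mathbf{Z}\boldsymbol{\Pi}\mathbf{Z}^{\top}\sigma^2_q
  + \mathbf{I}\sigma^2_e, \\
h^2_q &= \frac{\sigma^2_q}{\sigma^2_a + \sigma^2_q + \sigma^2_e}, \\
\mathrm{LRT} &= -2\left[\ln L\!\left(\sigma^2_q = 0\right) - \ln L\!\left(\hat{\sigma}^2_q\right)\right].
\end{align}
```

Here the third line is the proportion of trait variance explained by the QTL, and the fourth is the likelihood ratio statistic comparing the model without the QTL effect to the full model. Because only variances are estimated, the formulation makes clear why no mean trait value can be attached to a particular allele or genotype.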
An alternative approach to detecting QTL with SNPs is to fit a SNP genotype as a fixed effect in a linear model (or in a mixed effects ‘animal model’ where polygenic effects are accounted for as a random effect). Although this approach is intuitively appealing (as the mean value of each genotype can be evaluated) it is highly prone to Type 1 error as population stratification can yield false associations between genotype and phenotype. One scenario where this approach may be justified is when a handful of candidate genes are being evaluated for linkage to a single locus trait, and associations can be tested by Fisher’s Exact Test or other contingency table type tests. However, it is still preferable to confirm putative associations by linkage analysis (e.g. Gratten et al. 2007).
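For a single-locus trait and a biallelic SNP, the contingency-table test mentioned above can be computed directly. The following Python sketch implements a two-sided Fisher's Exact Test for a 2 × 2 table using only the standard library; the function name and the example counts are hypothetical, and the two-sided p-value is defined here as the summed probability of all tables no more probable than the observed one (one common convention).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (a small tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are phenotype classes, columns are SNP alleles.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p:.4f}")  # p = 0.0350
```

As the text cautions, a small p-value from such a test does not exclude population stratification, so confirmation by linkage analysis remains preferable.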
One feature of linkage analysis is that test locations need not be particularly close (>1 cM) to a causative mutation, yet they are still able to detect an association to a trait of interest. This is because linkage analysis is sensitive to recombination events between the marker and the causative locus within the mapping pedigree members only, while association studies are sensitive to historical recombination events that pre-date the pedigree. Although this means that low density marker coverage can detect linkage, it also means the confidence interval surrounding a QTL is wide. This problem can be remedied by typing additional SNPs around a candidate region and then performing association studies. This approach has rarely been taken in natural populations, although one exception is reported by Gratten et al. (2008). By performing transmission disequilibrium tests or TDTs (Hernandez-Sanchez et al. 2003), Gratten et al. simultaneously tested for linkage and linkage disequilibrium between a SNP and a locus (or loci) affecting both body size and lifetime fitness in Soay sheep. By testing for linkage as well as LD the problem of false positives in association mapping studies is remedied. Current methods for performing TDTs in general pedigrees are computationally quite demanding, although methods suited to this type of analysis continue to be refined (Chen and Abecasis 2006, 2007).
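The logic of the TDT is easiest to see in its classic form for a binary trait in parent-offspring trios, sketched below in Python. This is a deliberate simplification: the pedigree-based, quantitative-trait TDT used in the work cited above is considerably more involved, and the function name and counts here are hypothetical. Among heterozygous parents, the test asks whether one allele is transmitted to affected offspring more often than the 50% expected in the absence of linkage and LD.

```python
import math

def tdt(transmitted, untransmitted):
    """Classic TDT statistic (chi-square, 1 df) and its p-value.

    transmitted / untransmitted: number of times the focal allele was or was
    not passed from heterozygous parents to offspring.
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 df, via the
    # complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: allele transmitted 60 times, not transmitted 40 times.
chi2, p = tdt(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # chi2 = 4.00, p = 0.0455
```

Because the comparison is made within families, population stratification inflates neither count preferentially, which is the property that protects the TDT against the false positives described above.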
The future—whole genome association analyses in wild populations
The advent of ultra-high throughput sequencing means that it will be possible to discover tens of thousands of SNPs in ecological organisms. In principle, and depending on the extent of linkage disequilibrium in the genome, it would then be possible to perform typing of many thousands of SNPs in a large enough number of individuals to perform whole-genome association mapping without first conducting linkage mapping. Several platforms now exist for typing thousands of SNPs on a chip (Gunderson et al. 2005; Hardenbol et al. 2005; Syvanen 2001). Although these have mostly been developed for humans (e.g. the Affymetrix GeneChip® 500K array set, the Illumina Human1M BeadChip) or model organisms (e.g. Illumina’s canine SNP20 and bovine SNP50 BeadChips with >20,000 and >50,000 SNPs respectively), the technology can be used in any organism. Excitingly, both of the main providers of SNP chips offer the opportunity to develop customised panels: up to 60,000 SNPs on the Illumina Infinium iSelect platform, and up to 10,000 SNPs per kit (with the opportunity to construct multiple kits) on the Affymetrix GeneChip® system. At present, the idea of typing 10s or even 100s of thousands of SNPs in wild populations may seem fanciful, but studies of this kind will shortly be upon us. For example, a 60K domestic sheep SNP chip will be available in 2008, and preliminary data suggest that two thirds of the SNPs will be segregating in a wild Soay sheep population, which will likely be typed on this platform shortly.
How many SNPs for genome-wide association mapping?
When studies of linkage disequilibrium were first carried out in humans it soon became apparent that regions of high linkage disequilibrium (haploblocks) were prevalent throughout the genome (Goldstein 2001; Reich et al. 2001; Stephens et al. 2001; Weiss and Clark 2002); these blocks are usually separated by recombination hotspots. Typing many SNPs from the same haploblocks is redundant for genome-wide association scans, and so a better strategy is to type the minimal number of SNPs that describe the main haplotypes within each block (so-called tagSNPs). Considerable efforts have been taken to optimise strategies for tagSNP selection in humans (Carlson et al. 2004; Zhang et al. 2002), and the larger (250–500K) SNP chip arrays have sufficient power to identify disease-causing variants by association mapping (Docherty et al. 2007). Work is underway to estimate LD in other organisms (Aerts et al. 2007; Heifetz et al. 2005; Morrell et al. 2005; Nordborg et al. 2002; Nsengimana et al. 2004; Remington et al. 2001; Sutter et al. 2004), which can then be used to estimate how many tagSNPs are required for genome-wide association mapping. For example, two estimates from dairy cattle suggest that just 30–100K SNPs will suffice (Khatkar et al. 2007; McKay et al. 2007), mainly because LD extends long distances in cattle (Farnir et al. 2000).
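The redundancy argument behind tagSNPs can be made concrete with a small Python sketch. It computes the LD statistic r² between pairs of biallelic SNPs from phased haplotypes and then applies a simple greedy heuristic: repeatedly pick the SNP that tags the most still-untagged SNPs. This is an illustrative heuristic rather than the procedure of any cited study; the function names, the toy haplotypes and the r² threshold of 0.8 are assumptions.

```python
from itertools import combinations

def r_squared(hap_a, hap_b):
    """LD r^2 between two biallelic SNPs coded 0/1 across phased haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0

def greedy_tag_snps(haplotypes, threshold=0.8):
    """haplotypes: dict of SNP name -> list of 0/1 alleles. Returns tag SNP names."""
    names = list(haplotypes)
    tagged_by = {s: {s} for s in names}  # every SNP tags itself
    for s, t in combinations(names, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            tagged_by[s].add(t)
            tagged_by[t].add(s)
    untagged, tags = set(names), []
    while untagged:
        # Pick the SNP covering the most SNPs that are not yet tagged.
        best = max(names, key=lambda s: len(tagged_by[s] & untagged))
        tags.append(best)
        untagged -= tagged_by[best]
    return tags

# Hypothetical data: eight phased haplotypes at four SNPs.
haps = {
    "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to snp1 (r^2 = 1)
    "snp3": [1, 1, 0, 0, 1, 0, 1, 0],  # complement of snp1 (r^2 = 1)
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],  # weakly correlated with the others
}
print(greedy_tag_snps(haps))  # ['snp1', 'snp4']
```

In this toy example two of the four SNPs capture all the variation at the chosen threshold; where LD extends over long distances, as reported for cattle, the saving is proportionally greater.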
If researchers studying wild populations are to conduct whole genome association studies using SNP chips then a first step is to measure the extent of linkage disequilibrium in the genomes of wild populations. Studies of this type are in their infancy (Backström et al. 2006b; Slate and Pemberton 2007), but are essential to evaluate how many tagSNPs are required to perform genome-wide association scans. The cost of studies of this type can be substantially reduced if pooling of individuals from extremes of a trait distribution can be performed and SNP allele frequencies estimated from the pools (Macgregor et al. 2008). One attractive feature of association studies is that pedigrees are not necessary, so potentially a larger number of wild populations will be amenable to this type of analysis.
SNPs are now being used for a number of different applications in molecular ecology research, including gene mapping. Advances in DNA sequencing and typing technologies mean that mapping studies are now feasible in any non-model organism for which adequate phenotypic or life history data are available. Indeed, the biggest challenge in gene mapping studies in the wild is the painstaking collection of field data, as there are no technology-driven shortcuts to this component of the work. In the next 5 years we expect to see more mapping projects being carried out in pedigreed wild populations, although we caution that moving from QTL detection to identification of the actual underlying gene or mutation will be very difficult. Therefore, researchers should carefully consider what they want to get from a mapping project before embarking on one. Simple detection of a QTL and reporting of its magnitude may not reveal much about fitness variation or microevolution in the wild. However, mapping does have the potential to build on the quantitative genetic studies conducted to date, including yielding a greater understanding of the architecture of genetic correlations and gene by environment interaction. Furthermore, if causative SNPs (or SNPs in near-perfect LD with a causative SNP) can be found, it will be possible to combine population genetic and quantitative genetic approaches to studying fitness variation, such that selection on underlying genotypes can be identified sensu Gratten et al. (2008). These are exciting times for researchers studying the genetics of wild populations, and we eagerly await the findings of further mapping projects.
This article was prepared for a workshop on Ecological Genomics that was organised by Jacob Höglund and Gernot Segelbacher, and funded by the European Science Foundation (ESF). The authors have benefitted from insightful discussion on this and related topics with Terry Burke, Peter Visscher, Gavin Hinten and Allan McRae. Peter Visscher made the suggestion to study the variance in halfsib IBD coefficients as an indicator of marker informativeness.