Genetica, Volume 136, Issue 1, pp 97–107

Gene mapping in the wild with SNPs: guidelines and future directions

Authors

  • Jon Slate
    • Department of Animal & Plant Sciences, University of Sheffield
  • Jake Gratten
    • Department of Animal & Plant Sciences, University of Sheffield
  • Dario Beraldi
    • Institute of Evolutionary Biology, University of Edinburgh
  • Jessica Stapley
    • Department of Animal & Plant Sciences, University of Sheffield
  • Matt Hale
    • Department of Animal & Plant Sciences, University of Sheffield
    • Department of Forestry and Natural Resources, Purdue University
  • Josephine M. Pemberton
    • Institute of Evolutionary Biology, University of Edinburgh
Article

DOI: 10.1007/s10709-008-9317-z

Cite this article as:
Slate, J., Gratten, J., Beraldi, D. et al. Genetica (2009) 136: 97. doi:10.1007/s10709-008-9317-z

Abstract

One of the biggest challenges facing evolutionary biologists is to identify and understand loci that explain fitness variation in natural populations. This review describes how genetic (linkage) mapping with single nucleotide polymorphism (SNP) markers can lead to great progress in this area. Strategies for SNP discovery and SNP genotyping are described and an overview of how to model SNP genotype information in mapping studies is presented. Finally, the opportunity afforded by new generation sequencing and typing technologies to map fitness genes by genome-wide association studies is discussed.

Keywords

Gene discovery · QTL · Linkage · Mapping · SNP · Wild population

Introduction

Longitudinal studies of wild animal populations have proven invaluable for studying selection, genetic architecture and microevolution of fitness-related traits, especially since the widespread uptake of the ‘animal model’ approach to quantitative genetic parameter estimation (Kruuk 2004; Kruuk and Hill 2008; Merilä et al. 2001). Unfortunately, identifying the actual genes responsible for variation in fitness has proven more difficult, even though the statistical framework and some appropriate study populations have been available for some time (Slate 2005). To date, there are relatively few examples of quantitative trait locus (QTL) studies being conducted in unmanipulated, wild populations (Beraldi et al. 2007a, b; Slate et al. 2002) and these studies have only identified approximate locations of a limited number of QTL. However, there is now a great opportunity to synthesise gene discovery with quantitative genetic studies of wild populations due to the increasing ease (and decreasing cost) with which genomics studies can now be conducted in non-model organisms (Ellegren 2008; Ellegren and Sheldon 2008).

To date, QTL mapping studies in the wild have all been conducted by typing a suite of microsatellite markers, originally identified in closely related model organisms. Although microsatellites have a number of well-documented properties that make them excellent for molecular ecology research (Jarne and Lagoda 1996), they are not ideal for gene mapping. The main limitation of microsatellites is that typing methods are not highly automated; it is difficult to type more than about ten loci in a single reaction. Relative to other markers they are not as abundant in the genome, and marker discovery has traditionally been time-consuming, especially when large numbers of loci are required. There is a suspicion that previous mapping studies in wild populations have approached the limits of what can be realistically achieved with microsatellites and a pedigree of several hundred individuals; i.e. low marker density genome scans that yield crude estimates of QTL location and magnitude.

Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic polymorphism in most, if not all, genomes. In recent years SNPs have attracted growing interest from researchers who have recognised their potential for addressing a number of outstanding questions in evolutionary biology and ecology (Luikart et al. 2003; Morin et al. 2004; Nielsen 2005). The advantages (and disadvantages) of SNPs relative to other types of molecular marker have been reviewed elsewhere (Morin et al. 2004), and it is not the aim of this article to duplicate that material. However, a relatively new application of SNPs is as a tool for carrying out gene mapping experiments in wild vertebrate populations. The main reasons for using SNPs over (or in addition to) microsatellites in mapping studies are that (i) they can be typed on a much larger scale and (ii) they are much more abundant, meaning that any genomic location can be analysed.

Our research groups have been using SNPs for mapping experiments for approximately 5 years, and we have witnessed a dramatic change in the ways in which SNPs can be identified, genotyped and analysed. The aim of this article is to provide an overview of some of the methods, problems and pitfalls we have encountered during this period, which we hope will act as a guide to others wishing to carry out similar projects. We are mostly interested in using SNPs to map genes underlying traits under selection in unmanipulated, pedigreed vertebrate populations and refer the reader to other reviews for the underlying rationale behind this work (Slate 2005; Ellegren and Sheldon 2008; Kruuk et al. 2008). The main areas we discuss are: methods of SNP discovery, SNP typing, and analyses of SNP data in mapping studies. We also discuss the feasibility of performing genome-wide association studies in wild populations using many thousands of markers. Examples from our own research laboratories are used to compare alternative methods and approaches, but the points we address are generally applicable to other laboratories, other taxonomic groups and other evolutionary questions. In particular, the methods described are relevant to the QTL or population genomics approaches to the identification of loci involved in population divergence, reproductive isolation and speciation (sensu Rogers and Bernatchez 2005), a topic that is addressed in another paper in this volume (Butlin 2008).

Methods for SNP discovery

There are many different methods of SNP detection available to molecular ecologists studying non-model organisms. Broadly these can be divided into two categories: (i) sequencing of targeted individual genomic regions and (ii) random sequencing of genomic regions, followed by identification of segregating SNPs. The two strategies are complementary rather than competing, and we have used both approaches in the course of our research.

EPIC and related approaches

The acronym EPIC refers to Exon-Priming, Intron-Crossing primers which are used to PCR amplify intronic regions of genes (Fig. 1). The idea behind the approach is that, in the absence of sequence data for the focal organism, PCR primers can still be designed by performing sequence alignment of exonic sequences from other species in the same or related taxa. If the two primers are designed in adjacent exons, then the intron they flank can be amplified and sequenced (usually bi-directionally by capillary sequencing). This approach to SNP discovery in non-model organisms is a reliable method that can be employed in most taxa (Aitken et al. 2004; Lyons et al. 1997; Palumbi 1996). The EPIC method works best when DNA sequence data are available for protein-coding regions of genes from organisms closely related to the focal species (for recent examples see Cappuccio et al. 2006; Elfstrom et al. 2006; Morin et al. 2007; for a bioinformatics pipeline applicable to plants see Fredslund et al. 2006). There are several variants on the EPIC approach, including amplification of exonic rather than intronic sequence (Elfstrom et al. 2007; Ryynanen and Primmer 2004). The main reason to sequence exons is that nonsynonymous substitutions can be identified, and these are potentially functionally important. The main disadvantage of sequencing exons is that they tend to have lower levels of nucleotide diversity than intronic sequence, as they are typically under greater functional constraint. Thus, the decision on whether to sequence exons or introns may be determined by the initial questions that are being addressed (exonic SNPs may be preferred if hypotheses specific to that gene are being investigated; intronic SNPs may be preferred if a suite of neutral markers is required, e.g. for constructing a linkage map). One interesting variant on the EPIC strategy was employed in salmon by Ryynanen and Primmer (2006), who designed primers in intronic regions to amplify exons, i.e. IPEC primers. The reason for adopting this ‘reverse strategy’ is that salmonid genomes contain large numbers of duplicated genes, and by designing primers in less conserved introns the problem of non-specific amplification of target genes was reduced.
https://static-content.springer.com/image/art%3A10.1007%2Fs10709-008-9317-z/MediaObjects/10709_2008_9317_Fig1_HTML.gif
Fig. 1

SNP detection using EPIC primers. Note: DNA sequence data from a specified gene are aligned for two (or more) organisms related to the focal species. Here, intronic and exonic sequence has been obtained from organism 1, but only exonic sequence from organism 2. Exons that flank an intron of sequencable length (e.g. 500-800 bp) are identified, and regions of high sequence conservation are chosen for primer design. The rationale is that primers in these regions have the maximal chance of amplification success in the focal species. The intron (which is assumed to be of similar size in the focal species) is amplified by PCR and, following PCR product purification, sequenced in both directions in a small number of individuals. Putative SNPs are identified in the amplified intronic sequence and validated in a larger number of individuals
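The primer-placement step in Fig. 1 can be illustrated with a minimal sketch. It assumes two exon sequences that have already been aligned to equal length, and it simply reports windows with no mismatches or gaps as candidate primer sites. The function name, the window width and the toy sequences are hypothetical; real primer design would also consider melting temperature, GC content and the position of the intron boundary.

```python
def conserved_windows(exon_a: str, exon_b: str, width: int = 20) -> list[int]:
    """Return start positions of gap-free windows where two aligned exons are identical."""
    if len(exon_a) != len(exon_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    starts = []
    for i in range(len(exon_a) - width + 1):
        window_a = exon_a[i:i + width]
        window_b = exon_b[i:i + width]
        # A candidate primer site must be fully conserved and contain no alignment gaps
        if "-" not in window_a and window_a == window_b:
            starts.append(i)
    return starts

# Toy aligned exon fragments, invented for illustration (one mismatch at index 14)
organism_1 = "ATGGCCAAGCTGACCTTCGA"
organism_2 = "ATGGCCAAGCTGACGTTCGA"
print(conserved_windows(organism_1, organism_2, width=8))
```

Windows that overlap the single mismatch are excluded, so only the fully conserved stretch upstream of it is offered for primer design.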

The EPIC approach is now beginning to be employed specifically for gene mapping projects, both for genome scans, and studies that focus on specific genes. For genome or chromosome wide scans, genomic resources in closely related model species can be used to design primers that are approximately evenly spaced (based on their predicted locations in the related model organism). Thus, a panel of SNPs can be identified that are suitable for linkage map construction; an approach that has most notably been employed in studies of wild passerine bird populations, where the sequenced chicken genome can be used as a comparative genomics reference (Backström et al. 2006a, 2008; Hale et al. 2008).

EPIC has also been employed for designing SNPs in genes that are the focus of a candidate gene study. For example, Gratten et al. (2007) identified SNPs in five genes regarded as candidates for a coat colour polymorphism in a free-living population of Soay sheep. SNPs were initially identified in intronic regions, because it was reasoned that they would be most prevalent in introns of each candidate, and would likely be in linkage disequilibrium (LD) with causative mutations whether they were coding or regulatory. An intronic SNP in the gene Tyrosinase related protein 1 (Tyrp1) was found to be associated with coat colour, and linkage mapping of the gene confirmed it co-localised to the coat colour locus, which had been mapped to a small region of sheep chromosome 2 with a genome-wide panel of microsatellites (Beraldi et al. 2006). Subsequent sequencing of the Tyrp1 coding region identified a non-synonymous substitution in a highly conserved site, which is probably causative for the polymorphism (Gratten et al. 2007).

In summary, SNP discovery by EPIC sequencing is a reliable method that can be used to target specific gene regions. The main disadvantage of this approach is that it is a relatively laborious method as each locus has to be investigated individually.

Random sequencing

The second main approach to SNP discovery involves sequencing of random genomic fragments in a limited number of individuals, followed by SNP discovery and validation. Some success has been achieved by capillary sequencing of random clones from genomic DNA libraries. For example, Rosenblum et al. (2007) identified 158 SNPs in a lizard species while Lin et al. (2007) used a similar approach to identify >40 SNPs in a bird species. A related approach was adopted by Adams et al. (2006) who identified SNPs in clones that were originally sequenced as part of a microsatellite library construction.

One method with the potential to rapidly generate large numbers of SNPs is to examine existing expressed sequence tag (EST) databases for putative SNPs. Provided sufficient numbers of ESTs are available for redundant sequences to be aligned, it is possible to identify SNPs in silico using a number of different computer programs such as PolyBayes (Marth et al. 1999), AutoSNP (Barker et al. 2003), SNPDetector (Zhang et al. 2005), PolyScan (Chen et al. 2007) and QualitySNP (Tang et al. 2006). This approach to SNP discovery has been used in humans (Irizarry et al. 2000), model organisms (Fahrenkrug et al. 2002; Schmid et al. 2003; Stone et al. 2002) and more recently in non-model systems such as poplar leaf rust (Feau et al. 2007) and Bicyclus butterflies (Beldade et al. 2006).

EST libraries are currently unavailable for most of the species that are the focus of pedigree-based longitudinal population studies. However, the advent of ultrahigh-throughput sequencing technologies, such as 454 pyrosequencing (Hudson 2008; Mardis 2008; Margulies et al. 2005) makes SNP discovery feasible in any species. Here, the idea is that many genomic regions can be sequenced at very high coverage, usually by outsourcing to a service provider. Following contig assembly, it is then possible to identify SNPs from overlapping sequences (Fig. 2). The high-throughput sequencing method can be carried out on complementary DNA (cDNA) synthesised from messenger RNA by reverse transcriptase, in which case SNPs within the transcriptome will be identified (Vera et al. 2008), or from genomic DNA such that SNPs from all over the genome will be reported. The key to identifying SNPs is that several individuals must be sequenced with high sequence redundancy to detect segregating sites. For example, if we assume that a single run of a 454 sequencer can generate 400,000 sequences of 250 bp, and that an organism has a genome of 2 Gbp of which 100 Mbp is transcribed, then 10 runs would produce 10-fold coverage of the transcriptome, but 200 runs would be required to achieve similar coverage of the genome. At current costs of ~€12k per run, 10 runs is within the budget of a medium-large research grant, but 200 runs is probably not. That said, sequencing costs continue to fall very rapidly, and other technologies have even greater throughput than the 454 GS-FLX system, and offer great potential for SNP discovery if reference genomes are available (e.g. Van Tassell et al. 2008). We have used 454 sequencing of cDNA and the QualitySNP pipeline (Tang et al. 2006) to identify several thousand SNPs in the zebra finch transcriptome (Stapley et al. 2008). Assays were designed for 1536 SNPs that were detected in silico, of which 1298 (84.5%) were confirmed, indicating that high conversion rates of putative SNP to scoreable segregating SNPs can be achieved. More importantly, this approach yields large numbers of SNPs rapidly and (per SNP) cheaply. Methods for SNP detection from 454 sequence data are still being refined as the short reads present a challenge to software designed for SNP detection from EST databases, especially if no reference genome assembly is available. However, software for detecting SNPs from short sequence reads is now appearing (e.g. Quinlan et al. 2008).
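The coverage arithmetic in the example above can be reproduced with a short calculation. The sketch below uses only the figures quoted in the text (400,000 reads of 250 bp per run, a 2 Gbp genome, a 100 Mbp transcriptome and roughly €12,000 per run). The function name is hypothetical, and the calculation assumes that reads are spread evenly over the target, which real libraries only approximate.

```python
def runs_for_coverage(target_bp: int, fold: int,
                      reads_per_run: int = 400_000, read_len_bp: int = 250) -> float:
    """Number of sequencing runs needed for a given average fold-coverage of a target."""
    bases_per_run = reads_per_run * read_len_bp  # 100 Mbp per run with the figures above
    return fold * target_bp / bases_per_run

COST_PER_RUN_EUR = 12_000  # approximate figure quoted in the text

transcriptome_runs = runs_for_coverage(100_000_000, fold=10)
genome_runs = runs_for_coverage(2_000_000_000, fold=10)

for label, runs in (("transcriptome", transcriptome_runs), ("genome", genome_runs)):
    print(f"{label}: {runs:.0f} runs, ~EUR {runs * COST_PER_RUN_EUR:,.0f}")
```

Changing `reads_per_run` or `read_len_bp` shows how quickly the genome-wide figure falls as platform throughput increases.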
https://static-content.springer.com/image/art%3A10.1007%2Fs10709-008-9317-z/MediaObjects/10709_2008_9317_Fig2_HTML.gif
Fig. 2

SNP discovery process with 454 sequence data. Note: Many thousands of short sequence reads are obtained from a small number of individuals. Overlapping sequences are identified and aligned computationally, to build contigs. Redundant contigs are then compared and aligned, and segregating SNPs are identified
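The final step in Fig. 2, identifying segregating sites among aligned reads, can be sketched as follows. This is a toy illustration rather than the algorithm used by QualitySNP or any other named program: it assumes the reads are already aligned and padded with "." where they do not cover a position, and it calls a SNP when a column is deep enough and the second most common base is seen at least twice. The thresholds and reads are invented.

```python
from collections import Counter

def call_snps(aligned_reads: list[str], min_depth: int = 4, min_minor: int = 2) -> dict:
    """Return {position: (major_base, minor_base)} for columns that look polymorphic."""
    snps = {}
    for pos in range(len(aligned_reads[0])):
        # Collect the bases covering this position, ignoring padding
        column = [read[pos] for read in aligned_reads if read[pos] in "ACGT"]
        counts = Counter(column).most_common()
        # Require enough depth and a minor allele seen more than once,
        # so that a single sequencing error is not reported as a SNP
        if len(column) >= min_depth and len(counts) >= 2 and counts[1][1] >= min_minor:
            snps[pos] = (counts[0][0], counts[1][0])
    return snps

# Toy contig of five overlapping reads; position 3 carries both T and A
reads = [
    "ACGTAC....",
    "ACGTACGT..",
    "..GAACGTAC",
    "..GAACGTAC",
    "ACGTACGTAC",
]
print(call_snps(reads))
```

Requiring the minor allele in at least two reads is the reason several individuals must be sequenced with high redundancy: low-depth columns cannot distinguish a segregating site from an error.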

A comparison of the advantages and disadvantages of the alternative approaches to SNP detection is outlined in Table 1. We provide qualitative rather than actual estimates as the relative costs of EPIC-based methods tend to vary between laboratories, depending on the infrastructure. Furthermore, prices of high-throughput sequencing are changing so rapidly that any figure reported here will be redundant in the near future. Importantly, although high-throughput methods offer many advantages, there are some scenarios for which EPIC type approaches are preferable, most obviously when a small number of specific genes are under investigation. Therefore, it remains useful to retain the capacity to identify SNPs even if the growing trend of outsourcing large-scale laboratory work to service providers continues to gain momentum.
Table 1

Relative comparison of SNP discovery methods

|                    | EPIC               | Existing EST databases | 454              |
|--------------------|--------------------|------------------------|------------------|
| Scale              | Small              | Variable               | Large            |
| Cost per SNP       | High               | Low (zero)             | Low              |
| Initial outlay     | Low                | Low                    | High             |
| Outsourceable      | No                 | No                     | Yes              |
| Intronic or exonic | Intronic (usually) | Exonic                 | Exonic (usually) |
| Targeted regions   | Yes                | No                     | Not usually (a)  |

Cost, scale and outlay are relative and assume that SNP discovery will be carried out in-house. In silico SNP detection from EST databases is included, although this approach assumes an EST database for the focal organism already exists. In reality, this is unlikely for most species with pedigreed wild populations

(a) It is possible to pool PCR products from many targeted EPIC regions and sequence them on a single 454 run

Methods for SNP typing

In much the same way that SNP detection methods vary depending on the scale of the project, there are a large number of alternative SNP typing strategies, the relative suitability of which depends on the number of loci and individuals that require typing.

Methods used in our group

We have used three different methods of typing SNPs. Their relative merits are summarised in Table 2. We are likely to continue with two of these into the medium-term future, but one we have abandoned on grounds of assay complexity and relative cost. The main considerations when planning a SNP typing experiment are the number of loci and individuals that need to be typed. One then needs to compare the relative merits of outsourcing the work or performing it in-house.
Table 2

Comparison of three SNP typing methods

|                            | SNP-SCALE     | SNPlex                                | Illumina                       |
|----------------------------|---------------|---------------------------------------|--------------------------------|
| Min loci                   | 1             | 24                                    | 384                            |
| Max loci                   | ~30           | 48                                    | 1536                           |
| Expertise required         | Modest        | Considerable                          | None                           |
| Equipment required         | Sequencer     | Sequencer, robotics, two laboratories | Beadstation, unless outsourced |
| In-house or outsourced (a) | In-house      | In-house                              | Outsourced                     |
| Total cost                 | Low           | High                                  | High                           |
| Cost per locus             | Medium        | Medium                                | Low                            |
| Comments                   | Very flexible | Technically tricky                    | Very rapid, excellent service  |

(a) It is possible to outsource SNPlex or to perform Illumina GoldenGate typing in-house, but these two platforms are most often regarded as in-house and outsourced methods respectively

For in-house SNP genotyping we have developed an allele-specific PCR-based method termed SNP-SCALE (Hinten et al. 2007) that uses locked nucleic acids (LNAs) at the 3’-SNP positions of primers to enhance allele specificity. This method does not require specialist equipment (we use the same capillary sequencer set-up for screening microsatellites, AFLPs, SNPs and DNA sequencing) and it is flexible in terms of the number of loci and individuals that can be typed. The SNP-SCALE method has recently been refined and extended such that multiplexing of 25–30 loci is now possible (Kenta et al. 2008).

An alternative medium-throughput typing technology is the Applied Biosystems SNPlex system (Tobler et al. 2005). SNPlex reactions involve two steps: allele-specific oligonucleotide ligation (OLA), followed by PCR. Genotypes are resolved by electrophoresis, and up to 48 loci can be multiplexed. We attempted to type up to 64 putative SNPs in Soay sheep with SNPlex (Table 3). It was possible to design assays for 59 (92%) loci, of which 47 were assayed and 31 (66%) could be reliably scored. Thus, around 60% of loci identified in silico could be converted to useful genotype data, although the manufacturers claim conversion rates in excess of 80%. Of course, one should expect conversion rates to be lower in wild populations than in model organisms, although the considerable genomic resources for closely related domestic sheep and cattle mean that conversion rates for Soay sheep may be higher than for most non-model organisms. We also found SNPlex to be a technically difficult method. It is recommended that the OLA and PCR steps are performed in different laboratories to avoid typing error/contamination, and the method is manually demanding without liquid-handling robotics. We also found that SNPlex performed poorly on our more degraded or low concentration samples, which may be a consideration to other researchers collecting samples in the field.
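The overall conversion rate quoted above is the product of the two stage-specific rates, because not every designed assay was run. The short calculation below uses only the counts given in the text; the variable names are ours.

```python
putative, designed, assayed, scored = 64, 59, 47, 31  # counts quoted in the text

design_rate = designed / putative   # fraction of putative SNPs with a designable assay
scoring_rate = scored / assayed     # fraction of assayed loci that were reliably scored
overall_rate = design_rate * scoring_rate

print(f"design {design_rate:.0%}, scoring {scoring_rate:.0%}, overall {overall_rate:.0%}")
```

Multiplying the stage rates, rather than dividing 31 by 64, avoids penalising the method for the 12 designed assays that were never run.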
Table 3

Effect on IBD coefficient estimation of adding SNPs to a microsatellite map

| Position (a) | Dataset        | IBD mean (b) | IBD variance | Increase in variance relative to msats (%) | Proportion of theoretical maximum variance (%) (c) |
|--------------|----------------|--------------|--------------|--------------------------------------------|----------------------------------------------------|
| 1 (224.5 cM) | Microsats only | 0.254        | 0.027        |                                            |                                                    |
|              | All loci       | 0.255        | 0.039        | 42.6                                       | 62.6                                               |
| 2 (249.5 cM) | Microsats only | 0.253        | 0.036        |                                            |                                                    |
|              | All loci       | 0.257        | 0.052        | 45.4                                       | 83.7                                               |
| 3 (279.5 cM) | Microsats only | 0.257        | 0.034        |                                            |                                                    |
|              | All loci       | 0.256        | 0.047        | 37.3                                       | 74.9                                               |

(a) Data are provided from three positions on Soay sheep chromosome 3 (see Fig. 3)

(b) Mean and variance of IBD coefficients between half-sibs in the Soay sheep mapping panel are reported

(c) The theoretical maximum variance is 0.0625. Adding SNPs to the map improves IBD estimation, and therefore power to detect QTL

The third approach we have taken for SNP genotyping is to outsource genotyping to a service provider (in our case the GoldenGate platform provided by Illumina). This system, which requires specialist equipment (a ‘beadstation’), is cost-effective and rapid provided large numbers of SNPs (384 or more) are typed. GoldenGate uses allele-specific extension followed by PCR to assay SNPs. PCR products are bound to beads on a Sentrix® microarray which is then read by the beadstation. We have used this approach to type 1536 zebra finch SNPs identified in silico, of which 1298 (85%) could be typed. There was a 97% call rate among genotyped loci, 100% reproducibility and 99.8% Mendelian inheritance consistency (Stapley et al. 2008). These figures are slightly lower than the manufacturer’s advertised benchmarks (93% conversion rate and 99.9% call rate), although some DNA samples in our mapping panel came from material known to be of low quality and/or quantity. Generally, one should expect that DNA obtained from natural populations will often be of lower quality than is typically used in studies from model organisms or human subjects, because sampling may have taken place in difficult conditions, there may have been a delay between sampling and extraction, DNA may have been archived in freezers for considerable periods and small amounts of material may have been sampled. Although it would be facile to make a direct comparison between our SNPlex and GoldenGate data, because different samples and loci were compared, we believe the data obtained with GoldenGate were of higher quality.

There are alternative methods for typing the numbers of loci that might be required for mapping studies. We do not have experience with these alternatives, so do not discuss them further. However, popular medium-throughput methods, many of which are offered by service providers, include the Beckman SNPStream (12 or 48-plex) platform (Bell et al. 2002) and the Sequenom iPLEX (up to 40-plex) assay, performed on the MassArray platform (Buetow et al. 2001).

In summary, we find SNP-SCALE to be an excellent method when small-medium numbers of SNPs need to be typed (often in a large number of individuals), while outsourcing to providers of the GoldenGate platform works better for larger numbers of SNPs. Typically, large numbers (hundreds) of SNPs might be typed when performing an initial linkage scan, while more modest numbers might be typed when performing association studies on a more limited number of genomic regions; see for example Gratten et al. (2007).

Analytical issues

Prior to the collection of SNP genotype data there are a number of analytical questions that need to be addressed. How many SNPs are required to build genetic linkage maps (given the size of a mapping panel, the size of a genome and the marker density required)? How does one use SNP data to detect QTL by linkage mapping? How does one use SNP data to examine whether variation at candidate genes explains trait variation? Can adding SNPs to a map improve the power and resolution of QTL detection? In this last section we consider these questions, using empirical examples wherever possible. However, it must be remembered that SNPs have only been applied to mapping projects in a handful of natural populations to date, and further data are required before general conclusions can be reached.

Building linkage maps with SNPs

To date most linkage maps of wild populations have been constructed using microsatellites (Beraldi et al. 2006; Hansson et al. 2005; Slate et al. 2002), as their high levels of genetic variability mean they are informative about whether recombination has (or has not) occurred between markers during meiosis. Of course, SNPs are less variable and therefore a larger number of loci are required to construct linkage maps. However, simulation studies show that SNPs at 2 cM (and probably larger) intervals are able to produce robust and accurate linkage maps in typical pedigrees of wild populations (Slate 2008). Maps built entirely from SNPs have been used to map the Z chromosome of the collared flycatcher Ficedula albicollis (Backström et al. 2006a), while maps combining microsatellites and SNPs have been used to study the homologue of chicken chromosome 7 in various passerine birds (Hale et al. 2008). Although a higher density of SNPs than microsatellites is required to map genomes, this constraint is unlikely to be a problem as high-throughput typing becomes the norm. Furthermore, newer SNP typing technologies have error rates that are considerably lower than those of microsatellites (indeed, some SNP platforms have error rates lower than the mutation rate of some microsatellites), and so map error or map inflation due to typing error are likely to be less problematic than is the case for microsatellites (Slate 2008). Resolution of map errors caused by genotyping error can be a frustrating and time-consuming process. Prior to performing SNP-based map construction it is certainly worth performing simulation studies to ensure that marker density will be sufficiently high to detect linkage between syntenic markers. For example, marker data segregating in a mapping panel can be simulated with predetermined variability and chromosomal positions, with software such as SimPed (Leal et al. 2005).
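Before running a full simulation, a back-of-envelope calculation can indicate whether a planned marker spacing is likely to be adequate. The sketch below is not a substitute for pedigree simulation with SimPed and is not drawn from the studies cited above. It assumes Haldane's mapping function (no crossover interference), fully informative phase-known meioses, and a LOD threshold of 3, all of which are our assumptions. Because biallelic SNPs are often uninformative in a given parent, the result is best read as a lower bound on the number of meioses needed.

```python
import math

def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM under Haldane's mapping function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def elod_per_meiosis(theta: float) -> float:
    """Expected LOD contributed by one fully informative, phase-known meiosis."""
    if not 0.0 < theta < 0.5:
        raise ValueError("theta must lie strictly between 0 and 0.5")
    return (1 - theta) * math.log10(2 * (1 - theta)) + theta * math.log10(2 * theta)

def meioses_for_lod(distance_cm: float, lod_threshold: float = 3.0) -> int:
    """Informative meioses needed for the expected LOD to reach the threshold."""
    return math.ceil(lod_threshold / elod_per_meiosis(haldane_theta(distance_cm)))

for spacing in (2, 5, 10, 20):
    print(f"{spacing:>2} cM spacing: at least {meioses_for_lod(spacing)} informative meioses")
```

The required number of meioses rises steeply with spacing, which is one way to see why less informative markers need to be typed at higher density.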

Does the addition of SNPs to microsatellite maps help detect QTL?

QTL mapping studies conducted in natural populations to date (Beraldi et al. 2007a, b; Slate et al. 2002) have used evenly spaced microsatellite markers, typically at low density (e.g. 15 cM intervals), and then employed a two-step variance component approach to QTL detection (George et al. 2000; Slate 2005). The first stage in this process is to estimate the proportion of alleles that are identical-by-descent (IBD) at every genomic location that is to be tested for a QTL, e.g. at 2 cM intervals. This means that IBD coefficients are estimated from markers that may be some distance (5–10 cM) from the test location. The second step is to fit the IBD matrix as a random effect in an ‘animal model’ (a form of linear mixed model widely used in quantitative genetics). Studies to date have reported QTL of marginal genomewide significance (Beraldi et al. 2007b; Slate et al. 2002). In principle, typing additional markers in a region can improve IBD estimates at putative QTL locations. These improved estimates should enhance the power to detect (or disprove) QTL, as well as providing more accurate estimates of QTL position and magnitude.

We have examined whether the typing of additional SNPs improves the accuracy of IBD estimation in general pedigrees. A linkage mapping study in the St Kilda population of Soay sheep has been conducted using a panel of 250 microsatellites (Beraldi et al. 2007a, b). Here, we examine the mean and variance of IBD coefficients at three genomic regions of sheep chromosome 3, with and without the addition of SNP markers (Fig. 3). Within the Soay sheep mapping panel most of the power to detect QTL comes from half-sibs as full-sibs are rare in this population. At any given location half-sibs are expected to have an IBD coefficient of either 0 or 0.5 (expected mean = 0.25), with a variance of 0.0625 (Almasy and Blangero 1998). However, when marker information is imperfect, IBD estimates will not be as low as 0 or as high as 0.5, but instead will be closer to the mean value of 0.25, expected when there is no marker information (Visscher and Hopper 2001). Therefore, the variance will be reduced relative to the theoretical maximum. By comparing the mean and variance of IBD coefficients between half-sibs with and without the addition of SNPs it is possible to measure the extent to which additional markers enhance IBD coefficient estimation (Table 3). Adding a modest number of SNPs increased the IBD coefficient variance by 37–45%, with the three locations yielding variances between 63 and 83% of the theoretical maximum. In this population the addition of SNPs in targeted locations, once a low marker density genome scan has identified putative QTL, appears to be a useful strategy. Note that this is in contrast to other QTL mapping strategies such as interval mapping in backcross or F2 populations created from divergent lines, where marker spacing less than 10 cM makes little difference to power (Darvasi et al. 1993; Piepho 2000).
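The theoretical maximum of 0.0625 follows directly from the two equally likely IBD states for half-sibs, and the proportion of that maximum achieved can be recomputed from the variances in Table 3. In the sketch below the dictionary of variances is copied from Table 3. Because those entries are rounded to three decimal places, the recomputed percentages agree with the published ones only to within rounding.

```python
from statistics import mean, pvariance

# With perfect marker information a half-sib pair shares 0 or 0.5 of alleles IBD,
# each with probability one half
perfect_ibd = [0.0, 0.5]
theoretical_mean = mean(perfect_ibd)     # 0.25
theoretical_var = pvariance(perfect_ibd) # 0.0625

# IBD variances from Table 3: position (cM) -> (microsatellites only, all loci)
table3_variance = {224.5: (0.027, 0.039), 249.5: (0.036, 0.052), 279.5: (0.034, 0.047)}

for position, (msat_var, all_var) in table3_variance.items():
    gain = 100 * (all_var / msat_var - 1)
    share = 100 * all_var / theoretical_var
    print(f"{position} cM: variance up ~{gain:.0f}%, ~{share:.0f}% of theoretical maximum")
```

The closer the observed variance is to 0.0625, the more sharply the markers distinguish pairs that share the allele from pairs that do not, which is what drives power in the variance component analysis.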
https://static-content.springer.com/image/art%3A10.1007%2Fs10709-008-9317-z/MediaObjects/10709_2008_9317_Fig3_HTML.gif
Fig. 3

Markers on Soay sheep chromosome 3. Note: Microsatellite markers on the original (Beraldi et al. 2006) map are in plain font. SNPs subsequently added are in bold, underlined font. The mean and variance of IBD coefficient estimates were estimated at three positions (see Table 3)

How to model SNPs in mapping studies?

Linkage mapping in natural populations by the two-step variance components method outlined above involves fitting the estimated IBD matrix as a random effect in a mixed effects linear model (George et al. 2000; Slate 2005). The variance component associated with the IBD matrix gives an estimate of QTL magnitude, and its statistical significance is assessed by likelihood ratio tests (by making a comparison to a model with the QTL random effect excluded). Several points are perhaps not immediately obvious until this type of analysis has been performed. First, QTL effects are only reported as a proportion of trait variation explained; mean trait values cannot be assigned to individual alleles or genotypes. Second, only additive genetic effects at the QTL are estimated; this is in contrast to least squares linear regression or maximum likelihood approaches used in F2 crosses, where additive and dominance effects can be estimated (Haley and Knott 1992; Haley et al. 1994).
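The likelihood ratio test described above can be sketched in a few lines once the two models have been fitted in mixed-model software. The log-likelihood values below are invented, and the function name is hypothetical. The p-value is taken from a chi-square distribution with one degree of freedom, which is an assumption on our part: because a variance component cannot be negative, many analyses instead use a 50:50 mixture of chi-square distributions with zero and one degrees of freedom, which halves this p-value.

```python
import math

def qtl_likelihood_ratio(loglik_with_qtl: float, loglik_without_qtl: float):
    """Return (LRT statistic, chi-square 1 df p-value, LOD) for a QTL variance component."""
    # Twice the difference in log-likelihood between nested models, floored at zero
    statistic = max(0.0, 2.0 * (loglik_with_qtl - loglik_without_qtl))
    # Survival function of chi-square with 1 df, written with the complementary error function
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    # LOD is the same comparison expressed on a log10 scale
    lod = statistic / (2.0 * math.log(10))
    return statistic, p_value, lod

# Invented log-likelihoods for a polygenic-only model and a polygenic + QTL model
stat, p, lod = qtl_likelihood_ratio(loglik_with_qtl=-1000.00, loglik_without_qtl=-1004.50)
print(f"LRT = {stat:.2f}, p = {p:.4f}, LOD = {lod:.2f}")
```

Note that this gives a point-wise p-value at one test location; genome-wide significance requires a stricter threshold to account for the many locations tested.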

An alternative approach to detecting QTL with SNPs is to fit a SNP genotype as a fixed effect in a linear model (or in a mixed effects ‘animal model’ where polygenic effects are accounted for as a random effect). Although this approach is intuitively appealing (as the mean value of each genotype can be evaluated) it is highly prone to Type 1 error as population stratification can yield false associations between genotype and phenotype. One scenario where this approach may be justified is when a handful of candidate genes are being evaluated for linkage to a single locus trait, and associations can be tested by Fisher’s Exact Test or other contingency table type tests. However, it is still preferable to confirm putative associations by linkage analysis (e.g. Gratten et al. 2007).
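For a single-locus trait and a candidate SNP, the contingency-table test mentioned above can be computed with nothing more than the standard library. The sketch below implements a two-sided Fisher's Exact Test for a 2 × 2 table by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The counts are invented for illustration, and the test treats individuals as independent, so it does not protect against the population stratification and relatedness discussed above.

```python
from math import comb

def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with margins fixed
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_observed = table_prob(a)
    lowest, highest = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum over all tables at least as extreme (no more probable) than the observed one
    return sum(p for p in map(table_prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Invented counts: rows = carries / lacks the SNP allele, columns = dark / light coat
p_value = fisher_exact_2x2(12, 3, 4, 11)
print(f"two-sided p = {p_value:.4f}")
```

A small p-value here is a reason to follow up with linkage analysis, not a replacement for it.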

One feature of linkage analysis is that test locations need not be particularly close (>1 cM) to a causative mutation, yet they are still able to detect an association to a trait of interest. This is because linkage analysis is sensitive to recombination events between the marker and the causative locus within the mapping pedigree members only, while association studies are sensitive to historical recombination events that pre-date the pedigree. Although this means that low density marker coverage can detect linkage, it also means the confidence interval surrounding a QTL is wide. This problem can be remedied by typing additional SNPs around a candidate region and then performing association studies. This approach has rarely been taken in natural populations, although one exception is reported by Gratten et al. (2008). By performing transmission disequilibrium tests or TDTs (Hernandez-Sanchez et al. 2003), Gratten et al. simultaneously tested for linkage and linkage disequilibrium between a SNP and a locus (or loci) affecting both body size and lifetime fitness in Soay sheep. By testing for linkage as well as LD the problem of false positives in association mapping studies is remedied. Current methods for performing TDTs in general pedigrees are computationally quite demanding, although methods suited to this type of analysis continue to be refined (Chen and Abecasis 2006, 2007).
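The logic of a transmission disequilibrium test can be illustrated with its simplest form, the classic test for a biallelic marker and a binary trait in parent-offspring trios. This is not the quantitative trait, general pedigree method used in the studies cited above; it is a simplified sketch with hypothetical counts and function names. Under the null hypothesis of no linkage or no LD, a heterozygous parent transmits either allele with probability one half.

```python
import math

def tdt(transmitted, not_transmitted):
    """Classic TDT for one allele, counted over heterozygous parents only.

    transmitted:     times the allele was passed to an affected offspring
    not_transmitted: times the other allele was passed instead
    Returns the chi-square statistic (1 df) and its P-value.
    """
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

if __name__ == "__main__":
    # Hypothetical counts from 100 heterozygous parents
    print(tdt(60, 40))
```

Because only transmissions within families are counted, an excess of one allele cannot arise from population stratification alone, which is why the test guards against the false positives described above.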

The future—whole genome association analyses in wild populations

Future methods

The advent of ultra-high throughput sequencing means that it will be possible to discover tens of thousands of SNPs in ecological organisms. In principle, and depending on the extent of linkage disequilibrium in the genome, it would then be possible to type many thousands of SNPs in a large enough number of individuals to perform whole-genome association mapping without first conducting linkage mapping. Several platforms now exist for typing thousands of SNPs on a chip (Gunderson et al. 2005; Hardenbol et al. 2005; Syvanen 2001). Although these have mostly been developed for humans (e.g. the Affymetrix GeneChip® 500K array set, the Illumina Human1M BeadChip) or model organisms (e.g. Illumina’s canine SNP20 and bovine SNP50 BeadChips with >20,000 and >50,000 SNPs respectively), the technology can be used in any organism. Excitingly, both of the main providers of SNP chips offer the opportunity to develop customised panels (up to 60,000 SNPs on the Illumina Infinium iSelect platform, and up to 10,000 SNPs per kit, with the opportunity for construction of multiple kits, on the Affymetrix GeneChip® system). At present, the idea of typing tens or even hundreds of thousands of SNPs in wild populations may seem fanciful, but studies of this kind will shortly be upon us. For example, a 60K domestic sheep SNP chip will be available in 2008, and preliminary data suggest that two thirds of the SNPs will be segregating in a wild Soay sheep population, which will likely be typed on this platform shortly.

How many SNPs for genome-wide association mapping?

When studies of linkage disequilibrium were first carried out in humans it soon became apparent that regions of high linkage disequilibrium (haploblocks) were prevalent throughout the genome (Goldstein 2001; Reich et al. 2001; Stephens et al. 2001; Weiss and Clark 2002); these blocks are usually separated by recombination hotspots. Typing many SNPs from the same haploblock is redundant for genome-wide association scans, and so a better strategy is to type the minimal number of SNPs that describe the main haplotypes within each block (so-called tagSNPs). Considerable effort has been made to optimise strategies for tagSNP selection in humans (Carlson et al. 2004; Zhang et al. 2002), and the larger (250–500K) SNP chip arrays have sufficient power to identify disease-causing variants by association mapping (Docherty et al. 2007). Work is underway to estimate LD in other organisms (Aerts et al. 2007; Heifetz et al. 2005; Morrell et al. 2005; Nordborg et al. 2002; Nsengimana et al. 2004; Remington et al. 2001; Sutter et al. 2004), which can then be used to estimate how many tagSNPs are required for genome-wide association mapping. For example, two estimates from dairy cattle suggest that just 30–100K SNPs will suffice (Khatkar et al. 2007; McKay et al. 2007), mainly because LD extends long distances in cattle (Farnir et al. 2000).
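The idea of removing redundancy among SNPs in strong LD can be illustrated with a simple greedy procedure: compute pairwise r² between SNPs, then repeatedly pick the SNP that captures the most SNPs not yet tagged. This is a minimal sketch in the spirit of r²-based tagging, not a reimplementation of any published algorithm. It assumes phased haplotypes coded 0/1, an r² threshold of 0.8, and hypothetical SNP names and data.

```python
from itertools import combinations

def r_squared(a, b):
    """r^2 between two biallelic loci from phased haplotypes coded 0/1."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic locus carries no LD information
    d = pab - pa * pb  # coefficient of linkage disequilibrium, D
    return d * d / denom

def greedy_tag_snps(haplotypes, threshold=0.8):
    """haplotypes maps SNP name -> list of 0/1 alleles across chromosomes."""
    names = list(haplotypes)
    # Each SNP captures itself plus every SNP with r^2 >= threshold to it
    captures = {s: {s} for s in names}
    for s, t in combinations(names, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            captures[s].add(t)
            captures[t].add(s)
    untagged, tags = set(names), []
    while untagged:
        # Pick the SNP that captures the most still-untagged SNPs
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags

if __name__ == "__main__":
    # Hypothetical data: snp1 and snp2 are in perfect LD, snp3 is not
    haps = {
        "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
        "snp2": [0, 0, 1, 1, 0, 1, 0, 1],
        "snp3": [0, 1, 0, 1, 0, 1, 0, 1],
    }
    print(greedy_tag_snps(haps))
```

The number of tags returned falls as LD extends over longer distances, which is the reason relatively few SNPs are expected to suffice in populations such as dairy cattle.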

If researchers studying wild populations are to conduct whole genome association studies using SNP chips then a first step is to measure the extent of linkage disequilibrium in the genomes of wild populations. Studies of this type are in their infancy (Backström et al. 2006b; Slate and Pemberton 2007), but are essential to evaluate how many tagSNPs are required to perform genome-wide association scans. The cost of studies of this type can be substantially reduced if pooling of individuals from extremes of a trait distribution can be performed and SNP allele frequencies estimated from the pools (Macgregor et al. 2008). One attractive feature of association studies is that pedigrees are not necessary, so potentially a larger number of wild populations will be amenable to this type of analysis.

Concluding remarks

SNPs are now being used for a number of different applications in molecular ecology research, including gene mapping. Advances in DNA sequencing and typing technologies mean that mapping studies are now feasible in any non-model organism for which adequate phenotypic or life history data are available. Indeed, the biggest challenge in gene mapping studies in the wild is the painstaking collection of field data, as there are no technology-driven shortcuts to this component of the work. In the next 5 years we expect to see more mapping projects being carried out in pedigreed wild populations, although we caution that moving from QTL detection to identification of the actual underlying gene or mutation will be very difficult. Therefore, researchers should carefully consider what they want to get from a mapping project before embarking on one. Simple detection of a QTL and reporting of its magnitude may not reveal much about fitness variation or microevolution in the wild. However, mapping does have the potential to build on the quantitative genetic studies conducted to date, including yielding a greater understanding of the architecture of genetic correlations and gene by environment interaction. Furthermore, if causative SNPs (or SNPs in near-perfect LD with a causative SNP) can be found, it will be possible to combine population genetic and quantitative genetic approaches to studying fitness variation, such that selection on underlying genotypes can be identified sensu Gratten et al. (2008). These are exciting times for researchers studying the genetics of wild populations, and we eagerly await the findings of further mapping projects.

Acknowledgements

This article was prepared for a workshop on Ecological Genomics that was organised by Jacob Höglund and Gernot Segelbacher, and funded by the European Science Foundation (ESF). The authors have benefitted from insightful discussion on this and related topics with Terry Burke, Peter Visscher, Gavin Hinten and Allan McRae. Peter Visscher made the suggestion to study the variance in halfsib IBD coefficients as an indicator of marker informativeness.

Copyright information

© Springer Science+Business Media B.V. 2008