Introduction

Forest tree populations consist of sessile, long-lived organisms which must survive temporally varying environmental conditions that are presently also affected by accelerated global climate change. Hence, the presence and maintenance of genetic variation at genes controlling adaptive traits is important for the long-term persistence and stability of forest tree populations in order to survive heterogeneous conditions.

Genetic markers are ultimately based on the variation of DNA sequences. However, the sequences of the currently most commonly used genetic markers in beech (Fagus sylvatica L.) are not directly observed and are usually unknown. Only particular aspects of the variation are investigated within PCR-amplified DNA fragments, such as the number of tandem repeats in microsatellite motifs (e.g., Pastorelli et al. 2003) or the absence or presence of restriction sites in amplified fragment length polymorphisms (AFLPs; e.g., Gailing and von Wühlisch 2004). The amplified genomic regions are usually either unknown, as in anonymous AFLPs, or located in non-coding regions of the DNA (most microsatellites). Accordingly, most of the variation at molecular DNA-based markers is assumed to be selectively neutral. In addition, the accurate scoring of microsatellite markers (SSRs) and AFLPs can be difficult due to PCR and electrophoresis artifacts, and the comparability between different laboratories is problematic. Although different SSR loci can be multiplexed for a higher throughput, the multiplex process is complex, not always successful, and limited to a low number of loci.

Isozymes are also important markers. These biochemical markers can be used to assess genetic diversity at gene loci coding for enzymes, which serve important functions in the metabolism of plants. For beech, the genetic diversity has been studied extensively at selected isozyme gene loci (e.g., Müller-Starck and Ziehe 1991; Müller-Starck and Starke 1993). However, the analysis of isozymes captures only a fraction of the underlying sequence variation, and only a few gene loci coding for selected soluble enzymes can be investigated by means of enzyme electrophoresis. Furthermore, it is questionable whether isozymes are suitable to detect adaptive variation or whether most of the markers are neutral (e.g., Eriksson 1998 and references therein).

Comparative sequencing is the ultimate method to detect variation within any DNA fragment. Today, it is already possible to analyze and to compare whole genomes of organisms using high-throughput sequencing technologies, also called next-generation sequencing. The most frequently used techniques at the moment are 454 (Roche), SOLiD (Life Technologies), and Illumina (e.g., reviewed by Glenn 2011, Deschamps and Campbell 2010). However, this method is still too expensive to analyze a sufficient number of individuals for population genetic studies or the study of adaptation in natural populations. For non-model organisms like trees, where most of the genomes are not sequenced yet, next-generation sequencing is not an established technique. Considering these limitations, the most promising markers for the study of adaptation at the moment are SNPs (single nucleotide polymorphisms), that is, substitutions of a single nucleobase. SNPs are the most frequent variations found in DNA (Brookes 1999), and the analysis is not restricted to special enzymes. SNP markers can, unlike isozymes, also be used to analyze regions controlling the transcription of genes, for example, transcription factor binding sites. In comparison to SSRs and AFLPs, they are valuable markers to study adaptation of plants, for example, to changing environmental conditions (Gailing et al. 2009). For humans and plant model organisms, this type of marker is already established and often used (e.g., Populus tremula, Ingvarsson 2004). In addition, SNP markers are now increasingly applied to non-model organisms like most of the forest trees (e.g., Seeb et al. 2011, Helyar et al. 2011).

Unfortunately, SNP analyses in human populations revealed that only a few SNPs can be associated with phenotypic traits (Yoshiura et al. 2006). Some of these SNPs with a direct impact on phenotypes are likely to be under selection, while the vast majority of SNPs are likely to behave as selectively neutral. However, besides the study of adaptation, SNPs in non-coding regions can also be used instead of or in addition to other neutral markers. The analysis of an unprecedented number of mostly selectively neutral SNP loci allows new insights into the population genetic structure of species that cannot be obtained with other genetic markers. For example, the observation of more than 500,000 SNPs in over 3,000 Europeans revealed overall genetic differentiation patterns among humans closely resembling their spatial distribution on the continent (Novembre et al. 2008). Furthermore, compared with AFLP and microsatellite markers, SNP markers have some important advantages. The scoring is unambiguous and comparable between laboratories, even if different platforms are used for the analysis. Jones et al. (2007) compared SSR and SNP markers in maize and concluded that SNP markers have a lower level of missing data and are more reliable. For the analysis of SNPs, multiplexing can be conducted easily, and thus, the throughput is very high. Estimations show that SNP costs are lower in comparison to SSR markers (Jones et al. 2007). However, efficiency and costs strongly depend on the platform used for SNP scoring, although costs can be expected to decrease in the future.

The aim of this study is to detect SNPs within candidate genes related to phenotypic traits in beech. European beech (Fagus sylvatica L.) is one of the predominant and most important tree species in European forests and covers a large geographic range in Central Europe. The species is a wind-pollinated, predominantly outcrossing, monoecious tree with heavy fruits and therefore limited seed dispersal.

So far, in beech, most studies on genetic diversity and differentiation have focused on the spatial genetic structure or on the impact of different silvicultural treatments using AFLP and microsatellite markers (Vornam et al. 2004; Buiteveld et al. 2007; Nyári 2010; Oddou-Muratorio et al. 2011). In the context of global climatic changes, with predictions of less precipitation in summer and more precipitation in winter, contradictory opinions exist on whether beech will be able to adapt to the enhanced drought stress conditions in the summer months (Gessler et al. 2007; Rennenberg et al. 2004; Ammer et al. 2005). Another effect of the predicted global change is an extended growing season, which will influence the growth of beech in the future. Earlier bud burst can be expected, which will lead to an increasing risk of late frost damage. The analysis of the variation within ‘candidate’ genes potentially involved in adaptation to a phenotypic trait is one possibility to investigate the genetic background of adaptation. Until now, only a few studies have aimed to identify genes that are involved in drought stress response and bud phenology in beech (Lalagüe et al. 2010), and only a limited number of beech sequences are available (Jimenez et al. 2008; Olbrich et al. 2005, 2010; Schlink 2011). Therefore, the candidate gene approach described here is based on both published F. sylvatica sequences and orthologous sequences identified in other plant species such as oaks (Gailing et al. 2009; Vornam et al. 2011).

SNPs were analyzed in both coding (exons) and non-coding regions (introns) of the identified genes. For the purpose of using SNP markers in addition to or in place of microsatellite markers, it is necessary to analyze both regions. For the study of adaptation, SNPs in coding regions that change the amino acid composition of the gene products (non-synonymous SNPs) are most interesting, but non-coding regions can also be of relevance. Whereas non-synonymous SNPs potentially lead to changes of protein structures, SNPs in intron regions potentially influence gene splicing, which enables a single gene to increase its coding capacity by producing several structurally distinct isoforms (Baek et al. 2008).

The results described here are a prerequisite for association mapping in natural populations in order to identify SNPs correlated to phenotypic traits like drought stress response and bud phenology. Other applications of the analysis of SNPs are, for example, population genetic studies concerning the history, structure and demography of populations or molecular systematic studies and parentage analyses (Garvin et al. 2010; Morin et al. 2004). The SNPs identified in this study are suitable for population genetic investigations complementing other frequently used markers such as microsatellites and AFLPs. Furthermore, this study provides the first estimates of nucleotide and haplotype diversity in F. sylvatica.

Materials and methods

Plant material

Fresh leaves were sampled in early summer 2009 in three different regions of northern Germany along a rainfall gradient (Table 1). All stands are jointly investigated by several research groups within the collaborative project ‘Climate Impact and Adaptation Research in Lower Saxony’ (KLIFF; http://www.kliff-niedersachsen.de.vweb5-test.gwdg.de/?page_id=26). Each region is represented by two populations differing in their soil type. Three trees per population were used for SNP identification. Thus, the total sample size was 3 (regions) × 2 (populations/region) × 3 (trees/population) = 18 trees. The investigated trees were separated by a distance of at least 50 m to minimize the risk of sampling related plants (Vornam et al. 2004).

Table 1 Sampling sites in Germany, Lower Saxony, and Saxony-Anhalt

Selection of candidate genes

All candidate genes have been chosen based on literature surveys suggesting an impact of the genes on either drought stress or bud phenology (Table 2). The Evoltree EST database (http://www.evoltree.org) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) were mainly used to find corresponding F. sylvatica sequences. Alternatively, sequences of Quercus petraea were transferred to F. sylvatica (Vornam et al. 2007; Vidalis 2011). The selected sequences were verified by a TBLASTX search (Washington University Basic Local Alignment Search Tool Version 2.0) and used for primer design in order to amplify the corresponding genomic regions in beech.

Table 2 Selected candidate genes related to drought stress response or bud phenology

DNA isolation, amplification, cloning, and sequencing

Total DNA was extracted from leaves using the DNeasy™ 96 Plant Kit (Qiagen, Hilden, Germany). The amount and the quality of the DNA were analyzed by 0.8 % agarose gel electrophoresis with 1 × TAE as running buffer (Sambrook et al. 1989). DNA was stained with ethidium bromide, visualized by UV illumination, and compared to a Lambda DNA size marker (Roche).

Primers for amplification and direct sequencing of the amplification product (Table 3) were designed by using the program Primer3 (v.0.4.0; Rozen and Skaletsky 2000; http://frodo.wi.mit.edu/). Primers were checked for self-annealing, dimer, and hairpin formations using the program Oligo calc: Oligonucleotide Properties Calculator (http://www.basic.northwestern.edu/biotools/oligocalc.html). PCR amplifications were conducted in a 15 μl volume containing 2 μl of genomic DNA (about 10 ng), 7.5 μl HotStarTaq Master Mix Kit (Qiagen, Hilden, Germany), and 0.3 μM of each forward and reverse primer. The PCR protocol consisted of an initial denaturation step of 95 °C for 15 min, followed by 35 cycles of 94 °C for 60 s (denaturation), different temperatures according to the primers (Table 3) for 45 s (annealing), 72 °C for 90 s (extension), and a final extension step of 72 °C for 20 min.
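As a rough cross-check of the cycling protocol above, the programmed hold times can be summed. The following is a minimal sketch under stated assumptions: the variable and function names are illustrative, ramp times between temperature steps are ignored, and the annealing temperature is left out because it is primer-specific (Table 3).

```python
# Hold times of the PCR program described in the text, in seconds.
# Assumption: ramp times between temperature steps are ignored.
INITIAL_DENATURATION_S = 15 * 60  # 95 °C for 15 min
CYCLES = 35
CYCLE_STEPS_S = {
    "denaturation": 60,  # 94 °C
    "annealing": 45,     # primer-specific temperature (Table 3)
    "extension": 90,     # 72 °C
}
FINAL_EXTENSION_S = 20 * 60       # 72 °C for 20 min


def total_hold_time_s() -> int:
    """Return the sum of all programmed hold times in seconds."""
    per_cycle = sum(CYCLE_STEPS_S.values())
    return INITIAL_DENATURATION_S + CYCLES * per_cycle + FINAL_EXTENSION_S


if __name__ == "__main__":
    print(f"Total hold time: {total_hold_time_s() / 60:.2f} min")
```

Under these assumptions, the program holds for 8,925 s in total, that is, about 2.5 h per run before ramping is taken into account.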

Table 3 Primer sequences and corresponding annealing temperatures for the selected candidate genes (Accession No: EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/))

PCR products were analyzed by 1 % agarose gel electrophoresis with 1 × TAE as running buffer (Sambrook et al. 1989). DNA was stained with ethidium bromide and visualized by UV illumination. PCR products were excised from gel and purified using the Geneclean® kit (MP Biomedicals, Illkirch, France). The purified products were cloned into a pCR2.1 vector using the TOPO TA Cloning® kit (Invitrogen, Carlsbad, CA) with slight modifications. The inserts were amplified by colony PCR using M13 forward (-20) (5′-GTAAAACGACGGCCAG-3′) and M13 reverse (5′-CAGGAAACAGCTATGAC-3′) primers, visualized by agarose gel electrophoresis, excised from the gel and purified (see above). Three to four different clones of the fragments were sequenced using both M13 forward and M13 reverse primers in order to identify the presence of different haplotypes within individuals (heterozygotes) and to control for sequencing errors. The sequencing reaction was carried out with the Big Dye® Terminator v.3.1. Cycle Sequencing Kit (Applied Biosystems) based on the dideoxy-mediated chain termination method (Sanger et al. 1977). Sequencing reactions were run on an ABI 3100xl Genetic Analyser (Applied Biosystems). The sequenced fragments were verified by a TBLASTX search. Putative introns and exons were determined following the GT-AG rules (Breathnach et al. 1978).
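The GT-AG rule used to delimit putative introns can be expressed as a simple boundary check. The sketch below is only an illustration of the rule itself, not the procedure used in this study; the function name and the example sequences are hypothetical.

```python
def follows_gt_ag_rule(intron: str) -> bool:
    """Return True if a putative intron starts with GT and ends with AG.

    The sequence is expected on the coding strand, 5' to 3'.
    """
    seq = intron.upper()
    # An intron needs at least the two dinucleotides themselves.
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    for candidate in ("GTAAGTTTCAG", "GCAAGTTTCAG"):
        print(candidate, follows_gt_ag_rule(candidate))
```

Such a check only confirms the canonical donor and acceptor dinucleotides; in practice, the exon boundaries also have to be consistent with the reading frame inferred from the TBLASTX alignments.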

Data analysis

For editing and visual examination of the sequences as well as for the analysis of SNPs and indels (insertions/deletions) within the genes, the sequences were aligned using CodonCode Aligner (CodonCode Corporation, http://www.codoncode.com) and BioEdit version 7.0.9.0 (Hall 1999) using ClustalW multiple alignment (Thompson et al. 1994). Only polymorphisms with Phred scores above 25 in the chromatograms were considered (Ewing et al. 1998). Only SNPs appearing at least twice were analyzed in order to avoid sequencing errors. Haplotype diversity, nucleotide diversity (π), and F_ST values were calculated excluding indels using DnaSP v.5.0 (Librado and Rozas 2009).
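The quantities named above can be illustrated with a short sketch. It assumes the standard definitions: a Phred score Q corresponds to an error probability of 10^(−Q/10), nucleotide diversity is the mean number of pairwise differences per site, and haplotype diversity is n/(n − 1) × (1 − Σ p_i²). Whether DnaSP applies exactly these estimators with these options in this study is an assumption; the function names and the handling of gaps (removal of every alignment column containing a gap) are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def phred_to_error_prob(q: float) -> float:
    """Base-calling error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def drop_gap_columns(seqs):
    """Remove alignment columns that contain a gap ('-') in any sequence."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi), gaps excluded."""
    seqs = drop_gap_columns(seqs)
    n_sites = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * n_sites)


def haplotype_diversity(seqs):
    """Hd = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    seqs = drop_gap_columns(seqs)
    n = len(seqs)
    counts = Counter(seqs)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


if __name__ == "__main__":
    # Hypothetical toy alignment of four sequences, for illustration only.
    alignment = ["ACGTAC-GTA", "ACGTAC-GTA", "ACGAACTGTA", "ACGAACTGTT"]
    print(phred_to_error_prob(25))
    print(nucleotide_diversity(alignment))
    print(haplotype_diversity(alignment))
```

A Phred threshold of 25 corresponds to an error probability of roughly 0.3 % per base call, which, together with the requirement that a SNP appears at least twice, makes false positives from sequencing errors unlikely.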

Results

Fragments from ten different genes were successfully amplified, identified, and analyzed. After sequencing, all fragments were verified using TBLASTX search. Any similarity with an E-value of less than 10⁻³ was considered to be a hit. In total, 9,468 bp were analyzed with 4,418 bp in exon regions and 5,050 bp in intron regions (Table 4). All exons and introns could be determined following the GT-AG rule. No alternative splicing was found. The reading frame was assessed according to the TBLASTX results (see above).

Table 4 Length, exons, introns, indels, and SNPs of the amplified candidate genes

Insertions/deletions

In seven different genes, 11 indels (insertions/deletions) were identified, mainly in intron regions (Table 4). Some of them showed a microsatellite repeat motif (see supplementary material). Only two indels, both also represented by microsatellite motifs, were found within coding regions (genes ERD and CHZFP). The lengths of these indels were multiples of 3 bp; thus, the reading frame is not shifted.

Single nucleotide polymorphisms

Single nucleotide polymorphisms appearing only once were excluded from the analyses in order to avoid the selection of false positives caused by sequencing errors, although they could be true SNPs. Therefore, only common SNPs are presented here, which may also be present in F. sylvatica trees in European regions other than those investigated in this study. Considering these limitations, a total of 63 SNPs were found, unevenly distributed over the analyzed gene fragments. The results indicate that many of these SNPs are linked (see supplementary material). Excluding the potentially linked SNPs from the analysis, 45 SNPs remain. However, because of the low number of investigated trees, the linkage of these SNPs is not unambiguous and it is not possible to clearly define a set of tag SNPs.

More SNPs were found in non-coding regions (1 SNP every 112 bp) than in coding regions (1 SNP every 245 bp). Eighteen SNPs were found in coding regions, and seven of them were non-synonymous. All non-synonymous SNPs led to an amino acid exchange; none caused a premature stop codon. The number of haplotypes ranged from one to eleven. The nucleotide diversity (π) was higher at non-coding sites than at coding sites for most genes. Exceptions are the genes GPX and PhyB, for which the investigated non-coding regions were very short (Table 5). Furthermore, the nucleotide diversity at synonymous sites was in most cases higher than at non-synonymous sites (Table 5).
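The reported SNP densities follow directly from the fragment lengths and SNP counts given in this section. The sketch below reproduces the arithmetic; the number of non-coding SNPs is not stated explicitly and is derived here as the difference between all SNPs and the coding SNPs, which is an assumption.

```python
# Values taken from the Results section.
EXON_BP = 4418
INTRON_BP = 5050
TOTAL_SNPS = 63
CODING_SNPS = 18

# Assumption: every SNP outside coding regions lies in the analyzed introns.
NONCODING_SNPS = TOTAL_SNPS - CODING_SNPS


def bp_per_snp(length_bp: int, n_snps: int) -> float:
    """Average number of base pairs per SNP."""
    return length_bp / n_snps


if __name__ == "__main__":
    print(f"non-coding: 1 SNP every {bp_per_snp(INTRON_BP, NONCODING_SNPS):.0f} bp")
    print(f"coding:     1 SNP every {bp_per_snp(EXON_BP, CODING_SNPS):.0f} bp")
```

With 45 non-coding SNPs in 5,050 bp and 18 coding SNPs in 4,418 bp, the calculation gives about 112 bp and 245 bp per SNP, in agreement with the values above.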

Table 5 Haplotype diversity and nucleotide diversity for the different genes (syn.: synonymous)

F_ST was analyzed by grouping the studied trees according to their region (Calvörde, Göhrde or Unterlüß); each region includes trees from two different populations. The detected values were rather low, between 0 and 0.157 with a mean value of 0.012 (Table 5). This mean value is comparable to the results of a study analyzing the same populations with nine microsatellite markers (F_ST: 0.022; Seifert, unpublished). However, the strongest differentiation with microsatellite loci was 0.032, whereas some candidate genes showed a considerably higher differentiation. The highest differentiation was found for the genes ALDH, ERD (Part 1), and IDH (Part 2) with values above 0.05 (Table 5). Derory et al. (2010) found comparable results for SNPs analyzed in candidate genes and microsatellites for Q. petraea.
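To make the differentiation measure more concrete, the following sketch computes a sequence-based F_ST for groups of aligned sequences. The estimator applied by DnaSP in this study is not specified in the text; as an assumption, the sketch uses the form F_ST = 1 − H_w/H_b of Hudson et al. (1992), where H_w and H_b are the mean numbers of pairwise differences within and between groups. All names and the toy data are hypothetical.

```python
from itertools import combinations, product


def pairwise_diff(a: str, b: str) -> int:
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(groups):
    """Mean pairwise differences within groups, averaged over groups."""
    values = []
    for seqs in groups:
        pairs = list(combinations(seqs, 2))
        values.append(sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs))
    return sum(values) / len(values)


def mean_between(groups):
    """Mean pairwise differences between groups, averaged over group pairs."""
    values = []
    for g1, g2 in combinations(groups, 2):
        pairs = list(product(g1, g2))
        values.append(sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs))
    return sum(values) / len(values)


def fst_hudson(groups) -> float:
    """F_ST = 1 - H_w / H_b; needs >= 2 sequences per group and H_b > 0."""
    return 1 - mean_within(groups) / mean_between(groups)


if __name__ == "__main__":
    # Hypothetical toy data: two groups with two sequences each.
    regions = [["AAAA", "AAAT"], ["AATT", "AATT"]]
    print(fst_hudson(regions))
```

With small samples such as the six trees per region used here, estimates of this kind have a large variance and can even become negative, so values for single gene fragments should be interpreted with caution.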

The partial sequence encoding aldehyde dehydrogenase was found to be of special interest. All but one of the detected SNPs (non-coding, synonymous, and non-synonymous) were represented in three different haplotypes. The indel found in the non-coding region is also linked to two other non-coding SNPs. Within this gene fragment, two non-synonymous SNPs were identified in the same codon, which were not linked. Therefore, three different amino acids are encoded depending on the combination of the SNPs indicating different lineages of the alleles. The dehydrin sequence is also interesting because the larger part of the sequence represents an exon region in which two SNPs were detected. Both of them are non-synonymous, and one is linked to the third non-coding SNP.

The position of the SNPs in the gene regions and additional information about the composition of the indels can be found in the supplementary material.

Discussion

In this study, parts of ten different candidate genes have been investigated. Because of the limited sequence information for F. sylvatica, it was not possible to sequence whole genes. However, this study was able to detect numerous SNPs and indels in non-coding and, probably more importantly, in coding regions of genes potentially involved in drought stress response and bud phenology. Most of the indels were found in intron regions. Only two were located in exon regions. Indels in exon regions are important due to their influence on protein structures and thus on phenotypic traits (for example, reviewed by Li et al. 2002). However, short indels, like the ones that were found in this study, seem to have only a minor impact on protein structures (Kim and Guo 2010). Because SNPs appearing only once were excluded from the analysis, the presented data most likely underestimate the number of SNPs. Other reasons for underestimating the number of SNPs are the limited number of investigated samples and the sequencing of only three to four clones, which does not allow all heterozygotes to be identified correctly. Nevertheless, 63 reliable SNPs were found. As expected, more SNPs were found in non-coding regions than in coding regions and the nucleotide diversity was higher in non-coding sites than in coding sites.

Some of the non-synonymous SNPs detected in this study are of special interest because they might have an influence on the protein structure and protein function. For example, one non-synonymous SNP found in the partial sequence encoding aldehyde dehydrogenase codes for proline, which leads to conformational changes of the protein (Chou and Fasman 1974). The first non-synonymous SNP found in the partial dehydrin gene sequence leads to an amino acid substitution from aspartic acid to histidine, which also implies a changed charge profile between the genotypes, from negatively charged to positively charged.
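How a single base substitution translates into an amino acid exchange can be illustrated with the standard genetic code. The sketch below classifies a substitution within one codon as synonymous, non-synonymous, or nonsense. The codon GAT used in the example is hypothetical: the actual codons affected in the dehydrin fragment are not given in the text, and GAT to CAT is merely one substitution that changes aspartic acid (D) to histidine (H).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon: str, position: int, alt_base: str):
    """Classify a substitution at a 0-based position within a codon.

    Returns (reference amino acid, alternative amino acid, class).
    """
    ref = codon.upper()
    alt = ref[:position] + alt_base.upper() + ref[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"  # premature stop codon
    else:
        kind = "non-synonymous"
    return ref_aa, alt_aa, kind


if __name__ == "__main__":
    print(classify_snp("GAT", 0, "C"))  # hypothetical Asp -> His exchange
    print(classify_snp("GAT", 2, "C"))  # third-position change, still Asp
```

The same lookup also shows why third-position substitutions are often synonymous, which is consistent with the higher nucleotide diversity observed at synonymous sites.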

The nucleotide diversity (π × 10⁻³) found in this study ranged from 0 to 6.62 and is comparable to the nucleotide diversities analyzed in other tree species, for example, Q. petraea (1.09–14.7, Derory et al. 2010; 3.02–11.96, Gailing et al. 2009), Quercus crispula (6.67–7.21, Quang et al. 2008), P. tremula (2.7–18.8, Ingvarsson 2004), Pinus taeda (0.1–11.79, González-Martínez et al. 2006), and Pseudotsuga menziesii (2.37–13.78, Krutovsky and Neale 2005). The mean nucleotide diversity of 2.64 for F. sylvatica is comparatively low (Q. petraea: 6.15 or 5.42; Q. crispula: 6.93; P. tremula: 11.1; P. taeda: 7.5; P. menziesii: 6.55). One reason for lower nucleotide diversity values may be the exclusion of all SNPs appearing only once from the analysis (see above). However, the significance of mean values for nucleotide diversity depends on the analyzed candidate genes. Olson et al. (2010) also found in Populus balsamifera that the diversity is affected by the functional classification of the analyzed candidate genes. They found higher diversity in gene fragments with insertion/deletion length variation (indels) than in fragments that did not contain indels. Studies that do not include regions with length variation may slightly underestimate the overall level of nucleotide diversity.

The variation found in this study can be used to develop SNP markers and to apply them in addition to or instead of neutral SSR or AFLP markers. SNP markers are better suited for many applications because they are suitable for high-throughput analysis, inexpensive, highly reproducible, easy to score, and comparable between different laboratories, and some SNPs clearly show higher differentiation values. Although SNP markers have some advantages, there are also some drawbacks that have to be discussed. A disadvantage of SNPs is their normally biallelic character. Thus, they are less polymorphic than SSRs. To replace ten to twenty highly polymorphic SSR markers, around 100 neutral SNP markers are necessary (Kalinowski 2002). However, the virtually unlimited number of SNP markers in the different parts of the genome of higher organisms creates opportunities for the investigation of genetic variation within species with numerous applications in population genomics. Furthermore, an ascertainment bias can occur, that is, a deviation from the expected allele frequency distribution when the SNPs are identified based on only a few individuals and later used for the genotyping of a large sample set. This problem can be overcome, for example, by a direct correction of the statistical estimators (Helyar et al. 2011).

Another important application of SNPs is the study of adaptation. The SNPs found in this study can be useful, for example, to extend the investigation of Kraj and Sztorc (2009), who analyzed the variability of phenological forms (bud burst) in beech using a set of five microsatellite markers. They pointed out that the neutral microsatellite loci are not directly linked with adaptive genetic variation and that the genetic differences between the phenological forms of beech (early-, intermediate-, and late-flushing individuals) have therefore no direct effect on the fitness of these forms. However, genetic diversity and fitness are the basis for the ability of forest tree populations to adapt to changes of the environment (Krutovsky and Neale 2005). Because forest trees are continuously challenged by changing environmental conditions during their lifetime, adaptive genetic variation in relevant genes and phenotypic plasticity are essential for the long-term adaptation to stressful conditions. Thus, the knowledge of adaptive genetic variation is a basis for future management and conservation strategies of forests (Krutovsky and Neale 2005) and can assist in breeding in combination with traditional phenotypic selection (Neale 2007). Furthermore, the results presented here are a prerequisite for association mapping studies in order to identify genomic regions and even individual nucleotides underlying phenotypic variation. The success of such an approach largely depends on the reasonable selection of candidate genes. This study revealed large differences in diversity among the investigated candidate genes. Whereas genes with a regulatory function, such as the cys-his-zinc finger protein (CHZFP) representing a transcription factor, show low or moderate SNP variation, genes with a structural function, such as ascorbate peroxidase, show comparatively high SNP variation.

In view of the above considerations, we propose to apply the genomic resources developed for beech by the identification and characterization of SNPs in coding and non-coding regions of candidate genes to investigate both the genetic basis of adaptive variation and the population structure of beech at the ultimate level of genetic resolution. In the future, it will be possible to use whole genome sequencing for these applications. However, considering the costs and the possibilities currently available for non-model organisms, the comparative sequencing of (partial) genes and the identification of SNPs presented in this study is the best available method.