A set of nuclear SNP loci derived from single sample double digest RAD and from pool sequencing for large-scale genetic studies in the European beech Fagus sylvatica

The large-scale spatial genetic structure of European beech, Fagus sylvatica, has so far been poorly studied. We conducted double digest RAD sequencing (ddRADseq) on 54 beech individuals from 36 provenances to discover spatially informative nuclear SNP loci. In addition, two pools, one of 14 early and one of 14 late flushing individuals, were sequenced with Illumina HiSeq. From an initial 5464 loci detected by ddRADseq, we selected 559 informative loci. A further 27 loci showing significant allelic differences between early and late flushing individuals were identified after genotyping 95 test individuals; eight of these were already among the 559. The final selection of 578 loci was submitted to probe design for targeted genotyping by sequencing, which yielded 543 loci. After validation on a larger sample size, the new set of SNP loci should be useful for large-scale genetic studies in this economically important species.

European beech, Fagus sylvatica, is one of the most significant forest species in Europe. It occurs naturally in temperate regions, from France to the west of Ukraine, and from northern Spain to southern Sweden (EUFORGEN network, http://www.euforgen.org/species/fagus-sylvatica/). The first genetic studies on beech date back to the 1980s (Mueller-Starck 1985). Co-dominant isoenzyme markers, followed later by SSRs (Pastorelli et al. 2003), improved knowledge of the mating system and gene flow (Lander et al. 2021), population structure (de Lafontaine et al. 2013) and response to biotic and abiotic stress (Mueller and Gailing 2019) in this species. Nowadays, the availability of high-throughput sequencing technologies facilitates the sequencing of full genomes as well as the development and screening of single nucleotide polymorphism (SNP) loci. In beech, as in many other European forest species, many researchers have focused on polymorphism in candidate genes to detect selection processes due to climatic changes, in particular adaptation to drought (Cuervo-Alarcon et al. 2018). However, only a few large-scale genetic inventories describing the spatial genetic structure across the distribution of beech are available (Magri et al. 2006; Magri 2008; Postolache et al. 2021), although a reference genome has been published (Mishra et al. 2018). In this study, we conducted double digest restriction-site associated DNA sequencing (ddRADseq) on European beech samples covering most of its distribution range and several regions within Germany, to detect spatially informative polymorphisms. We additionally searched for loci associated with early or late bud burst, as potentially spatially differentiated polymorphisms.
* Celine Blanc-Jolivet, celine.blanc-jolivet@thuenen.de; Thünen Institute of Forest Genetics, Sieker Landstrasse 2, 22927 Grosshansdorf, Germany
We extracted DNA from fresh leaf or cambium material (Dumolin et al. 1995) of 54 beech trees sampled in the provenance trial Schädtbek, established in 1995. Our samples covered 36 provenances and 18 countries, with a focus on Germany (26 trees) (Table 1). Through this unbalanced sampling, we expected to discover informative SNP loci showing differentiation among stands within Germany, as well as spatially informative loci within Europe. First, ddRADseq (Peterson et al. 2012) was applied to discover nuclear SNP loci. We used the published beech reference genome (v1.2, Mishra et al. 2018) to map our reads and conduct variant calling (library preparation, ddRADseq and bioinformatics conducted by Floragenex, Portland, USA). A total of 5464 loci passed all quality filters using the "stringent" criteria (calling rate > 90% and flanking regions available). However, only 2988 loci were informative (minor allele frequency > 0.01). Discriminant analysis of principal components (DAPC, "dapc" in R package "adegenet", Jombart and Ahmed 2011) was conducted with four putative clusters, which arranged the samples in a west-east spatial pattern (Fig. S1). The 63 loci with the highest variable contributions were selected. The same procedure was applied separately to the 26 samples from Germany with three clusters and yielded 39 loci. To select loci with potentially high polymorphism and to exclude SNPs probably resulting from the merging of paralogous loci (no excess of heterozygotes), we grouped the data per country, and the data of the German samples per province. We estimated average Hs (within-population gene diversity), as well as Fis (inbreeding coefficient) and Ar (allelic richness), per locus ("basic.stats" in R package "hierfstat", Goudet 2005) for groups with at least two individuals.
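The per-locus quantities mentioned above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (variant calling was done by Floragenex and the statistics were computed in R). It uses simplified estimators for a single group, without the sample-size corrections that dedicated packages such as hierfstat apply, and it assumes biallelic SNPs. The function names and the toy genotypes are hypothetical; the default thresholds are the ones reported in the text (calling rate > 90%, minor allele frequency > 0.01, Fis > 0, Hs > 0.4).

```python
from collections import Counter


def locus_summary(genotypes):
    """Summarise one biallelic locus in one group of diploid individuals.

    `genotypes` holds one (allele, allele) tuple per individual,
    or None for a missing call. Estimators are uncorrected (assumption).
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes) if genotypes else 0.0
    if not called:
        nan = float("nan")
        return {"call_rate": call_rate, "maf": nan, "ho": nan, "hs": nan, "fis": nan}

    # Allele frequencies from all called gene copies
    counts = Counter(allele for g in called for allele in g)
    total = sum(counts.values())
    freqs = [n / total for n in counts.values()]

    maf = min(freqs) if len(freqs) > 1 else 0.0        # 0 for monomorphic loci
    ho = sum(1 for a, b in called if a != b) / len(called)  # observed heterozygosity
    hs = 1.0 - sum(p * p for p in freqs)               # expected heterozygosity
    fis = 1.0 - ho / hs if hs > 0 else float("nan")    # > 0: heterozygote deficit
    return {"call_rate": call_rate, "maf": maf, "ho": ho, "hs": hs, "fis": fis}


def passes_filters(summary, min_call_rate=0.9, min_maf=0.01, min_fis=0.0, min_hs=0.4):
    """Apply the thresholds quoted in the text; NaN values fail every test."""
    return (
        summary["call_rate"] > min_call_rate
        and summary["maf"] > min_maf
        and summary["fis"] > min_fis
        and summary["hs"] > min_hs
    )


if __name__ == "__main__":
    # Hypothetical group of eight trees typed at one A/G SNP
    toy = [("A", "A")] * 3 + [("A", "G")] * 2 + [("G", "G")] * 3
    summary = locus_summary(toy)
    print(summary, passes_filters(summary))
```

In the study, Hs and Fis were averaged over groups (countries or German provinces) before the thresholds were applied; the sketch handles a single group only, so an averaging step over groups would be needed to mirror that design.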
This filter retained 325 and 305 loci for the complete dataset grouped by country and for the German provenances, respectively (Fis > 0 and Hs > 0.4, Table S1). A total of 559 unique loci were finally selected, and we visualized the effect of locus selection on the differentiation among provenances (PCA, "dudi.pca" in R package "ade4", Dray and Dufour 2007; Fig. S2). Among the 54 individuals sequenced with ddRADseq, a subset of 14 early and 14 late flushing individuals was evaluated separately. From the 2996 putatively informative SNPs detected in this subset of samples, 486 were selected for their genetic differences between the two groups. We additionally conducted Illumina HiSeq 150 bp paired-end pooled sequencing (formerly GATC, Konstanz, Germany) of two groups (poolseq, 14 early and 14 late flushing individuals) with 84× coverage each. The 6800 SNPs with the highest allele frequency difference estimates (Ries et al. 2016) between the two pools were retained as top SNPs. Applying a genome scan technique (Soyk et al. 2017), nine genomic regions were identified as top scaffolds showing an enrichment of top SNPs. A total of 1006 loci were submitted to probe design to select 500 loci for targeted genotyping by sequencing (SeqSNP, LGC Genomics GmbH, Berlin, Germany). Interestingly, ddRAD and poolseq showed striking differences in probe specificities (Table S2). Genotyping was conducted on 95 test individuals, and 27 loci were identified as potentially differentiated between early and late flushing individuals. Surprisingly, eight of these 27 loci were already included in the set of 559 loci identified from the ddRAD data of all 54 individuals (potentially spatially informative and/or high diversity loci) (Table S1). Altogether, we ended with a selection of 578 loci (559 + 27 - 8).
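To make the pool comparison concrete, the following Python sketch ranks SNPs by the absolute difference in alternative-allele frequency between two pools, using raw read counts. This is a naive illustration and not the estimator of Ries et al. (2016) used in the study. The function names, SNP identifiers and read counts are hypothetical, and the toy depth of 84 reads per pool merely echoes the 84× coverage reported above.

```python
def allele_frequency_differences(early_counts, late_counts):
    """Absolute alt-allele frequency difference per SNP between two pools.

    Each argument maps a SNP id to a (ref_reads, alt_reads) tuple.
    SNPs missing from one pool or without reads in a pool are skipped.
    """
    diffs = {}
    for snp in early_counts.keys() & late_counts.keys():
        e_ref, e_alt = early_counts[snp]
        l_ref, l_alt = late_counts[snp]
        if e_ref + e_alt == 0 or l_ref + l_alt == 0:
            continue  # no coverage in one pool, frequency undefined
        f_early = e_alt / (e_ref + e_alt)
        f_late = l_alt / (l_ref + l_alt)
        diffs[snp] = abs(f_early - f_late)
    return diffs


def top_snps(diffs, n):
    """Return the n SNP ids with the largest differences (ties broken by id)."""
    return sorted(diffs, key=lambda snp: (-diffs[snp], snp))[:n]


if __name__ == "__main__":
    # Hypothetical read counts (ref, alt) at three SNPs
    early = {"snp1": (60, 24), "snp2": (42, 42), "snp3": (80, 4)}
    late = {"snp1": (21, 63), "snp2": (40, 44), "snp3": (84, 0)}
    diffs = allele_frequency_differences(early, late)
    for snp in top_snps(diffs, 2):
        print(snp, round(diffs[snp], 3))
```

Raw read-count frequencies ignore sampling noise from both the pooling of individuals and the sequencing depth, which is one reason dedicated estimators are preferred for real pool-seq data.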
From the PCA and DAPC analyses, we expect that our set of loci will be useful to disentangle genetic structure across the distribution range of beech. Indeed, the spatial grouping suggested by our data fits the findings of Postolache et al. (2021). A final set of 543 loci could be designed for SeqSNP genotyping and will be used for large-scale genetic inventories.

Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.