A genomic dataset of single‐nucleotide polymorphisms generated by ddRAD tag sequencing in Q. petraea (Matt.) Liebl. populations from Central-Eastern Europe and Balkan Peninsula

This genomic dataset provides highly variable SNP markers from georeferenced natural Quercus petraea (Matt.) Liebl. populations collected in Bulgaria, Hungary, Romania, Serbia, Bosnia and Herzegovina, Kosovo* and Albania. These SNP loci can be used to assess genetic diversity, differentiation, and population structure, and can also be used to detect signatures of selection and local adaptation. The dataset can be accessed at https://doi.org/10.5281/zenodo.3908963/ (Tóth et al.2020). Associated metadata available at https://metadata-afs.nancy.inra.fr/geonetwork/srv/fre/catalog.search#/metadata/b6fee4fa-01e9-44d0-92f5-ad19379f9693.


Background
In the last decade, restriction site-associated DNA (RAD-seq) sequencing is increasingly used to identify and genotype large numbers of single-nucleotide polymorphisms (SNPs) in various organisms (Allendorf et al. 2010;Davey and Blaxter 2010). Due to the advancements of high-throughput sequencing, this approach allows to produce extremely large collections of genomic data that is fairly evenly distributed across the genome providing hundreds of thousands of polymorphic loci (Allendorf et al. 2010;Seeb et al. 2011). A variation of the approach, known as ddRADseq (double-digest RAD-seq), has been widely applied for population genomic analyses in non-model organisms, such as forest tree species, lacking available reference genomes (Konar et al. 2017). ddRAD-seq is suitable for genotyping a large number of loci to determine the extent of genetic diversity, to identify outlier loci being under selection (Schwartz et al. 2010;Burak et al. 2018) and to infer eco-evolutionary population dynamics (Legrand et al. 2017;Hendry 2019).
While forest tree populations are well-studied in Europe, the Central-Eastern European region, including the Balkan Peninsula is less investigated with high-resolution and genome-wide genetic markers. In fact, this region has high topographic and climatic heterogeneity, and long-term environmental stability that significantly contributed to long-term persistence of genetic diversity (Tzedakis 2004;Feliner 2011;Fassou et al. 2020). In addition, the Balkan Peninsula has been postulated to be the primary source of post-glacial expansion while functioning as a Pleistocene glacial refugia (Zhelev 2017;Gömöry et al. 2020). Therefore, this area is considered an important reservoir of genetic resources of European forest tree taxa (Gömöry et al. 2020).
Sessile oak (Quercus petraea (Matt.) Liebl.) is a widely distributed late-successional forest tree species of the white oak group (sect. Lepidobalanus) with great ecological and silvicultural importance in Europe (Mölder et al. 2019;Blanc-Jolivet et al. 2020). However, the species' distribution towards the southern limit on the Balkan Peninsula tends to be fragmented (Bruschi et al. 2003;Eaton et al. 2016), and being under intensive forest management and severe ecosystem disturbances such as drought (Milad et al. 2011). Similarly, species' intraspecific genetic differentiation increases towards the southeast due to its complex evolutionary history. Since current studies mostly focusing on western European populations, the southeastern European distribution range is poorly sampled and investigated (Lang et al. 2020;Leroy et al. 2017Leroy et al. , 2020Kremer and Hipp 2020). This inequality might potentially lead to underestimation of regional genetic diversity. For these reasons, we generated a novel resource of genome-wide SNPs, using ddRAD-seq, in carefully selected natural populations.

Sampling strategy
The sampling of plant material was designed to cover the southeastern natural distribution range of Q. petraea. Sites were selected based on literature data and personal communication with local forestry experts (contributing authors). For literature data, available information on neutral genetic patterns in European white oaks was reviewed (Dumolin-Lapegue et al. 1997;Gömöry et al. 2001Gömöry et al. , 2020Bordács et al. 2002;Csaikl et al. 2002;Petit et al. 2002a, b;Slade et al. 2008;Dering et al. 2008), and the geographic regions corresponding to the major genetic clusters were included in a candidate list of locations. Existing databases of forest genetic resources also contributed in  Fig. 1). Out of these, four populations situated in Bulgaria (western Balkan, Rila, Rhodope, and Strandzha mountains), three in Hungary (Kőszeg, Bakony, and Mecsek mountains), Romania (Gurghiu, northern and southern foothills of Fagaras mountains), Serbia (Fruska Gora, Rudnik, and Stovoli mountains), Bosnia and Herzegovina (Kozara, Javorova, and Maglic mountains), and one in Kosovo* (Blinaja) and Albania (Djeravica Mt.), respectively (Tóth et al. 2020). Our sampling sites are found in natural sites either as part of nature reserves or without non-natural disturbance and human influence. Accordingly, sessile oak populations were regenerated naturally. Fresh leaves were collected from even-aged mature trees (> 70-year-old) by considering at least a 30-m isolation distance between trees to limit sampling-related individuals. Spatial coordinates (GPS) of each tree in each population were recorded. Leaf samples were both dried using silica gel and stored frozen (− 20 °C) in plastic bags before DNA extraction.

Library preparation and RAD-tag sequencing
Total genomic DNA was extracted from seven leaf disks (5-mm diam.) per each individual using the method of Dumolin et al. (1995).
DNA of each Q. petraea sample was quantified by using the Qubit dsDNA BR Assay Kit and Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). A total of 50 ng of DNA per sample was double-digested with 0.1-0.1 μl of PstI and MspI enzymes at 37 °C for 2 h (FastDigest restriction enzymes; Thermo Fisher Scientific, Waltham, MA, USA). The enzyme combination was selected based on the study of Cumer et al. (2018) and a preliminary restriction site analysis carried out in CLC Genomic Workbench version 12.0 (QIAGEN Bioinformatics, Hilden, Germany). Fragments were double-sided size selected using KAPA PureBeads, with a 0.55-0.80X solution/bead ratio (Roche, Basel, Switzerland) in order to isolate fragments in the range of 300-600 bp. After quantification, inserts (3 ng) were ligated to adapters (Table 2) with using T4 DNA Ligase according to the manufacturers' protocol (Thermo Fisher Scientific, Waltham, MA,

De novo assembly and SNP calling
All bioinformatics processing was carried out on a Silicon Computers (SGI) HPC server, allocating 40 cores (80 threads) and 38 GB RAM, located at the University of Sopron (UOS), Sopron, Hungary.  Raw short-read sequences (77,101,088) were demultiplexed and adapter-trimmed by using the MiSeq Control Software (Illumina, San Diego, CA, USA). In total, 69,993,001 sequences contained adapter sequences kept for further processing. Next, the implemented FastQ Toolkit was applied to trim bases at the 3′ and the 5′ end with a quality score lower than 30. Reads having a mean quality score less than 30 and shorter than 200 bp were filtered. Computational processing of short-read data was carried out with Stacks 2.0 (Catchen et al. 2013;Rochette et al. 2019). Whole reads were quality-filtered using a sliding-window method (15% of read length) implemented in "process_radtags." Reads having a quality score below 90% (raw phred score of 10) were discarded (Catchen et al. 2011). In "process_radtags," reads were truncated to 200 bp as a prerequisite of further processing and to avoid the lower-quality bases to present at the end of the reads (Catchen et al. 2011). During this filtering step, 42,273 sequences were discarded. In addition, one individual was removed from the dataset (BU2-10) since an insufficient number of high-quality reads remained after filtering. RAD loci were reconstructed following the de novo pipeline implemented in "denovo_map.pl" command, in which "ustacks" builds loci and calls SNPs in each sample, then "cstacks" creates a catalogue of all loci for all the samples and "sstacks" matches the loci of the samples against the catalogue (Catchen et al. 2011;Rochette et al. 2019). To execute "denovo_map.pl," it is essential to optimize the parameters of the command, and define (m) the minimum number of reads required to form an allele, (M) the maximum number of mismatches allowed between two alleles, and (n) the maximum number of mismatches allowed between two individual loci to consider them as homologous (Mastretta-Yanes et al. 2015;Paris et al. 2017). We optimized these parameters throughout the "r80" method, which can effectively maximize the number of SNPs found in 80% of the individuals (Paris et al. 2017). We checked the iterative values (ranging from 2 to 10) of M and n how they affect the gained SNPs (Fig. 2a). After, based on this, we chose the threshold of M = 3, n = 3, applying the M = n rule for the final run (Paris et al. 2017). For the stack-depth parameter, we selected the default m = 3 value (three identical reads). The program "gstacks" was used to assemble paired-end contigs and re-call SNPs using the population-wide data (Rochette et al. 2019).
The "denovo_map.pl" pipeline aligned paired-end reads for a total of 112,750 RAD loci in which 1490 loci had paired-end reads that could not be assembled (1.3%). The remaining 111,260 loci (98.7%) had an average contig size of 361.8 bp. The effective per-sample coverage resulted in 14.3 × (mean) with a 2.2 × standard deviation (min = 10.1 × , max = 27.0 ×) (Fig. 2b). Population-specific statistics are presented in Table 3.
The set of SNPs were called and further processed with the "populations" program (Catchen et al. 2011), in which markers with a minor allele frequency < 0.05, missing individual rate > 0.8, and significant deviation from Hardy-Weinberg equilibrium (HWE, P < 0.001) were filtered out (Catchen et al. 2011). However, the "minimum number of populations" value was set to 1 (SNPs were kept when they were present in at least one population), allowing to be later filtered by the user for the specific research question. Out of the 111,260 RAD loci, 105,326 loci did not pass the missing rate of sample/population and were below the MAF threshold; also, 1099 loci were blacklisted and discarded having variants with significant HWE exception. The final dataset consisted of 5934 loci, composed of 2,443,502 bases, of which 21,951 are highly polymorphic variant sites (SNPs).
The data file presented in this paper is a PLINKformatted UTF-8 encoded, white-space (tab)-delimited, set of ped and map files. PLINK is a publicly available and widely used software for genomic data manipulation and analysis, capable of converting to other file formats and can perform several population-based tests (Purcell et al. 2007). Within the ".ped" file, the first column corresponds to the population code (Pop_ID: 1-18), the second to the individual identifier. From column three to six, values are not provided (father ID, mother ID, sex, and phenotype are not defined). From column seven, two columns code an observed allele at each SNP position. Missing data are coded as "0 0" as it is defined in the standard PLINK data format (PLINK 1.07; http:// zzz. bwh. harva rd. edu/ plink/). The ".map" file contains four columns, the first is the chromosome number (un: unknown due to de novo assembly), the second is the SNP ID, the third is the SNP genetic position (0: not defined), and the fourth column contains the physical position (bp) within the RAD loci. Users do not need to combine or merge the files; it is ready to be loaded and  analyzed. Population IDs (Pop_ID), individual IDs (Ind_ ID), and the corresponding GPS coordinates (GPS_1: latitude and GPS_2: longitude) are located in the Appendix Table 5. Further information can be requested from the corresponding author.

Technical validation
Validation of the dataset was performed with manual, visual, and numerical tests. The quality of raw short-read data, of all individuals (179), was assessed using FastQC v0.11.9 (Andrews 2010) by investigating per-base quality and per-sequence quality at three steps: at raw read state directly after sequencing, after sequence quality trimming (3′ and 5′ end), and following sequence processing (whole read filtering). Results were consistent with a standard Illumina run, and therefore considered to be of sufficiently high quality for further analysis (Kircher et al. 2011).

Reuse potential and limits
Data papers on genome-wide polymorphisms are extremely scarce especially on geographic regions that are sources of biodiversity and in case of forest trees lacking a reference genome. Our data can significantly contribute to the growing number of RAD-seq studies by providing genotype information on species' rarely investigated southeastern distribution range.
To evaluate the potential of the data, at the population level, we calculated genetic diversity indices including the expected heterozygosity (H e ), observed heterozygosity (H o ), and the inbreeding coefficient (F IS ) with R (R Core Team 2013) using the "adegenet" (Jombart and Ahmed 2011) and "hierfstat" (Goudet 2005) packages (Table 4). Trends of our resulting indices were compared to Blanc-Jolivet et al. (2020), since up until now, no SNP dataset (i.e., SNP genotype collection) from genomic resources has so far been published for Q. petraea. Since our estimators are in accordance with the Blanc-Jolivet et al.
(2020) data, as well as with other publications, we can confirm that our panel of markers is suitable for population genomic studies.
It is important to note, however, that polymorphisms from reduced representation methods, such as RADseq, might introduce biases in population genetics estimates (Arnold et al. 2013;Andrews et al. 2016), and phylogenetic reconstructions and genomic mappings might require even larger genomic coverage as well as higher SNP density (Sims et al. 2014;Eaton et al. 2017). Despite of these, our generated high-quality SNP resource consisting of 21,951 SNPs provides a valuable tool for future population genetics and genomics studies allowing to carry out wide variety of reference-genomefree investigations (Peterson et al. 2012). The available spatial information (GPS coordinates; Table S1) allows the analyses of spatial genetic pattern, structure (including admixture), and gene flow and hybridization. Due to the relatively large-scale genome representation, finescale determination of relationship among individuals and populations is also possible. For example, in the context of phylogeography, our genotype data may supplement studies on evolutionary history, migration from glacial refugia, and past hybridization events increasingly studied nowadays (Leroy et al. 2017(Leroy et al. , 2020. Since ddRAD-seq has the potential to generate strongly differentiating markers, our data could be used to investigate patterns of divergent selection processes, thus can infer population-level adaptation. Overall, this dataset provides a unique opportunity to study the genetic background of natural populations and reveals important ecological and evolutionary insights into the present and future distribution of sessile oak.  (Nei 1978), H e expected heterozygosity (Nei 1978), F IS inbreeding coefficient (ns, non-significant;Nei 1987)