Genome-wide SNP discovery in native American and Hungarian Robinia pseudoacacia genotypes using next-generation double-digest restriction-site-associated DNA sequencing (ddRAD-Seq)

Robinia pseudoacacia L. (commonly known as black locust) is an economically and environmentally important plant, native to the eastern USA, and introduced into several European countries, including Hungary. An early successional leguminous tree, the black locust is characterized by tolerance to degraded sites, rapid growth rate, dense and rot-resistant wood, and heavy flowering. Due to its economic potential and environmental impact, the historic Hungarian breeding strategy targeted not only increased wood production but also in wood and honey-production quality. However, because many important features of the species are under polygenic control, genome-wide genetic data provided by high-throughput sequencing technology could make possible the identification of gene variants with identifiable functional effects on complex traits. Furthermore, the evaluation of the breeding efforts carried out so far would be also achievable, by comparing bred/selected genotypes with those from the original habitat. This paper provides a genomic dataset with highly variable SNP markers from native American and Hungarian Robinia pseudoacacia L. individuals. These SNP loci can be used to assess genetic differentiation, and to detect signatures of polygenic determination of economically important traits, providing a basis for further research into this species.

and environmental impacts have been extensively studied in the USA, Europe, and China resulting in several reviews (Cierjacks et al. 2013;Sitzia et al. 2016;Vítková et al. 2017;Nicolescu et al. 2018Nicolescu et al. , 2020Puchałka et al. 2021). It is equally clear from these works, that black locust is a fast-growing tree species with the ability to fix nitrogen. Probably that is the reason why it is a popular plant used in vegetation restoration in degraded areas (Yüksek and Yüksek 2011), thereby becoming important for the development of a better ecological surrounding (Shi et al. 2021). Besides, it is highly appreciated in the wood industry producing large amounts of dense wood, highly resistant to decay being an excellent material for various sawn wood products (Keresztesi 1988;Nicolescu et al. 2020), also possessing a high combustion potential (Nicolescu et al. 2018), important as short rotation energy crop (Straker et al. 2015;Rédei et al. 2010) and finally, is used in biotherapy, apiculture, and food industries (Beldeanu 2008).
Introduced in Europe in the early seventeenth century (Boring and Swank 1984), its appearance in Hungary dates from 1710-1720 (Keresztesi 1983). In the country, the species spread rapidly due to its ability to utilize a wide range of sites, and in the 1960s, Hungary already possessed more black locust forests than all other European countries combined (Keresztesi 1980). It became one of the most suitable species for establishing environmentally valuable recreational plantations, and nowadays, black locust stands provide approximately 21% of the annual timber supply of Hungary (FAO 2020).
Considering all these previously mentioned ecological, economic, and environmental benefits but also noting that the black locust is amongst the 100 most invasive alien species on the European continent (Vilà et al. 2009), the improvement and management of stands are of paramount importance. Selection in Hungary targeted the production of high-quality trees, an increase in wood production quantity and in wood and honey-production quality (Redei et al. 2008;Nicolescu et al. 2018;Keserű et al. 2021;Ábri et al. 2022). However, breeding by crossing requires persistent work over several decades as the black locust is a heterozygotic species with many features with polygenic background. For that reason, the results of crossbreeding are influenced by chance. With no prior information about the genome, the development of the RADseq technology raised the hope that it would be possible to identify single polymorphic variants with identifiable functional effects on complex traits. However, such a claim requires extensive genetic investigation, since the enzyme selection, sequencing type (paired-end vs. single end), hence the presence of polymorphisms in restriction sites, the possible decrease of coverage, the sequencing errors (Verdu et al. 2016), biases of the de novo assembly, and the data missingness censoring strategy (Tripp et al. 2017) may affect the development of single nucleotide polymorphism markers.
With an aim to reveal patterns of genetic variation at the genome-wide scale, in this study, we performed double digest Restriction Site-Associated DNA sequencing (ddRAD-Seq) to generate unbiased reduced representation libraries of several R. pseudoacacia complete genomes. This dataset provides the opportunity to study not only the genetic background of polygenic traits but also can reveal important insights into the genetic differences between individuals from native American and Hungarian genotypes.

Plant material
The laboratory phase of the experiment was conducted with a single sample/individual from 48 genotypes (Table S1). As one of the main aims was to compare individuals from native American populations with genotypes introduced in Hungary, the sampling was designed to include 16 North American genotypes, all from native habitats. The Hungarian sample series was divided into two major groups. One group with a total of 24 genotypes included old cultivars (Keresztesi 1988) and new candidates with excellent trunk quality, selected for breeding purposes. The second group contained eight samples collected either random, or deliberately from particularly old specimens, all with seedling origin. Thus, the "Bábolna black locust," registered as the oldest Hungarian tree from this species, an old specimen situated in front of the Hungarian Academy, and an individual of a similar age from Uzsa (Hungary) is also present in the study.

Library preparation and RAD-tag sequencing
Total genomic DNA extraction was performed from leaves per each individual using the ATMAB method (Dumolin et al. 1995) with minor modifications. The Qubit dsDNA BR Assay Kit and Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA) were used to quantify the extracted DNA of each sample. A total of 50 ng of DNA per sample was double digested with PstI and MspI (FastDigest restriction enzymes; Thermo Fisher Scientific, Waltham, MA, USA). The enzyme combination was selected based on a preliminary restriction site analysis carried out in CLC Genomic Workbench version 12.0 (QIA-GEN Bioinformatics, Hilden, Germany). Fragments were double-sided size selected using KAPA PureBeads (Roche, Basel, Switzerland), to isolate fragments in the range of 300-600 bp. Inserts were quantified (3 ng), then ligated to adapters (Table S2) by using T4 DNA Ligase according to the manufacturers' protocol (Thermo Fisher Scientific, Waltham, MA, USA).
Purification of the ligated products was performed using 0.8 vol KAPA PureBeads (Roche, Basel, Switzerland) PCR amplification with NEBNext Multiplex Oligos for Illumina (Dual Index Set 1; New England Biolabs, Ipswich, MA, USA) and KAPA HiFi Hotstart Ready Mix (Roche, Basel, Switzerland). The amount of 0.5-0.5 μl of i5 and i7 indexed primers were used per reaction. Thermal cycling conditions were as follows: a 3-min initial denaturation at 95 °C; 17 cycles of 30 s of denaturation at 95 °C, 30 s of annealing at 55 °C, and 30 s extension at 72 °C; and a final 5-min extension at 72 °C. High Sensitivity DNA1000 ScreenTape system with 2200 Tapestation (Agilent Technologies, Santa Clara, CA, USA) and dsDNA HS Assay Kit with Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA) was used to determine the quality and quantity of the amplicon library. Pooled libraries were diluted to 10 pM for 2 × 301 bp paired-end sequencing with 600-cycle sequencing kit v3.1 (Illumina, San Diego, CA, USA). Nucleotide sequences of the libraries were determined on a MiSeq Sequencing System (Illumina, San Diego, CA, USA) according to the manufacturer's protocol.

De novo assembly and SNP calling
All bioinformatic steps were performed on a Silicon Computers (SGI) HPC server, allocating 40 cores (80 threads) and 38 GB RAM, located at the University of Sopron, Sopron, Hungary.
Eight fastq files containing raw short read sequences (from four different runs: four R1 and four R2) of 48 samples each were joined resulting in cca. 96,00,000 reads. MiSeq Control Software (Illumina, San Diego, CA, USA) was used for demultiplexing and adapter-trimming of all sequences. Trimming the bases at the 3′ and the 5′ end with a quality score of less than 30 was performed with the implemented FastQ Toolkit. Reads having mean quality score less than 30 and shorter than 200 bp were filtered. As a next step, computational processing of short-read data was carried out with Stacks 2.0 (Catchen et al. 2013;Rochette et al. 2019). Reads were quality filtered for the second time using a sliding-window method (15% of read length) implemented in "process_radtags." Reads having a quality score below 90% (raw phred score of 10) were discarded (Catchen et al. 2011). RAD loci were reconstructed following the De novo pipeline implemented in "denovo_map.pl" command, in which "ustacks" build loci and call SNPs in each sample. Further, a catalogue of all loci for all the samples was created by "cstacks," then the match of loci of the samples against the catalogue was made by "sstacks" (Catchen et al. 2011;Rochette et al. 2019). Then, we optimized the parameters of the "denovo_map.pl" command by defining (m) the minimum number of reads to consider an allele, (M) the maximum number of mismatches allowed between two alleles, and (n) the maximum number of mismatches allowed between two individual loci to consider them as homologous (Mastretta-Yanes et al. 2015;Paris et al. 2017). These parameters were optimized throughout the "r80" method, by effectively maximizing the number of polymorphic loci found in 80% of the individuals (Paris et al. 2017). The iterative values (ranging from 2 to 10) of M and n were investigated by how they affect the gained polymorphic loci, then the thresholds of M = 3, n = 3 were chosen, applying the M = n rule for the final run (Paris et al. 2017). The default m = 3 value (three identical reads) was selected for the stack-dept parameter. Assembly of paired-end contigs and re-calling the SNPs using the population-wide data (Rochette et al. 2019) was used by "gstacks." Then by "denovo_map.pl" pipeline, we aligned paired-end reads for a total of 210,491 RAD loci, composed of 31,650,195 sites. One hundred forty-seven of these sites were filtered, 220,105 variant sites remained. The mean genotyped sites per locus were 150.36 bp (stderr 0.00). The depth of sequencing coverage is presented in Table S3 and Fig S1; the number of reads incorporated for each sample processed as calculated by "ustacks," in Table S3.
With the "populations" program, we called the set of SNP genotypes. Further processing was accomplished with the VCFtools program package (Danecek et al. 2011). After filtering genotypes called below 50% across all individuals and SNPs that have a minor allele count less than 3, were kept all 48 individuals and 53,711 out of a possible of 220,105 SNP sites. Applying the next filter for a minimum depth for a genotype call and a minimum mean depth, all 53,711 sites were kept. A next step was performed to assess individual levels of missing data by quantifying missingness for each sample. To observe the degree of missingness on a per sample basis, the output file (Table S4) and the histogram ( Figure S2) were examined, and based on the previous, a list of six individuals with more than 50% missing data was created, with the aim to identify these individuals in the dataset. By the depth of sequencing coverage (Table S3), these samples were in the upper coverage range of all samples, but still outside our acceptance threshold as missing.
Following the removal of these individuals (ROB-HU-CULT13, ROB-HU-CULT15, ROB-HU-CULT4, ROB-HU-CULT5, ROB-HU-CULT6, ROB-HU-CULT7), we provided a dataset restricted to variants called in a high percentage of individuals. In a next step, we filtered by mean depth of genotypes. By applying a genotype call rate (95%) across all 42 individuals, our final dataset consisted of 112 SNP sites.
We validated our data with manual, visual, and numerical tests. The quality search of raw short read data of all individuals (48) was assessed using FastQC v0.11.9 (Andrews 2010) by investigating per-base quality and persequence quality at three steps: at raw read state directly after sequencing, after sequence quality trimming (3′ and 5′ end), and following sequence processing (whole read filtering). Results were consistent with a standard Illumina run, and therefore considered to be of sufficiently high quality for further analysis (Kircher et al. 2011).

Acknowledgements
The authors are grateful to Tamás Fonyó (UOS-FRI, Sopron, Hungary) for his assistance in providing computational resources. We also thank Zoltán Bihari (Xenovea Ltd.) for providing sequencing resources and useful comments on data analysis.
Author contribution KC, ABE, and ZAK conceived the study. ZK, JHF, BB, TÁ, and ABE contributed in sampling of plant materials. KC performed laboratory analysis. ZAK and EGT processed genomic data and performed statistical analyses and simulations. ZAK wrote the manuscript. All co-authors provided feedback on the manuscript drafts.
Funding Open access funding provided by University of Sopron. This article was made in frame of the project TKP2021-NKTA-43 which has been implemented with the support provided by the Ministry of Innovation and Technology of Hungary (successor: Ministry of Culture and Innovation of Hungary) from the National Research, Development and Innovation Fund, financed under the TKP2021-NKTA funding scheme.

Conflict of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.