Comparison of structure analyses and core collections for the management of walnut genetic resources

The preservation of maximum diversity within the smallest number of accessions is one of the challenges of germplasm management, and the assessment of the population structure, the relationships between the accessions, and the construction of core collections are the key steps. The choice of suitable molecular markers is often the starting point. In this study, we analyzed a part of the INRAE walnut germplasm collection, which is highly diverse and unique in Europe, consisting of 150 accessions from American, European, and Asian continents. Based on genotypic data already available, using 13 SSRs or 364,275 SNPs, we showed that the first level of structure is detected equally. We also highlighted empirically that approximately 100 SNPs are needed to obtain similar clustering to 13 SSRs in Principal Coordinate Analysis (PCoA). We constructed eight core collections following two strategies (percentage of total allelic diversity or number of accessions) and two construction methods based on different algorithms (“maximum length subtree” and “entry-to-nearest-entry” methods). We showed that core collections based on few SSR markers are able to capture at least 99.5% of the SNP allelic diversity, irrespective of the construction method used. Then, core collections based on each marker type are highly similar, using both construction methods. All these steps are crucial to identify the suitable tools and methods to improve plant genetic resources management.


Introduction
In the context of climate change and human population growth, plant genetic resources (PGR) are of the utmost importance. They constitute the foundations of agricultural sustainability and global food safety (Ulukan 2011;Ogwu et al. 2014), but they are facing conservation issues. Regarding ex situ management of PGR, this mainly involves seeds and some collections are at risk, due in part to the fact that they are generally underfunded and that evaluation and characterization are often imprecise or inadequate (Fu 2017). In that respect, careful PGR management is crucial from storage to use (Maxted et al. 1997). For clonally propagated perennial species, such as fruit and nut tree species, PGR are generally conserved in ex situ orchards as grafted cultivars, which has its pros and cons: the main advantages are that they can be stored under the climate conditions of their intended use, and can be evaluated during storage; but on the other hand, they require a lot of space and the cost of conservation is significant (Hammer and Teklu 2008).
Nowadays, molecular tools contribute to PGR management (Wambugu et al. 2018), since they can help to find genetically close or synonymous accessions and to create "core collections" which will contain maximum genetic diversity within the smallest number of accessions, leading to a reduction in conservation costs, among other things. Simple Sequence Repeats (SSRs) were frequently used because of their high polymorphism, but the development of high-throughput genomics tools has led to the use of Single-Nucleotide Polymorphisms (SNPs). In the Persian walnut (Juglans regia L.), a species which is widely disseminated and grown in many temperate regions, more than 20 publications mention the use of SSRs (Bernard et al. 2018a). This type of markers was Communicated by M. Troggio Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11295-020-01469-5) contains supplementary material, which is available to authorized users. particularly used to decipher the genetic diversity and the structure of numerous walnut accessions at a continental scale (Ebrahimi et al. 2016). SSRs were also used to perform a landscape genetic analysis across the Asian range of Persian walnut, species native to the foothills of the Himalayas (Pollegioni et al. 2014). The authors showed that reservoirs of high genetic diversity still exist, confirming the importance of the conservation of walnut genetic resources in this area. Recently, a highdensity Axiom™ J. regia 700K SNP genotyping array was developed and validated, initiating a novel genomic era in walnut (Marrano et al. 2019a). This genomics tool was used for instance to investigate the genome-wide population structure in Iranian walnut populations (Arab et al. 2019), and to perform genome-wide association studies (Bernard et al. 2020;Marrano et al. 2019b).
As a result, it is fair to wonder how to conduct the tasks related to germplasm management, such as core collections construction, and the choice of the type of marker to be used is one of the first questions. Neutral SSR loci, due to slippage during replication, usually mutate much more frequently than SNP loci, leading to population-specific alleles that are useful for revealing population structure (Tsykun et al. 2017). On the other hand, since they are usually developed in small numbers for one species, they therefore may not reflect genome-wide genetic diversity contrary to SNP loci that are much more frequent in the genome of most species (Ljungqvist et al. 2010). These two types of marker bring different views of the structure, and the merits of each are listed in Guichoux et al. (2011). SSRs and SNPs were compared in order to assess population structure and relatedness in short-lived species such as rice (Singh et al. 2013), maize (Hamblin et al. 2007;Van Inghelandt et al. 2010;Yang et al. 2011), sunflower (Filippi et al. 2015), bean (Müller et al. 2015), and cowpea (Desalegne et al. 2017). These comparisons are rarer for perennials (grapes (Emanuelli et al. 2013) and jujube (Chen et al. 2017)). A recent work in an apple gene bank compared population structure and relatedness using 15 SSRs and more than 15,000 SNPs. However, studies on core collections construction are lacking.
The aim of this paper is to identify the type of marker and the core collections construction methods more relevant for routine walnut germplasm management.

Plant materials and genotypic data
The plant material consists in 150 unique accessions of Juglans regia from American, European, and Asian continents. The trees are maintained at the Prunus and Juglans Genetic Resources Center located in the Fruit Experimental Unit of INRAE in Toulenne (latitude 44°34′ 37.442″ N-longitude 0°16′ 51.48″ W), near Bordeaux, France (S1 Table). The 150 accessions were selected following two steps. (i) We created a GWAS panel of 170 accessions (Bernard et al. 2020) from SSR and SNP genotyping were described and made publicly available (Bernard et al. 2018b;Bernard et al. 2020).

Structure analyses
Principal Coordinate Analysis (PCoA) was used to determine the structure patterns among the 150 accessions (Fig. 1). Dissimilarities, based on allelic data, were calculated with 10,000 bootstraps, and transformed into Euclidean distances using a power transformation of 0.5. PCoA was performed using the DARwin 6.0.14 software (Perrier and Jacquemoud-Collet 2006), supplemented by the scatterplot3d R package for 3D visualization.
As linked SNPs can account for too much in the population structure variance, particularly in linkage disequilibrium (LD) regions (Price et al. 2008), a pruned subset of SNPs was also used for the PCoA. This filtering was completed using the PLINK 1.9 software, keeping only the SNPs with the higher minor allele frequency, and based on a threshold of r 2 = 0.2 (command-indep-pairwise 50 5 0.2). A subset of 100 SNPs, a number similar to that of SSR alleles, randomly selected, was also tested using the PLINK 1.9 software (command-thincount 100) to compare PCoA results.
The genetic structure of our panel was also investigated using two types of analyses depending on the markers. Bayesian model-based analysis using the STRUCTURE 2.3.4 software was implemented for the SSR markers (Pritchard et al. 2000), following the method of our previous work (Bernard et al. 2018b). The ΔK method (Evanno et al. 2005), implemented in STRUCTURE harvester (Earl and von Holdt 2012), was used to determine the most likely K.
Using SNP markers, a sparse non-negative matrix factorization algorithm was implemented using the sNMF 2.0 software, available as a function of the LEA R package (Frichot and François 2015). This software has a fast and efficient program for estimating individual admixture coefficients from large genomic datasets, and produces results very close to Bayesian clustering programs such as STRUCTURE (Frichot et al. 2014). The choice of the best number of clusters (K) is based on a cross-entropy criterion implemented in the LEA R package. For SSR and SNP markers, thresholds of 0.8 and 0.7 for admixture coefficients, respectively, were chosen to consider one accession as admixed.
Next, the genetic relationships between the 150 accessions were also assessed with the Neighbor-Joining method (Saitou and Nei 1987) using the DARwin 6.0.14 software, for both the SSR and SNP datasets. The Unweighted Neighbor-Joining option was used to build the trees.

Core collections construction
Several core collections based on both marker types were constructed using two methods and two selection thresholds ( Fig. 1). (i) The two methods: we used the "maximum length sub tree" function (Perrier et al. 2003) of the DARwin 6.0.14 software, which looks for a subset of accessions minimizing the redundancy between them, and limiting where possible the loss of diversity (diversity here is expressed by the tree built), and the "entry-to-nearest-entry" method (Odong et al. 2013), implemented in the Core Hunter 3 software (De Beukelaer et al. 2018), which looks for a subset of accessions as different as possible from each other, avoiding selecting a few clusters of similar accessions at the extreme ends of the distribution. (ii) The two selection thresholds: using the two methods mentioned above, we first used the SSR dataset to capture at least 80% or 90% of total allelic diversity, and the number of alleles captured in one SSR-based core collection was also estimated in the SNP dataset (selection threshold based on percentage of total allelic diversity). Secondly, we used a sampling intensity of 33% (n = 50/150) to observe the differences in the selected accessions, using these two methods (selection threshold based on a number of selected accessions).

Results
Comparison of the first level of structure using SSR and SNP markers Using SSR markers, the higher drop of ΔK is for K = 2, followed by a rise for K = 6, and very similar results were obtained using SNP markers, since the higher drop of the cross-entropy criterion is for K = 2, with a curve slope starting for K = 6 ( Fig. 2). To compare the results using SSR and SNP markers, we focused on the first level of structure for K = 2 (S2 Table). The individual admixture coefficients are shown for each of the 150 accessions (Fig. 3). For both marker types, the clustering is linked to the geographic origin of the accessions, as we already described (Bernard et al. 2018b;Bernard et al. 2020). Group A contains the accessions from Eastern Europe and Asia (named "E"), and group B contains the accessions from Western Europe and America (named "W").
Using SSR markers, with a threshold of 0.8 for the individual admixture coefficient, we found 17 admixed accessions, so 88.7% of the accessions were assigned to a group (Table 1). They include 11 French and Californian modern hybrids and 6 accessions from Eastern Europe, Israel, and Central Asia, confirming our previous SSR analysis.
Using SNP markers and a threshold of 0.7, we found 24 admixed accessions (84% of assignment), including 9 of the a b Fig. 2 The most likely K subpopulations. K was evaluated with (a) the ΔK method using SSRs, and (b) the cross-entropy criterion using SNPs accessions, 7 are now clustered into group B, and "UK 215AG12" is now in group A. In addition, 14 accessions from group A, and 1 from group B, based on SSR clustering, are found admixed using SNP markers (Fig. 4). Therefore, only 23/150 accessions (15%) are differently clustered. At all events, we did not find any group exchange from A to B, or from B to A, when comparing the clustering based on SSR or on SNP markers.
When using a threshold of 0.8 for the SNPs, the percentage of population assignment for K = 2 is low (70.7%) but is still high for a threshold of 0.7 (84.0%) ( Table 1), showing that 0.7 and 0.8 are more appropriate thresholds for SNPs and SSRs, respectively. In addition, we ran a Spearman rank correlation test for K = 2 and found that the clustering between SSR and SNP markers is highly correlated, up to 84%.
The PCoAs constructed in 2D and 3D show the clustering of the 150 accessions following the results obtained from K = 2 (Fig. 5). For K = 2, the PCoA results are in agreement with the structure based on SSR or SNP markers, since the scatterplots for groups A and B are well-defined by the first principal component. The admixed accessions are positioned mainly between the two groups. For both methods, the three Manregian walnuts ("Chase C7", "Wepster W2," and "Adams 10") are found to be genetically diverse. Regarding the percentage of explained variation, they are also comparable using both types of markers. The first three axes (x, y, and z) explain 21.86% of the cumulative variation for SSRs, and 14.91% for SNPs.
However, we found some differences between the two types of marker. Using SSR markers, the scatterplot corresponding to group A is more extensive, whereas group B is more scattered using SNP markers. Additionally, this group is split in two, showing the French accessions on one side and the Californian ones on the other.
The Neighbor-Joining method implemented in DARwin 6.0.14 was used to construct grouping trees with the 150 accessions (Fig. 6). The main branching groups of the trees Europe and Asia ("E"), whereas group B in yellow contains the accessions from Western Europe and America ("W") and hybrids from INRAE Table 1 Percentage of population assignment to a group from K = 2 to K = 6, using SSR and SNP markers obtained with both markers are in agreement with the structure results (K = 2), since they are mainly defined by groups A and B. The two accessions "Jin Long 1" from China and "PI 15 95 68" from Afghanistan are among the longest length branches. However, few differences were detected between the structure and the tree, with accessions from group A found in the branching group mainly corresponding to group B: 9 accessions using SSR markers, and 4 accessions using SNP markers.
We then inferred the individual admixture coefficients of the 150 accessions from K = 3 to K = 6 for both markers (S2 Table). Bar plots are given for information in S1-S4 Figs.

Comparison of three sets of SNPs for PCoA assessment
In addition, for PCoA assessment, we compared the entire set of 364,275 SNPs with a subset of SNPs filtered for LD, with a Fig. 4 SNP-based clustering results, compared with SSR-based clustering results. For a threshold of 0.7 for admixture, SNPs clustered 14 accessions found in group A, and one found in group B, compared with SSRs, into the admixed group. Conversely, SNPs clustered one and seven accessions found to be admixed, compared with SSRs, into groups A and B, respectively. There was no clustering exchange between groups A and B, when comparing both markers threshold of r 2 = 0.2 (Fig. 7). We then retained 24,422 SNPs, or 6.7% of the entire set. Interestingly, the results found for both datasets are strongly similar, with the scatterplots well-distinguished, according to the K = 2 results (Fig. 7). We still distinguish the French accessions from the Californian accessions within group B. The main difference is that the variance is better explained by the first three axes using the LD-pruned set (18.90% vs. 14.91%). By comparing the entire set with a random subset of 100 SNPs, a range comparable with the total number of SSR alleles, we found that the scatterplots are less well-defined, but still in agreement with the K = 2 results (Fig. 7). In this case, we cannot distinguish the French accessions from the Californian accessions within group B.

Comparison of core collections using SSR and SNP markers
In light of the results obtained concerning structure analyses (13 SSRs give similar results compared to 364,275 SNPs), we decided to construct first four core collections based on SSR markers. We created them with both core collections construction methods and two thresholds, to capture at least 80% or 90% of total allelic diversity. To capture at least 80% of total SSR allelic diversity, we need 16 accessions using the method based on the "maximum length sub tree" function of DARwin 6.0.14, and 15 accessions using the "entry-to-nearest-entry" method of Core Hunter 3 (Table 2). Next, to capture at least 90% of allelic diversity, we need 27 and 32 accessions for each construction method, respectively. The accessions Fig. 5 Principal Coordinate Analysis scatterplots. PCoAs were constructed in 3D using (a) SSR, and (b) SNP markers, and in 2D using (c) SSR, and (d) SNP markers. The 150 accessions are colored following the K = 2 results: group A in red, group B in yellow, and admixed in grey belong mainly to group A, from Eastern Europe and Asia, known to be more diverse: 11/16 and 12/15 respectively for 80%, and 20/27 and 21/32 respectively for 90%. They include, for instance, Iranian accessions, the Indian "Sopore" and several accessions from the Botanical Garden of Kiev. Regarding the French accessions, few landraces such as "Corne," "Marbot" and "Grandjean1" were kept only when capturing 90% of allelic diversity using "Core Hunter 3".
Moreover, the consistency of the results between both percentages of total captured allelic diversity was checked for the two core collection construction methods. Using Darwin 6.0.14, the 16 accessions selected to capture 80% of allelic diversity were found among the 27 selected to capture 90% (Table 2). However, using Core Hunter 3, 2/15 accessions selected to capture 80% were not found among the 32 selected to capture 90%. Hence, for 80%, 9 accessions (among 16 or 15) are in common between the two methods, and 19 accessions for 90% (among 27 or 32). We then estimated the number of alleles captured in the SNP dataset for each core collection. For the two SSR-based core collections capturing at least 80% of total allelic diversity, they also capture more than 99.5% of total SNP allelic diversity. For 90%, they reach more than 99.9%.
Next, we constructed four additional core collections using the same sampling intensity threshold of 33% to select 50/150 accessions, with both methods and both markers. Results show that 34/50 accessions are in common between the SSR and SNP markers using Darwin 6.0.14, and 32/50 accessions using Core Hunter 3 (Table 3). The accessions belong mainly to group A, from Eastern Europe and Asia, known to be more diverse (between 30 and 31/50 using SSRs, and between 22 and 23/50 using SNPs). They both include the Iranian accessions, the Chinese "Jin Long 1," "PI 15 95 68" from Afghanistan, the Indian "Sopore," the Bulgarian "Izvor 10" and "Plovdivski," and several accessions from the Botanical Garden of Kiev.
We also checked the consistency of the results between the two core collection construction methods for both markers. Using SSR markers, 39/50 accessions are in common between the two methods, and 45/50 accessions using SNP markers. Regarding the number of alleles, for SSR markers, we captured 97.1% and 94.3% of the 105 total alleles found within the entire collection, using the "maximum length subtree" and the "entry-to-nearest entry" methods, respectively. Using SNP markers, we captured more than 99.9% of the 728,550 total alleles found with the two methods (Table 3).

Discussion
The use of either SSR or SNP markers shows comparable results for the collection structure By genotyping the panel of 150 J. regia accessions using 13 SSRs and more than 300,000 SNPs, we obtain similar results. Both types of marker showed a first level of structure for K = 2, with no exchange of accession between the main groups (A and B), which are related to the geographical origin of the accessions. Similar findings were reported in Ebrahimi et al. (2016), with K = 2 mainly corresponding to a group from Asia, and a group of European and North Africa. The exchanges observed between both types of marker only concern Fig. 6 Neighbor-Joining trees. Trees were constructed using (a) SSR, and (b) SNP markers. The 150 accessions are colored following the K = 2 results: group A in red, group B in yellow, and admixed in grey accessions that switched from one group to the admixed cluster, or vice versa. From K = 3 to K = 6, the results remain highly comparable, with the highlighting of substructures linked, for instance, to "Payne" pedigree for the Californian Fig. 7 Comparison of SNP sets for Principal Coordinate Analysis. PCoA were constructed using (a) the entire set of 364,275 SNPs, (b) the LD-pruned subset of 24,422 SNPs, and (c) the random set of 100 SNPs modern hybrids, or to geographical area (South-West vs. South-Est) for French landraces. When comparing structure results for K = 2 with PCoAs, and with grouping trees, the results are still consistent. More precisely, we found as the most diverse accessions, using both markers, the three Manregian walnuts originating from seed collected in northeastern China and the two accessions "PI 15 95 68" and "Jin Long 1," both coming from Asia (Bernard et al. 2018b). Table 3 Construction of the core collections using both marker types, SSRs and SNPs, and two different methods, to keep n = 50 accessions   Number of captured alleles: 728,420/728,550 (>99.9%) a Accessions (K analyses name) indicated in grey are in common between SSR and SNP markers, for both core collection construction methods b Admixture threshold of 0.8 using SSRs, and 0.7 using SNPs a Accessions (K analyses name) indicated in gray are in common between SSR and SNP markers, for both core collection construction methods b Admixture threshold of 0.8 using SSRs, and 0.7 using SNPs  a Accessions (K analyses name) indicated in grey are in common between 80% and 90% of total allelic diversity thresholds, for both core collection construction methods b Admixture threshold of 0.8 a Accessions (K analyses name) indicated in gray are in common between 80% and 90% of total allelic diversity thresholds, for both core collection construction methods b Admixture threshold of 0.8

Markers and construction methods must be compatible with the preservation of phenotypic variability
In light of our results, we found that SSR-based core collections could capture more than 99.5% of SNP total allelic diversity. SSR markers, when properly selected using the literature, may be the markers of choice to achieve this goal. Using SSRs, our analyses preferentially selected accessions from East Europe and Asia, as expected, because of their global higher genetic diversity, and helped us to understand that French landraces have a moderate level of genetic diversity. Given that French landraces represent 20% of the entire plant material panel, these findings confirm that their diversity is moderate. However, here it is important to note that the accessions selected in one core collection, even when slightly different, are not always the same in the repeated runs in Core Hunter 3, contrary to in Darwin 6.0.14. This feature, algorithm-dependent and more frequent when the panel has redundancy (Vargas et al. 2016), makes the comparison between construction methods more difficult.
In parallel with the preservation of allelic diversity, it is also necessary to take phenotypic variability into account. The INRAE walnut germplasm collection contains some accessions with unusual traits, such as weeping branches, laciniate leaves, or purple foliage, which may be used for ornamental purposes. Interestingly, the accession with weeping branches "Pleureur" is in the two core collections capturing 90% of SSR total allelic diversity, and the accession with laciniate leaves is in the two core collections constructed using Core Hunter 3. Based on chronological phenotypic data available (Bernard et al. 2019) and new data acquired, we also explored whether the core collections contain a high or low range of variability of some important traits. For the budbreak date for instance, crucial for climate change adaptation, the ten earlier accessions are "Early Ehrhardt," "Mire," "Payne," "Serr," "Kfar Hanania," "IR 60-1," "Sopore," "Z 53," "Ashley," and "Lu Guang," with a range of budbreak date from 65 to 75 Julian days. Five among the ten are found on the core collection "90% -Darwin." Conversely, none of the ten later accessions ("Fertignac," "Le Bordelais," "St Jean n°1," "Lalande," "Candelou," "Maribor," "Semence Comité Dordogne," "Ronde de Montignac," "Culplat," and "Romaine"), with a range of budbreak date from 110 to 122 Julian days, is found in the four core collections. Curators are well aware of the importance of phenotype knowledge for an effective PGR management. Currently, new ways to construct core collections are being evaluated to maximize the frequency of minor alleles and phenotypes (Vargas et al. 2016). An interesting application of such "custom methods" in carrot is well-described in Corak et al. (2019). The authors suggest that a core subset based on "custom methods," e.g., core balancing functional and genetic diversity based on high-density genomic data and estimated breeding values, would be directly useful for breeders and more efficient for breeding programs.

Conclusion
In our comparison using 150 J. regia accessions, both SSR and SNP markers were highly efficient for highlighting the INRAE walnut collection structure, even though the number of SSRs is much lower than the number of SNPs. Moreover, similar results were obtained for core collections, irrespective of the construction method and the marker type used. The strategy for core collection construction can be based on several criteria and must meet germplasm management requirements (genetic diversity) and constraints (cost and surface of conservation). The lower the number of selected accessions, the higher the probability of discarding a genotype with traits of interest. Thus, this highlights the importance of phenotypic evaluation in germplasm management.
It is therefore important to consider the task of germplasm management when choosing the most appropriate marker. In general, few SSR markers are suitable and sufficient for obtaining a global idea of the structure and the genetic diversity of a germplasm, and for constructing a core collection.
Bernard, Teresa Barreneche, and Elisabeth Dirlewanger interpreted the results. Anthony Bernard wrote the manuscript. Anthony Bernard, Teresa Barreneche, Armel Donkpegan, and Elisabeth Dirlewanger contributed in the discussion of the results. All authors reviewed the manuscript. All authors read and approved the final version of the manuscript.
Funding This work has been mainly funded by the "Région Nouvelle-Aquitaine" in the project "INNOV'noyer," coordinated by the CTIFL, and in partnership with the INRAE of Bordeaux. This work has been also funded by the "Cifre" convention number 2016/1558 of ANRT (Agence Nationale de la Recherche et de la Technologie). INRAE funded part of the PhD grant of Anthony Bernard.

Compliance with ethical standards
Competing interests The authors declare that they have no competing interests.
Ethics approval Not applicable.
Consent to participate Not applicable.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.