European white oaks are represented in Switzerland by three widely distributed species: Q. petraea (Matt.) Liebl. (hereafter sometimes abbreviated as Pet), Q. pubescens Willd. (Pub), and Q. robur L. (Rob). Q. petraea and Q. robur are largely sympatric, though they are often locally separated according to their ecological niches. While Q. petraea is more drought-tolerant, Q. robur is often associated with deep, moist soils and extends into riparian hardwood forests. In turn, Q. pubescens, a sub-Mediterranean species, grows within its Swiss range mainly in the low-elevation inner alpine area (Valais) and along the calcareous, south-exposed dry Jura slopes, and can often be found in mixed stands with Q. petraea.
Sampling and genotyping
From May to July 2013, we sampled 71 oak tree populations (max. 20 individuals per population) across their entire distribution range within Switzerland (Table S1). We aimed to collect all three species in every biogeographic region of Switzerland where they were present. Information about potential populations was retrieved from previously published studies (e.g., Mátyás et al. 2002; Mátyás and Sperisen 2001), the Swiss National Forest Inventory (Brändli 2010), and the forest soil database of the Swiss Federal Institute WSL (described in Walthert et al. 2013). The minimum distance between sampled trees was 20 m. For morphological analyses, we collected, if possible, one sunlit twig with extendable loppers and used a herbarium press to dry the leaves. For the genetic analyses, we dried four to five young leaves per tree on silica gel.
Per tree, DNA from 25 mg dried leaf tissue was extracted by LGC Genomics (Berlin, Germany) with a KingFisher 96 (Thermo Scientific, Waltham, USA) using the sbeadex maxi plant kit optimized for oak tree leaves (LGC Genomics). We genotyped all individuals with the eight nSSR markers of multiplex kit 2 from Guichoux et al. (2011) as described in the Appendix. Null allele frequencies were calculated with Genepop 4.2.2 (Rousset 2008). No marker exhibited substantially increased null allele frequencies across all populations; estimated null allele frequencies exceeded 10 % in only 23 of 568 marker-by-population combinations. To convert the genetic data (alleles) for use in multivariate analyses, we used GenAlEx 6.5 (Peakall and Smouse 2012) to perform a principal coordinates analysis (PCoA) on pairwise co-dominant genotypic distances (Smouse and Peakall 1999). The resulting six principal coordinates (accounting for 100 % of the variation) were later used as genetic parameters (PCo1 to PCo6) in the multivariate analyses.
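The pairwise co-dominant genotypic distance that feeds the PCoA can be illustrated with a minimal sketch. It assumes that, at a single diploid locus, the squared distance equals half the sum of squared differences in allele counts, and that distances are summed over loci. The function names and genotypes are hypothetical. Missing data are ignored, and this is not the GenAlEx implementation.

```python
from collections import Counter


def locus_distance(g1, g2):
    """Squared codominant genotypic distance at one diploid locus.

    g1 and g2 are pairs of allele codes, e.g. (152, 156).
    """
    c1, c2 = Counter(g1), Counter(g2)
    alleles = set(c1) | set(c2)
    # Half the sum of squared differences in allele counts.
    return sum((c1[a] - c2[a]) ** 2 for a in alleles) / 2


def genotypic_distance(ind1, ind2):
    """Sum the single-locus distances over loci (assumes no missing data)."""
    return sum(locus_distance(a, b) for a, b in zip(ind1, ind2))


# Hypothetical genotypes at two loci (allele sizes are made up).
tree_a = [(152, 152), (200, 204)]
tree_b = [(152, 156), (208, 212)]
print(genotypic_distance(tree_a, tree_b))  # 1 (locus 1) + 2 (locus 2) = 3.0
```

Under this definition, two different homozygotes are separated by 4, a homozygote and a heterozygote sharing one allele by 1, and two heterozygotes sharing no allele by 2.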
Per tree, we analyzed the morphology of three intact leaves. In total, we measured nine parameters. Eight of them were measured as described and denoted in Kremer et al. (2002): number of lobes (NL), number of intercalary veins (NV), basal shape of the lamina (BS), lamina length (LL), petiole length (PL), lobe width (LW), sinus width at the first lobe (SW), and length of the lamina at the widest lobe of the leaf (widest point, WP). Additionally, analogous to SW, we measured the width at the sinus above the widest lobe (SWW).
Since many of the characteristics described above depend on leaf size, we calculated seven relative parameters from the measured ones (two of which were transformed to meet the assumptions of the methods applied), similar to those described and denoted in Kremer et al. (2002): lamina shape or obversity (OB = 100 * WP/LL), petiole ratio (PR = 100 * PL/(LL + PL)), lobe depth ratio at first lobe (LDR = 100 * (LW − SW)/LW), lobe depth ratio at widest lobe (LDRW = 100 * (LW − SWW)/LW), square root-transformed percentage of venation (RPV = SQRT(100 * NV/NL)), lobe width ratio (LWR = 100 * LW/LL), and natural log of lobe number ratio (LLNR = ln(NL/LL)), i.e., the number of lobes relative to the length of the lamina.
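The seven relative parameters follow directly from the formulas above. A minimal sketch for a single leaf is given below. The function name and the example measurements are hypothetical, and all lengths are assumed to share the same unit.

```python
import math


def relative_parameters(NL, NV, LL, PL, LW, SW, SWW, WP):
    """Return the seven size-independent leaf parameters for one leaf."""
    return {
        "OB": 100 * WP / LL,               # lamina shape (obversity)
        "PR": 100 * PL / (LL + PL),        # petiole ratio
        "LDR": 100 * (LW - SW) / LW,       # lobe depth ratio at first lobe
        "LDRW": 100 * (LW - SWW) / LW,     # lobe depth ratio at widest lobe
        "RPV": math.sqrt(100 * NV / NL),   # square root of percentage venation
        "LWR": 100 * LW / LL,              # lobe width ratio
        "LLNR": math.log(NL / LL),         # natural log of lobe number ratio
    }


# Made-up measurements for one leaf.
leaf = dict(NL=10, NV=2, LL=100, PL=20, LW=30, SW=15, SWW=12, WP=60)
for name, value in relative_parameters(**leaf).items():
    print(f"{name}: {value:.2f}")
```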
Moreover, all leaves were inspected for the presence or absence of the following five hair types (for examples, see Fortini et al. 2015; Kissling 1977) using a stereo lens: stellate hair on the lamina (LS), clustered (fasciculate) hair on the lamina (LC), intermediate (between stellate and clustered) hair on the lamina (LI), stellate hair on the vein (VS), and clustered hair on the vein (VC). Single hairs were ignored.
For the multivariate analyses, we finally used (a) the seven relative parameters (OB, PR, LDR, LDRW, RPV, LWR, LLNR) and basal shape (BS), all averaged per tree; (b) the five hair parameters (LS, LC, LI, VS, VC) as presence/absence data per tree (presence = present on at least one leaf of the tree); and (c) the scores of the six principal coordinates (PCo1 to PCo6) of the PCoA. For all analyses, we used only the samples with complete morphological measurements and more than five successfully scored nSSR loci (only ten samples yielded fewer than five loci), resulting in a total of 1369 samples (Table S1).
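The per-tree aggregation described in (a) and (b) can be sketched as follows: continuous parameters are averaged over the leaves of a tree, and a hair type is scored as present if it occurs on at least one leaf. The function name, record layout, and values are hypothetical.

```python
from statistics import mean


def summarize_tree(leaves, continuous, hairs):
    """Collapse the leaf records of one tree into a single record."""
    summary = {name: mean(leaf[name] for leaf in leaves) for name in continuous}
    # A hair type counts as present if any leaf of the tree shows it.
    summary.update({name: any(leaf[name] for leaf in leaves) for name in hairs})
    return summary


# Three made-up leaves of one tree, with a subset of the parameters.
leaves = [
    {"OB": 58.0, "PR": 4.0, "LS": False, "LC": False},
    {"OB": 62.0, "PR": 5.0, "LS": True, "LC": False},
    {"OB": 60.0, "PR": 6.0, "LS": False, "LC": False},
]
tree = summarize_tree(leaves, continuous=["OB", "PR"], hairs=["LS", "LC"])
print(tree)
```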
All multivariate analyses were run in R 3.2.2 (R Development Core Team 2015). We applied two different approaches: a factorial analysis of mixed data (FAMD) followed by a hierarchical clustering on principal components (HCPC), both using the FactoMineR package (Lê et al. 2008), and a linear discriminant analysis (LDA) using MASS (Venables and Ripley 2002). Both approaches were run on three different datasets: morphological variables only, genetic variables only, and all variables.
FAMD is a principal component method for exploring data that comprise both continuous and categorical variables (Pagès and Camiz 2008). FAMD does not require a prior division into groups. In our case, the continuous variables were the eight morphological variables and the scores of the six principal coordinates derived from the genetic data as described above, and the categorical variables were the five hair variables. After the FAMD, HCPC (Husson et al. 2010) was used to divide the data into clusters, retaining the first ten dimensions of the FAMD. The actual number of clusters is suggested by HCPC based on inertia gain. HCPC was performed with a minimum number of clusters of three, as suggested by the authors. Note that the FAMD with only genetic variables does not contain categorical variables and thus represents a principal component analysis (PCA).
LDA is a multivariate analysis that creates discriminant functions that best describe a priori defined groups using a training dataset. Subsequently, it uses these discriminant functions to assign individuals of a separate or larger dataset. In our analysis, we used hair types (which were shown to be important in the FAMD and HCPC, see Section 3) to define the groups (species) in the training dataset. We ignored the presence of intermediate hairs and hairs on the vein and used the samples having only one lamina hair type (stellate lamina hair for Pet, clustered lamina hair for Pub, no hairs on the lamina for Rob, as described in Kissling 1977) as training data. After performing the LDA with standardized and centered values for all parameters (except the hair types used for defining the groups), the linear discriminant functions were used to predict group membership for the entire dataset. In contrast to the FAMD, LDA calculates the posterior probability that each tree belongs to each of the groups. We used two different rules to assign trees to a group. First, we applied the “majority rule”, where each tree is assigned to the taxon with the highest assignment probability. Second, we used a probability threshold of 0.8. Trees above this threshold were assigned to the respective taxon. All others were tagged as “unclassified”; they represent intermediate, possibly hybrid trees.
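The two bookkeeping steps around the LDA, defining training groups from lamina hair types and turning posterior probabilities into assignments, can be sketched as follows. The function names and the example probabilities are hypothetical, and the LDA itself is not reproduced. The sketch assumes that "above the threshold" means strictly greater than 0.8.

```python
def training_label(LS, LC):
    """Taxon implied by lamina hairs alone, or None if unusable for training."""
    if LS and LC:
        return None      # both lamina hair types: excluded from training
    if LS:
        return "Pet"     # stellate lamina hair only
    if LC:
        return "Pub"     # clustered lamina hair only
    return "Rob"         # neither stellate nor clustered lamina hair


def assign(posteriors, threshold=None):
    """Assign a tree from its posterior probabilities (dict: taxon -> prob).

    threshold=None applies the majority rule. Otherwise the best taxon must
    exceed the threshold (assumed strict), else the tree is "unclassified".
    """
    taxon = max(posteriors, key=posteriors.get)
    if threshold is not None and posteriors[taxon] <= threshold:
        return "unclassified"
    return taxon


# Made-up posterior probabilities for one tree.
tree = {"Pet": 0.55, "Pub": 0.40, "Rob": 0.05}
print(assign(tree))                 # majority rule
print(assign(tree, threshold=0.8))  # 0.8 threshold
```

With these values, the tree is assigned to Pet under the majority rule but remains unclassified under the 0.8 threshold.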
Model-based clustering using genetic data
We used Structure 2.3.3 (Pritchard et al. 2000) and Structure Harvester 0.6.93 (Earl and vonHoldt 2012) to group individuals into genetic clusters based on genetic data only. For each value of K (number of clusters) from 1 to 20, we ran ten simulations with different seeds, each with 1,000,000 iterations after a burn-in period of 100,000 iterations, using the admixture model, correlated allele frequencies, and no prior location information. Assignment probabilities to the clusters were calculated with Clumpp 1.1.1 (Jakobsson and Rosenberg 2007) using the Greedy algorithm. We then examined the assignment probabilities for K = 3 (i.e., the number of species involved) in more detail. Every tree was assigned to a cluster according to the majority rule, and the clusters were assigned to taxa based on hair types.
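The text does not state exactly how clusters were matched to taxa from hair types. One plausible rule, shown here purely as an assumption, is to give each cluster the hair-based taxon label that is most frequent among the trees assigned to it by the majority rule. The function names, membership coefficients, and labels below are hypothetical.

```python
from collections import Counter


def majority_cluster(q_row):
    """Index of the cluster with the highest membership coefficient."""
    return max(range(len(q_row)), key=lambda k: q_row[k])


def label_clusters(q_matrix, hair_labels):
    """Map each cluster index to its most frequent hair-based taxon label.

    Assumed rule, not taken from the source. Trees whose hair-based label
    is None (no unambiguous hair type) do not vote.
    """
    votes = {}
    for q_row, label in zip(q_matrix, hair_labels):
        if label is None:
            continue
        votes.setdefault(majority_cluster(q_row), Counter())[label] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in votes.items()}


# Made-up membership coefficients for five trees at K = 3.
q_matrix = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.20, 0.10, 0.70],
    [0.40, 0.35, 0.25],
]
hair_labels = ["Pet", "Pet", "Pub", "Rob", None]
print(label_clusters(q_matrix, hair_labels))
```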
Comparison among different variable sets and approaches
For both the FAMD/HCPC and the LDA (majority rule), we tested which variable set (morphological variables only, genetic variables only, all variables) best separated the clusters. We used the separation/cohesion ratio (described in, e.g., Janert 2010) as the quality criterion. A high separation/cohesion ratio indicates that the distance between clusters is large and/or the variation within clusters is small, leading to a good separation of the clusters. Separation (among clusters) and cohesion (within clusters) were calculated after Davies and Bouldin (1979) using distances calculated from the first two dimensions of the multivariate analyses. Separation was calculated as the average distance between the cluster centroids. Cohesion was calculated as the average distance (weighted by cluster size) of each point to the centroid of its cluster. Finally, we assessed how well the six multivariate approaches and the Structure approach corresponded. To do so, we compared the clustering obtained from the software Structure (K = 3) with the grouping/clustering of the FAMD/HCPC and LDA. For both Structure and LDA, cluster assignment was done according to the majority rule.
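The separation/cohesion ratio can be illustrated with a minimal sketch. It assumes that separation is the mean Euclidean distance over all pairs of cluster centroids, and that cohesion is the mean distance of all points to their own centroid, which weights each cluster by its size. The function names and scores are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def separation_cohesion_ratio(clusters):
    """clusters: dict mapping cluster name -> list of (dim1, dim2) scores."""
    centroids = {name: centroid(pts) for name, pts in clusters.items()}
    # Separation: mean distance over all pairs of centroids (assumed reading).
    pairs = list(combinations(centroids.values(), 2))
    separation = sum(dist(a, b) for a, b in pairs) / len(pairs)
    # Cohesion: averaging over all points weights each cluster by its size.
    distances = [
        dist(p, centroids[name]) for name, pts in clusters.items() for p in pts
    ]
    cohesion = sum(distances) / len(distances)
    return separation / cohesion


# Made-up scores on the first two dimensions for three clusters.
scores = {
    "Pet": [(0.0, 0.0), (2.0, 0.0)],
    "Pub": [(10.0, 0.0), (12.0, 0.0)],
    "Rob": [(5.0, 8.0), (7.0, 8.0)],
}
print(round(separation_cohesion_ratio(scores), 2))
```

A larger value indicates clusters that lie farther apart relative to their internal spread.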