Predicting the geographic origin of Spanish Cedar (Cedrela odorata L.) based on DNA variation

The legality of wood products often depends on their origin, creating a need for forensic tools that verify claims of provenance for wood products. The neotropical tree species Cedrela odorata (Spanish cedar) is economically valuable for its wood and faces threats of overexploitation. We developed a 140 SNP assay for geographic localization of C. odorata specimens. Target capture and short-read sequencing of 46 C. odorata specimens allowed us to identify 140 spatially informative SNPs that differentiate C. odorata specimens by latitude, temperature, and precipitation. We assessed the broad applicability of these SNPs on 356 specimens from eight Cedrela species, three tissue types, and a range of DNA mass inputs. Origin prediction error was evaluated with discrete and continuous spatial assignment methods focusing on C. odorata specimens. Discrete classification with random forests readily differentiated specimens originating in Central America versus South America (5.8% error), while uncertainty increased as specimens were divided into smaller regions. Continuous spatial prediction with SPASIBA showed a median prediction error of 188.7 km. Our results demonstrate that array SNPs and resulting genotypes accurately validate C. odorata geographic origin at the continental scale and show promise for country-level verification, but that finer-scale assignment likely requires denser spatial sampling. Our study underscores the important role of herbaria for developing genomic resources, and joins a growing list of studies that highlight the role of genomic tools for conservation of threatened species.


Introduction
Biodiversity loss is of global concern, and is due in part to deforestation and high consumer demand for wood and wood products (Nellemann 2012;Elias 2012;van Zonneveld et al. 2018). Forests of Central and South America (or "neotropical" forests) face the largest threat because they support the most terrestrial biodiversity, with an estimated 16,000 tree species contained within the Amazon rainforest alone (Pennington et al. 2015;Pennington and Lavin 2016;Dick and Pennington 2019). Thirty-five to 72% of wood sourced in the Amazon is thought to be acquired from illegal logging (Saunders and Reeve 2014), and illegal logging accounts for 50-90% of forestry activities across tropical forests globally (Hoare 2015;Sheikh et al. 2019). Laws are in place to protect economically valuable tree species from overexploitation and promote sustainable practices (e.g., U.S. Lacey Act [2008]; European Union timber regulation [2010]; Australian Illegal Logging Prohibition Act [2012]; Japanese Clean Wood Act [2017]), but these remain difficult to enforce because of the sheer scale of illegal logging, and the challenge of identifying protected species and their countries of origin, especially after wood is transported from the site of harvest, processed, and enters commercial markets (Dormontt et al. 2015(Dormontt et al. , 2020Wiedenhoeft et al. 2019).
Illegal logging affects many tree species, but highly valuable-often rare and endangered-species are common targets. Spanish cedar (Cedrela odorata L.; Meliaceae) and congeners are among the most valuable neotropical Electronic supplementary material The online version of this article (https ://doi.org/10.1007/s1059 2-020-01282 -6) contains supplementary material, which is available to authorized users. hardwoods, making them particularly vulnerable to illegal harvesting. In 2001, C. odorata was listed under the protections of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) Appendix III requiring validated documentation of species identity and source for both export and import documentation, protecting populations in Bolivia, Brazil, Colombia, Guatemala, and Peru (Ferriss 2014), and in 2019, C. odorata and all species in the genus Cedrela were elevated to CITES Appendix II, due to the similarity of sawn logs and processed wood across species (Gasson 2011). However, CITES protection does not entirely eliminate illegal harvesting. An investigation focused on logging of Spanish cedar (C. odorata) and mahogany (Swietenia macrophylla King) in Peru found 112 CITES export records of illegal wood listing origins that had been fabricated (Urrunaga et al. 2012). Urrunaga et al. (2012) provided evidence that illegal logging is systematic, common, and environmentally and economically damaging in Peru. In 2017, a snapshot of the scale of illegal logging in Peru was provided by the same team of investigators; a cargo ship from Peru, the Yaca Kallpa, was seized and contained nearly 10,000 m 3 of illegal wood (Conniff 2017;Bargent 2017). Despite the anticipated increase in global scrutiny for Cedrela wood and wood products regardless of geographic origin, tools for predicting the origin of C. odorata wood remain valuable to process seizures predating CITES Appendix II listing, and could be valuable for verifying supply chains and identifying responsible parties.
Forensic tools that can accurately predict the geographic origin of wood have potential to assist the enforcement of trade restrictions for protected species. Genetic approaches have been used to determine the origin and trade routes of protected species, including elephant ivory (Wasser et al. 2004(Wasser et al. , 2018, sturgeon and paddlefish caviar (Doukakis et al. 2012;Ogden et al. 2013), tigers (Linacre and Tobe 2008;Gupta et al. 2011;Kitpipit et al. 2012), birds (Abe et al. 2012;Coghlan et al. 2012;White et al. 2012), fishes (Zarraonaindia et al. 2012;Clemento et al. 2014), and plants (Ogden et al. 2008;Degen et al. 2013;Blanc-Jolivet et al. 2018;Paredes-Villanueva et al. 2019). Anatomical, chemical, genetic, and isotopic methods have all been applied to address questions of taxonomy and origin of wood (Dormontt et al. 2015); however, a single method does not typically address questions of both taxonomy and source. Cedrela odorata and closely allied species will likely require multiple techniques for validation of taxonomy and origin because its wood lacks the anatomical features required for discrimination among species (Gasson 2011), and variation in wood chemistry does not vary in a manner that is geographically predictive (Paredes-Villanueva et al. 2018). While methods for DNA extraction and recovery from wood are improving (Dumolin-Lapègue et al. 1999;Asif and Cannon 2005;Rachmayanti et al. 2006;Tnah et al. 2012;Jiao et al. 2012Jiao et al. , 2018Yu et al. 2017;Dormontt et al. 2020), genetic markers of short length, such as single nucleotide polymorphisms (SNPs), are increasingly being used in wildlife forensics because they are suitable for low concentration, degraded DNA extracts (Ogden et al. 2009). Here, we evaluate the power of SNPs to resolve geographic origin of C. odorata across much of its range in Central America and western South America.

Materials and methods
Development of the SNP genotyping array for this study first involved the sequencing of a design panel of C. odorata specimens. From SNP variants detected in a design panel of specimens, we selected 140 candidate SNPs for the genotyping array on the basis of geographic and environmental differentiation. DNAs from a screening panel of 376 specimens were genotyped with these 140 array SNPs, and we were able to assess genotyping efficiency for DNAs derived from eight Cedrela species, three tissue types, a range of mass inputs. We evaluated discrete and continuous spatial origin prediction methods on a group of specimens from the screening panel, representing C. odorata and closely allied taxa (referred to as C. odorata sensu lato). Methods for each of these steps are described below.

Design panel sequencing
We used hybridization-based target capture and massivelyparallel sequencing (Cronn et al. 2012;Heyduk et al. 2016) to identify SNPs from a panel of 46 C. odorata herbarium specimens (Appendix 1; Table S1; Fig. S1) from Peru and surrounding countries. Hybridization capture probes were designed based on gene models of a C. odorata individual originating in Mexico as described in Finch et al. (2019). Sequencing yield and depth were assessed using methods previously described (Finch 2018;Finch et al. 2019) and included in the Supporting Information for this article (Appendix 1; Table S2) (Finch 2019a, b). One Peruvian specimen (C. odorata 300; Table S1) was selected as the nuclear reference, and captured sequence reads were assembled de novo using SPAdes (v. 3.6.1;Bankevich et al. 2012). These enriched nuclear contigs were filtered to ensure that assembled sequences contained the target probe sequences, and did not contain sequences with identity to the C. odorata chloroplast genome  or mitochondrial genes (Kuravadi et al. 2015) (Appendix 2; Table S3).
Sequence reads from the 46 specimens were aligned to the C. odorata 300 reference using BWA-MEM (v. 0.7.17;Li and Durbin 2010;Li 2013), and sequence alignments were used as inputs for probabilistic variant calling using SAMtools (v. 1.10;Li et al. 2009) and the Genome Analysis Toolkit (GATK) (v. 3.7;McKenna et al. 2010). Best-practice guidelines for GATK variant calling were used, including indel region realignment, and high-stringency variant filtering for coverage, mapping quality, and variant position. Initial SNP filtering was performed with VCFtools (v. 0.1.17; Danecek et al. 2011) to remove insertion/deletion variants, sites with greater than 85% missing data, sites with more than two alleles, and sites with a minor allele frequency (MAF) lower than 5%. By applying these filtering parameters, we sought to provide a set of 'high confidence' SNPs for further analysis and eventual candidate selection for inclusion on a genotyping array.

Candidate SNP selection and SNP assay development
We identified SNPs showing the highest differentiation in allele frequencies (F ST ) (Weir and Cockerham 1984) for groups based on latitude (LAT; decimal degrees), mean annual temperature (MAT; °C × 10), and annual precipitation (AP; mm). In this way, F ST was used to measure the partitioning of allelic variance in alternative groupings, not to make specific statements or inferences of population genetic parameters (e.g., probability of identity by descent, panmixis, or migration). Specimens were divided into a northern and southern LAT group based on a gap in the sampling distribution at 7.5° S latitude (Fig. 1a). Specimens were grouped into low, moderate, or high MAT (low < 20 °C; 20° < moderate < 25 °C; high > 25 °C; Fig. 1b) and AP (low < 2000 mm; 2000 < moderate < 3000 mm; high > 3000 mm; Fig. 1c) categories based on their tercile rank for these climate variables (Table S1) at their geographic source based on the WorldClim 2 dataset (Fick and Hijmans 2017) using R (see Table S7 for R packages and citations; R. Core Team et al. 2013;Finch 2019a, b). These measures exhibit low pairwise correlations (LAT × AP, r 2 = 0.29; LAT × MAT, r 2 = 0.01; AP × MAT, r 2 = 0.14; Fig. S2), so genetic associations with these gradients are likely to include SNPs that respond to different neutral and selective forces.
LAT, MAT, and AP groups were then treated as 'populations' for calculating F ST on a per-marker basis with VCFtools resulting in three lists of SNPs associated with these groups. To develop a SNP assay based on these loci, we sorted each SNP-F ST list in descending order, concatenated the 150 top F ST SNPs from each group into a single list, and filtered redundant SNPs to one SNP per contig of reference sequence. If two or more non-identical SNPs appeared on the same contig, we selected the SNP position with the highest F ST . We further filtered the list to retain SNPs with a MAF ≥ 20% to avoid SNPs that were nearlyfixed across much of the geographic range. This resulted in 152 high F ST and high MAF SNPs, none of which showed evidence of linkage disequilibrium (r ≤ 0.3). Contig names and SNP positions were converted into BED format, and BEDtools (v. 2.25.0; Quinlan and Hall 2010) was used to extract a multi-fasta file containing positions ± 100 bp flanking each SNP. The multi-fasta was used in primer design and multiplexing for a MassARRAY iPLEX® assay (Agena Bioscience, Inc., San Diego, CA, USA), using the MassAR-RAY Assay Design Suite software (Agena Bioscience, Inc.). From this list of sequences, we identified 140 SNP loci that could be evaluated in 4 multiplex groups. Resulting mass spectra were scored using Typer Viewer (v. 4.0.24.17; Agena Bioscience). All steps of SNP analysis (assay design, oligonucleotide synthesis, amplification, mass spectrometry, and SNP calling) were performed by NeoGen Genomics Inc. (Lincoln, NE, USA).

SNP assay screening
Samples screened for our selected SNPs derived from three sources (Table S4;  Moreno (Santa Cruz, Bolivia). DNA was extracted using a modified CTAB procedure (AG-Biotech LLC, Monterey, CA, USA) from herbarium-derived dry leaf tissue from MO (150-200 mg from fragment packets). Leaf-and cambiumderived DNA from La Molina/Cayetano/SERFOR were extracted from fresh tissue using a modified CTAB protocol (Healey et al. 2014). These samples yielded < 100 ng of total DNA, so we used whole-genome amplification (Genomi-phi™ V2 Amplification Kit, GE Healthcare, Chicago, IL, USA) to amplify the DNA using manufacturer's instructions. DNA from the remaining 86 samples was derived from fresh material dried in silica gel according to Dumolin et al. (1995) at VTI or IIAP. All samples were quantified via fluorometry (Qubit, ThermoFisher Scientific, Waltham, MA, USA).
MassARRAY SNP genotyping was performed on 376 samples representing 356 unique specimens and 20 technical replicates. Included in our assay were representatives of C. odorata  Table S4) were examined across a range of DNA mass inputs (50,75,100,200, and 300 ng). Our target DNA mass was 300 ng (recommended by NeoGen Genomics Inc.), although some samples had as little as 50 ng. In total, 330 DNAs derived from leaf tissue (primarily herbarium specimens) and 26 derived from cambium.

Assay assessment
To evaluate the broader use of the SNP array, SNP call rates (the proportion of successful diploid SNP genotyping calls at 140 diploid loci; Table S4) were determined for all samples, species, tissue types, and input concentrations. For population comparisons involving replicated samples (C. odorata 83, C odorata 291), we chose replicates with the highest call rates (50 ng dilutions in both cases). Loci were removed from all analyses if genotyping call rates were < 0.75 across specimens to strike a balance between missing information and sample retention. R packages adegenet (Jombart and Ahmed 2011), poppr (Kamvar et al. 2014), and hierfstat (Goudet 2005) were used to calculate observed heterozygosity, MAF, and F ST for each SNP. F ST was calculated on a per-marker basis treating species as populations, and by comparing northern and southern LAT groups (described above) as populations for C. odorata based on herbarium labels or field identification.
The R package adegenet was used to evaluate genetic structure among screening panel specimens with Discriminant Analysis of Principal Components (DAPC) (Jombart and Ahmed 2011), an ordination method based on genetic distances. We used DAPC clusters to define a reference database of specimens representing C. odorata and closely allied species (referred to as C. odorata sensu lato; n = 190; Appendix 3).

Discrete spatial classification with random forests
We classified C. odorata into discrete regional groups based on SNPs using random forests, a classification method that provides a consensus classification based on 'a forest' of many classification trees (Breiman 2001). Since our screened specimens represented dispersed samples and not necessarily populations, we designed classification tests to determine the classification accuracy obtained with our specimens and the SNP array at three geographic scales: (i) 'range-wide' with categories "Central America" and "South America," (ii) 'target countries' with categories "Ecuador," "Peru," and "Bolivia," and (iii) 'narrow regional,' within our target countries, with categories "NW," "NE," "SW," and "SE." Narrow regional groups were selected to represent an area approximately the size of Peruvian departments, and to maintain approximately equal sample sizes within groups.
For this analysis, we used our reference database of C. odorata s. l. and loci showing a call rate ≥ 0.75 with genotypes coded as categorical variables '0' (genotype homozygous for the reference allele), '1' (heterozygous), or '2' (homozygous for the alternate allele). Since random forests is not tolerant of missing data, we used the R package synbreed (Wimmer et al. 2012) to impute allelic data on a perlocus basis from sampled genotypes under the assumption of Hardy-Weinberg equilibrium.
We generated a random forest of 500 classification trees for each classification question allowing each tree to have as many branches as loci (mtry = number of loci minus 1). In each case, predictor variables for classification were the loci and region of origin was the grouping variable for each specimen. To avoid biasing misclassifications, sample sizes for each regional class in the grouping variable were held constant, with the sample size determined by the class with the smallest number of specimens (Table 1) (Sun et al. 2009). Since classification error varies in random forests due to random sampling (i.e., random starting specimen and random starting locus), we calculated the mean of the median error across 5000 random forests (2,500,000 total classification trees), and evaluated the range and distribution of errors. Observed classification error was compared to random expectations by randomizing the classes by the grouping variable. This method was used to understand the baseline random classification accuracy for random forest tests with different numbers of classes (Finch et al. 2017). Table 1 Abbreviations used to identify each random forest classification model, descriptions of regional classes examined, the number of samples per class used to train the model (n), and results for each model. Note: n is limited by the sample size for the regional class with the fewest samples Model Continuous spatial assignment with SPASIBA Spatial classification methods have been developed to provide continuous estimates of origin based on an interpolated surface of allele frequencies. The R package SPASIBA (Guillot et al. 2016) uses a training set of georeferenced genotypes to predict the highest probability origin for test genotypes. SPASIBA estimates the spatial auto-covariance of allele frequencies assuming that covariance diminishes with increased geographic distance (i.e., isolation-by-distance).
Allele frequencies for individuals geographically proximate to those included in the training set are estimated under the assumption of a population in Hardy-Weinberg equilibrium, and loci are assumed to be in linkage equilibrium. Predicted origins for "unknowns" can be estimated for areas where no training genotypes exist. We used the same data in SPASIBA analysis as was used in random forest analysis above with the exception that we limited the geographic scope of our analysis to specimens from the target countries (Ecuador, Peru, Bolivia) and additional samples from Brazil and Bolivia below − 17.5° S latitude (n = 148). We used two cross-validation methods to assess the performance of SPASIBA. The first method was a k-fold cross-validation (k-fold CV); we modeled spatial auto-covariance of allele frequencies by randomly selecting 90% of C. odorata specimens (n = 133) as a training set, and used the remaining 10% (n = 15) as validation samples to determine the error of predicted origins. K-fold CV follows recommendations to assess model performance, avoid overfitting, and increase computational efficiency (Lever et al. 2016). In practical use, however, a legal inquiry involving wood would likely employ all reference specimens to identify the origin of confiscated material. To assess the predictive accuracy of SPASIBA via this framework, we performed a leave-one-out cross-validation (LOOCV) by predicting the origin of each specimen based on composite allele frequencies from all other available specimens (n = 147). By including both cross-validation techniques, we: (i) gain an understanding of the range of prediction errors for a single individual as a function of different training and validation sets, (ii) provide a conservative estimate of prediction error that is less prone to overfitting, and (iii) obtain the optimal predictive accuracy for each specimen in our dataset. For both cross-validation methods, we defined prediction error as the distance between the known and predicted origins. Spatial predictions were made in decimal degrees; these were converted to Haversine distances (in kilometers) with the R package geosphere (Hijmans 2016).
For the k-fold CV analysis, we tested the spatial resolution of predicted origins with data sets containing missing data and imputed data separately (imputation as described above) because SPASIBA is tolerant of missing data. Since the selection of training samples influences the accuracy of continuous geographic assignment (Guillot et al. 2016), we repeated the SPASIBA analysis 100 times with each data set (missing data; missing data imputed) to evaluate the distribution of prediction error. We compared the k-fold CV results to our random forest analysis by filtering the k-fold CV results to show prediction error for the specimens used for the target country and narrow regional analyses. As with random forests, observed classification error was compared to random expectations by randomizing genotypes of the geo-referenced training data set, and repeating predictions for the validation samples.
With the LOOCV analysis, we were interested in knowing whether the optimal predicted origin and assignment errors were related to sample density. To evaluate this, we computed the mean pairwise geographic distance to the ten nearest neighbors for each sample, and assessed the relationship between the mean 'nearest-neighbor distance' and prediction error for each sample (Appendix 4).
All maps were drawn using the base map shapefiles from the World Borders Dataset (https ://thema ticma pping .org/).

SNP assay development
Target capture and short read sequencing of the design panel of C. odorata specimens (n = 46; Appendix 1) resulted in 4.4 × 10 8 paired sequence reads (9.0 × 10 10 bp total) with a mean individual sequence yield of 9.6 × 10 6 paired reads (range: 6.3 × 10 5 -4.7 × 10 7 ). The sequence yield for the 'reduced representation' nuclear reference for C. odorata 300 was 1.4 × 10 7 paired reads (2.8 × 10 9 bp; Table S2). The C. odorata 300 de novo assembly (Appendix 2) yielded 9,139 assembled contigs with a mean length of 982.5 bp (range 156-4,053 bp; sum of length = 9.0 × 10 6 bp; Table S3), and was used for read mapping and variant calling. On average, 3.6 × 10 6 reads mapped to each target reference contig (range 1.8 × 10 5 -1.7 × 10 7 reads; Table S2) for an average depth of 53.7X per target (range 1.1X-443.6X; Table S2). Estimated mean individual depth ranged from 5.9X to 277.3X (Table S2). Our initial vcf contained 1.6 × 10 6 sequence variants before filtering to remove insertion/deletion variants (5.9% of total variants), multi-allelic variants (7.8%), SNPs with greater than 85% missing information (46.1%), and SNPs with a MAF less than 5% (31.1%). The resulting, filtered sequence matrix of C. odorata specimens from target countries (n = 46; Table S1; Fig. 1a-c) included 144,083 SNPs, and was used as the basis for evaluating allelic associations with geographic and climatic variation, and for developing a SNP assay for spatial assignment. Figure 1 shows the geographic distribution of samples for each grouping (LAT, MAT, and AP) used to identify spatially informative SNPs via F ST (Fig. 1a-c), as well as their predicted values (Fig. 1d-f). SNPs selected for AP showed the strongest allelic differentiation relative to SNPs selected for LAT and MAT, with a mean F ST of 0.42 (interquartile range: 0.41-0.45; Fig. 1g). SNPs based on MAT showed a similar median F ST of 0.44 (interquartile range 0.43-0.46; Fig. 1g) with a higher mean F ST (0.46) and a higher maximum F ST (0.62). Surprisingly, LAT SNPs showed lower differentiation, with a median F ST of 0.23 (interquartile range 0.22-0.26 Fig. 1g). Our reduced SNP assay included 61 SNPs from the AP list, 53 SNPs from the MAT list, and 26 SNPs from the LAT list (Fig. 1h). These SNPs were converted to an Agena MassARRAY assay, and together array SNPs from the LAT, MAT, and AP lists showed a mean F ST of 0.41 and per-SNP F ST values ranged from 0.21 to 0.62.

SNP assay results
Across all samples and SNPs screened by MassARRAY, we observed 22.3% missing data (23,708 uncalled alleles out of 106,400 potential alleles), with specimens showing call rates (CR) of 0.00 to 0.96 across 140 SNPs. Three factors influenced CR: (i) DNA concentration, as input DNA mass above 50 ng resulted in decreasing CR (Fig. S4); (ii) the use of herbarium leaves, which showed a lower CR on average than other sources (Fig. S5); and (iii) the inclusion of multiple species, as DNAs derived from C. angustifolia and C. montana showing substantially lower mean CR than other species (CR = 0.62 and 0.71, respectively; Fig. S6). We discarded loci showing CR's < 0.75 (i.e., 25 additional loci) to mitigate the impact of missing data. This yielded a Cedrela dataset with 352 individuals (four individuals discarded as failures), 99 loci, and 1.32% missing data. This dataset showed a mean observed heterozygosity of 0.05 (range 0-0.37) and a mean MAF of 0.26 (range 0.02-1). Treating species as populations, we calculated a median F ST of 0.20 (range 0.01-0.56). Filtering for only C. odorata from South America (defined by herbarium labels and field identifications) produced a dataset with 135 specimens that showed a mean call rate of 0.81 (range 0.1-0.95). This C. odorata dataset included 99 loci and 1.01% missing data, and yielded a mean observed heterozygosity of 0.11 (range 0-0.44), a mean MAF of 0.34 (range 0.04-1), and a per-SNP median F ST of 0.01 (range − 0.02 to 0.25) for geographic groups similar to those shown in Fig. 1a.
Clusters 9 and 6 were excluded because they were largely composed of specimens identified as C. angustifolia and C. montana or C. fissilis, respectively (Appendix 3; Fig.  S7b). Cluster 3 was also excluded because it showed the lowest mean CR (0.34 compared to CR > 0.7 for all other clusters). The resulting dataset included 190 C. odorata s. l. from Central and South America used for random forest classification, and 148 C. odorata s. l. for SPASIBA analyses focused on South America. After defining the C. odorata s. l. dataset and removing individuals with high levels of missing information, we were able to retain a larger number of loci for random forest classification (116 SNPs) and SPASIBA continuous assignment (118 SNPs).

Range-wide
This analysis included 190 C. odorata s. l. specimens from Central and South America and 116 SNP loci. Each individual was assigned to one of two regional classes (Table 1; Fig. 2a). The estimated mean of the median classification error of 5.8% for observed data was much lower than the estimated mean of the median classification error from 5,000 randomizations (51.1%; Table 1, Fig. 2g).

Target countries
This analysis included 141 C. odorata s. l. specimens from three countries (Fig. 2b) and 116 SNP loci. The estimated mean classification error of 34.3% for observed data was lower than the estimated mean classification error from 5,000 randomizations (68.4%; Table 1, Fig. 2h).

Narrow regional
This analysis included 129 C. odorata s. l. specimens from four narrow regions approximately 3.8 × 10 5 km 2 in size (Fig. 2c) and 116 SNP loci. The estimated mean classification error of 34.7% was lower than the mean classification error of 76.9% estimated from 5000 randomizations ( Table 1; Fig. 2i).
Classification errors for the range-wide comparison were almost equally distributed between groups, with mean errors of 4.42% for samples from Central America and 7.05% for samples from South America (Fig. 2d). Finer-scale classification tests showed classification asymmetry, where spatial assignment error was not equal across classes. For example, classification errors for the target country comparison showed that specimens from Ecuador had lower classification error (30.7%) than either Bolivia or Peru (35.1% and 37.0%, respectively; Fig. 2e). Similarly, specimens from the NE class had a substantially lower classification error Maps show the geographic range of specimens categorized into regional classes for: a range-wide, b target country, and c narrow regional classification. Boxplots show the classification error (%) estimates by class for: d range-wide, e target country, and f narrow regional classes. Density plots show the distribution of classification error estimated for: g range-wide, h tar-get country, and i narrow regional classification. Light gray distributions indicate error for observed genotypes, and dark gray distributions indicate error for genotypes after randomizing regional class identifiers in the grouping variable. Maps labels show country codes: NIC (Nicaragua), CR (Costa Rica), PAN (Panama), COL (Colombia), VEN (Venezuela), ECU (Ecuador), PER (Peru), BOL (Bolivia), and BRA (Brazil) (21.3%), than specimens from the NW, SW or SE classes (39.5%, 39.1%, and 39.0% respectively; Fig. 2f).

SPASIBA assignment
We investigated the accuracy of continuous assignments 148 C. odorata s. l. specimens with 118 SNPs, SPASIBA, and two cross-validation methods (sample distribution shown in Fig. S9). The median deviation between the known sampling location and predicted origin for the k-fold CV was 259.6 km (25%ile = 96.1 km; 75%ile = 820.3 km; Table 2; Fig. 3a) with a maximum error of 2540.8 km. Median k-fold CV prediction error with observed data was lower than the estimated median error from randomized genotypes (median 904.1 km; 25%ile = 494.6 km; 75%ile = 1408.7 km; maximum = 3,033.8 km; Table 2). We evaluated the estimation error in the imputed dataset used for random forest analysis to determine whether imputation of missing data influences prediction errors, and found that imputation had little influence on prediction error (imputed error = 268.4 km; unimputed error = 252.5 km; Table S5; Fig. S8).
The k-fold CV results also showed that specimens from the Peru (Fig. 2b) showed a lower median prediction error (169.8 km) than Ecuador (598.0 km) and Bolivia (341.4 km; Table 2; Fig. 3b), a pattern that appears different from random forest classification results for countries (Fig. 2e). Similarly, specimens from the NE narrow region (Fig. 2c) showed a lower median prediction error (133.3 km) than the other narrow regional groups (NW = 598.1 km; SW = 135.0 km; SE = 424.0 km; Table 2; Fig. 3b); this pattern appears similar to our findings from random forest classification of these regional classes (Fig. 2f).
The median prediction error was lower under the LOOCV framework than the k-fold CV (188.7 km; 25%ile = 92.2 km; 75%ile = 754.3 km; Table 2; Fig. 3a). This was a 27.3% decrease in prediction error compared to the median k-fold CV estimate. Despite this improvement, we found no relationship between sample density and prediction error (Appendix 4). Results for the total analysis include: prediction errors from k-fold CV for all specimens after 200 model runs with observed genotypes (imputed and unimputed combined) and 200 model runs with randomized genotypes, prediction errors from LOOCV, k-fold CV prediction errors for specimens used for the target country and narrow regional random forest classification questions. Values are the estimated median prediction error (in km), with the range in parentheses

Total
Target country Narrow regional specimens used for continuous spatial assignment with SPASIBA: a prediction error for South American specimens across 200 k-fold CV replicates (imputed and unimputed combined) with observed genotypes (light gray) and 148 LOOCV replicates with observed unim-puted data (blue), b prediction error for specimens from target country groups, and c prediction error for specimens from narrow regional groups. Red dashed lines indicate the estimated mean prediction error for randomized genotypes ( Table 2)

Discussion
The neotropical tree species Cedrela odorata is a target of illegal logging, and has been heavily used for its timber at least since the description of the genus Cedrela in 1756 (Browne 1756;Pennington and Muellner 2010). Illegal logging of C. odorata typically involves fabrication of source on export documentation (Urrunaga et al. 2012), and since 2001, C. odorata has been protected in at least one country under CITES Appendix III (Ferriss 2014), increasing the importance of technologies that predict the origin of C. odorata wood. Country-of-origin declarations on import documents for wood can be difficult for wood importers to verify, and it is even more challenging for customs and border patrol agents to corroborate, making independent assessment methods a high-priority tool to aid the legal evaluation of wood products. Although, C. odorata and all Cedrela species have been elevated to CITES Appendix II, geographic localization of C. odorata wood remains relevant for seizures predating CITES Appendix II listing. Geographic localization is also essential to discovering the networks responsible for illegal logging, as has been shown for animal poaching (Wasser et al. 2018). We demonstrated that SNPs have the power to at least partially resolve the geographic origin of C. odorata across much of its range in Central America and western South America, and we present results from discrete classification of geographic origin with random forests (Breiman 2001) and continuous spatial prediction with SPASIBA (Guillot et al. 2016).

Using SNPs to predict the geographic region origin for C. odorata via discrete classification
A number of methods have been used for discrete assignment of genotypes to geographic groups (Manel et al. 2005;Ogden and Linacre 2015). Some of these methods use criteria or assumptions that are explicitly 'genetic' (e.g., Hardy-Weinberg and linkage equilibrium) (Rannala and Mountain 1997;Pritchard et al. 2000;Piry et al. 2004), but others are agnostic with regard to the input data type or the process(es) underlying the input data (Chen et al. 2017;Schrider and Kern 2018). 'Model-free' methods like random forests can use high-dimensional genetic data and non-genetic data to produce predictive functions that are robust to any type of data and distribution (e.g., nonnormal distributions, zero-truncated, continuous, or categorical data), allowing genetic data to be combined with other information that provides independent evidence of geographic origin such as specifically stable isotope profiles (Kagawa and Leavitt 2010;Gori et al. 2015Gori et al. , 2018. This is especially important in cases involving plantation grown timber, where genotypic data may correctly identify the ancestral geographic origin, but not the growing location for a specific tree (e.g., plantation-grown C. odorata from Africa). This flexibility has led to the adoption of random forest methods for multiple applications in ecology and evolution (Boulesteix et al. 2012;Brieuc et al. 2018), genomics and genetic association analysis (Goldstein et al. 2011;Stephan et al. 2015), and population assignment based on genetic variation (Bertolini et al. 2015;Chen et al. 2017;Sylvester et al. 2018) and parasite community (Perdiguero-Alonso et al. 2008;Pérez-Del-Olmo et al. 2010). Random forests have been less frequently used for spatial classification, with current published cases based on reflection and chemical spectra rather than genotypes (Li et al. 2012;Finch et al. 2017).
In our specific analysis, we determined that random forest classification based on SNP genotypes can predict whether C. odorata s. l. specimens originated in Central or South America with 5.8% classification error (Table 1). This method offers high discrimination accuracy for broadscale geographic source validation, and could serve as a 'first-pass' test for questions related to provenance on trade documentation. We found that random forest classification was less precise for identifying finer-scale questions, such 'country-of-origin' or 'department-sized regions-of-origin' within a country (34.3% error and 34.7% error, respectively). We suspect that within-class sample size was at least partly responsible for the relatively high error estimations at this scale, since both of these analyses (target country and narrow regional) indicated that some geographic signal was available for classification with our SNP assay (Table 1). It is important to note that while our method did not show high precision for identifying the country-of-origin, the four South American countries that listed C. odorata as CITES Appendix III (Bolivia, Brazil, Columbia, Peru) account for > 63% of the land area of the continent. With denser sampling across northern and eastern South America, it should be possible to test this classification method using CITES III protection status (protected versus non-protected) as the classifier, as exports from all of these countries are highly restricted.

Using SNPs to predict the geographic origin of C. odorata by continuous assignment
Methods have also been developed for estimating the origin of genotypes using continuous assignment (Wasser et al. 2004;Yang et al. 2012;Rañola et al. 2014;Guillot et al. 2016). These methods assume Hardy-Weinberg and linkage equilibrium and only use genetic data, but these limitations are balanced by the potential power of providing a precise geospatial source for a sample, rather than a categorical assignment (Degen et al. 2017;Chen et al. 2017). Continuous assignment with methods like SPASIBA are particularly relevant for questions involving wood legality because harvest locations are frequently not included in genetic reference populations, especially for geographically widespread species.
Our median prediction error for continuous assignment of C. odorata in South America was ~ 189 km via the LOOCV method, and ~ 260 km via the more conservative k-fold CV method (Table 2). These error estimates show promise for country-of-origin predictions, but may be less helpful for smaller countries and areas near international borders. Assignment errors ranged from 170 to 598 km for target countries (with Peru showing the lowest mean error; Fig. 3b), and from 133 to 598 km for department-sized geographic regions in our study area (with NE and SW regions in Peru showing the lowest errors; Fig. 3c). These errors -while large -are comparable to the continuous geographic assignment errors from other organisms based on similarlysized datasets. For example, the mean error for the placement of humans based on 100 SNPs was ~ 430 km (Rañola et al. 2014), the 75% placement error of Arabidopsis based on 100 SNPs was 375 km (Guillot et al. 2016), and the median error for the geographic assignment of elephants based on 16 microsatellites ranged from 267 km (savannah) to 301 km (forest populations) (Wasser et al. 2015).
Despite the practical advantages of SPASIBA for providing continuous origin prediction, a potential disadvantage of the method lies in the assumption that spatial auto-covariance of allele frequencies diminishes with geographic distance. The assumptions of this simple gradient function may introduce error if allele frequency surfaces are irregular or lack a dominant cline. In our study, we also stratified samples by precipitation, temperature, and latitude ( Fig. 1) to identify genes that might be responsive to different climate factors over small geographic distances, as might be the case due to heterogeneous elevation gradients imposed by the Andes Mountains. In doing so, SNPs selected for this assay appear biased towards genes showing stronger differentiation by climatic gradients than to spatial gradients of geographic distance; we observed that pairwise genetic distance was more strongly associated with pairwise MAT and AP distances than pairwise geographic distance (genetic distance ~ MAT distance Mantel r = 0.42; genetic distance ~ AP distance Mantel r = 0.26; genetic distance ~ geographic distance Mantel r = 0.08; Appendix 5) (Goslee and Urban 2007). This bias may have reduced the accuracy of our SPASIBA predictions, either by violating assumptions of simple gradients, or by selecting genes that show weaker correlations with geographic distances than they do to climatic distances. We will explore solutions to this in the future, by examining additional loci that show higher correlations to geography than climate, and testing continuous assignment methods that relax the assumption of isolation-bydistance allele frequency gradients (Rañola et al. 2014;Battey et al. 2019).

Recommendations for improving SNP-based geographic predictions for Cedrela
The accuracy of assignment can be dramatically influenced by the pattern of geographic sampling and density of genome coverage, especially for continuous assignment methods. Although we did not observe a relationship between sample density and prediction error (Appendix 4), continuous methods have been shown to provide highest accuracies when training datasets include individuals from the same genetic background as test individuals (Guillot et al. 2016). Additionally, the impact of the size of the genetic database on assignment accuracy can also be substantial. For example, two independent analyses have shown that increasing genomic density from 100 to 1,000 SNPs leads to significant reductions in prediction error (Rañola et al. 2014;Guillot et al. 2016). Our foundation dataset of 144,083 SNPs for C. odorata offers a rich resource that can be used to further refine SNP assays for geographic assignment. In this context, we note that additional SNPs (~ 350) are currently being evaluated for spatial assignment of C. odorata and C. fissilis, and this includes different nuclear and organelle markers from different source populations (Blanc-Jolivet et al., unpublished;Paredes-Villanueva et al. 2019). Joint analysis of these two marker sets using common samples should show whether simply doubling the number of genotypes and SNPs offers significant improvements in geographic assignment accuracy. Finally, different Cedrela species can show different allele frequencies across loci, and this may distort allele frequency surfaces used in spatial assignment and lead to less accurate geographic predictions for C. odorata. In this regard, we recommend exploration of genetic structure across reference specimens with DAPC (Appendix 3) or a similar method. In our reference database for C. odorata, we identified examples of specimens that were taxonomically misidentified, and this is common in natural history collections of tropical plants (Goodwin et al. 2015). Additional "C. odorata" reference specimens should be assessed for taxonomic cohesiveness with C. odorata before usage with evidence.

Conclusions
We identified greater than 100,000 SNPs that can be used to develop and refine assays for geographic localization of C. odorata wood specimens. From this database, we designed and tested a 140 SNP assay to predict the geographic origin of C. odorata, and we evaluated discrete (random forest) and continuous (SPASIBA) prediction methods. These methods make different assumptions with regard to the Hardy-Weinberg and linkage equilibrium; as such, they may show different performance depending on 1 3 the degree to which SNPs track selective (environmental) gradients or deviate from genetic assumptions. Although the observed error estimates from our geographic predictions are too large for fine-scale geographic assignment, the assay shows high accuracy for determining the continent of origin and promise for country-level verification of specimens. This assay provides a tangible first step for determining the origin and legality of C. odorata wood, and these SNP resources and methods should provide the wood products industry with new (and developing) tools to improve the legality of C. odorata and closely allied species in wood trade. Contrato No. 001-2016-SERFORDGGSPFFS-DGSPF were granted for field work and genetic analyses to the von Thüenen Institute of Forest Genetics and Instituto de Investigaciones de la Amazonía Peruana in Peru. We are grateful to Gabriel Hidalgo, Gerardo Flores, David Aldana, Luisa Huaratapairo, and Eduardo Mejía of for their support in collecting samples and extracting DNA. Ecuadorian samples provided by Thüenen were collected with the coordination of Stephen Cavers as part of the 'SEEDSOURCE' project, F96-2002-INCO-DEV-1 contract number 003708. We appreciate project coordination from Ashley Warriner and A. J. Doty (U.S. Forest Service International Programs), and informatics and sequencing assistance from the Center for Genome Research and Biocomputing at Oregon State University. Funding for this study was provided by U. S. Agency for International Development (Award 19318814Y0010-140001) to the U.S. Forest Service International Programs, the U.S.D.A. Forest Service Pacific Northwest Research Station, and the Moldenke Endowment (Botany and Plant Pathology Department, Oregon State University).
Data Availability Raw sequence data associated with this study is available from the NCBI GenBank BioProject Archive under accession number PRJNA369105. Also see the data set available via the Oregon State University Scholars Archive https ://doi.org/10.7267/TQ57N X45Z. This supplementary directory contains a reduced representation nuclear genome reference for C. odorata, a VCF file containing 144,083 high quality SNPs for C. odorata which may be used to further refine out SNP genotyping assay for geographic assignment or other purposes, 140 SNP primers used for the Agena™ MassARRAY®, and data sets and R code to replicate our statistical analysis.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.