Background

The Amazonian rainforest hosts one of the greatest pools of terrestrial biodiversity, including very large tree species diversity [1–3]. In forest genetics, most efforts so far have focused on temperate and boreal tree species. While ongoing anthropogenic climate change is suspected to deeply affect the stability of Neotropical rainforests [4], tropical tree species genetic resources and adaptive potential are still poorly known [5], despite the availability of sequence data for several species [6–8]. Identification of polymorphisms and robust estimates of tropical tree species’ standing genetic diversity are thus needed to evaluate the vulnerability of populations to environmental changes and their ability to endure them [9, 10].

A thorough assessment of tropical tree species’ genetic diversity requires large amounts of genomic data and informative molecular markers [11, 12]. Single-nucleotide polymorphisms (SNPs) have become the most popular genome-wide genetic markers [13, 14] and are increasingly used to characterize potentially adaptive genetic variation (e.g. [15–17]).

High-throughput sequencing and genotyping methods have paved the way to genomic studies in non-model species [14, 18, 19], by permitting cost-effective sequencing and the generation of very large genetic data collections. Thus, next-generation sequencing (NGS) provides a valuable tool to describe genome properties and variation in non-model species [14, 20]. While assembling whole genomes without a reference sequence can be very complex and, in the best cases, incomplete, transcriptome sequencing constitutes an efficient alternative in information-poor organisms [21]. Transcriptomes also include a large number of loci with known or predictable functions [22, 23] and have been applied to comparative genomics [24], marker discovery [25], and population genomic studies [26].

An array of next-generation sequencing strategies, varying in read length range and absolute throughput [27], can be used to sequence transcriptomes. The Roche 454-pyrosequencing technology, in spite of being the oldest among these, produces on average the longest reads [23, 28, 29], which makes de novo assembly easier in non-model species without prior genomic resources [25, 30, 31] and allows preliminary screening of DNA variation [32] and transcriptome analysis (gene expression profiling by mRNA identification and quantification; [33]).

In this study we describe the transcriptomes of four widespread Neotropical tree genera chosen to represent different botanical families, ecological properties and patterns of local and range distribution (see Methods).

The objectives of the present study are (i) to describe the transcriptomes of these four tropical genera, (ii) to compare expression profiles among species and organs (leaves, stems and roots), and (iii) to provide an initial catalogue of well-supported mismatches, as candidates for validation as SNPs.

Methods

Study species and sampling

The four species studied (Symphonia globulifera L. f. (Clusiaceae); Virola surinamensis (Rol. ex Rottb.) Warb. (Myristicaceae); Carapa guianensis Aubl. (Meliaceae); Eperua falcata Aubl. (Fabaceae)) are characterized by contrasting ecological requirements and seed dispersal strategies (Table 1) [34–43]. For each species, we collected about ten seeds at one of three sampling sites: Paracou (5°16’20”N; −52°55’32”E) for E. falcata and V. surinamensis, Matiti (5°3’30”N; −52°36’17”E) for S. globulifera, and Rorota (4°51’32”N; −52°21’37”E) for C. guianensis. The study complies with the Convention on Biological Diversity. The collection was performed according to local and national legislation on the protection of biodiversity in sampling sites without any special protection status; all sampling permissions were acquired within the frame of the PO-FEDER “ENERGIRAVI” program, granted by the European Union and the Regional government, and by owners of sampling sites (CIRAD for Paracou, Lycée Agricole Matiti for Matiti, ONF for Rorota). The study species are not listed as Endangered by the CITES convention. All the data obtained in this study were shared with the local Regional authorities in compliance with benefit-sharing principles. Seeds were germinated and seedlings were grown in a greenhouse for twelve months under non-limiting light and water conditions as described in Baraloto et al. [44]. Two vigorous seedlings of each species were selected for transcriptome analyses. Plant material was sampled from three organs: leaves, stems and roots.

Table 1 Species description: distribution range, ecological properties relative to light (successional status) and soil, spatial population structure and seed dispersal properties

cDNA library preparation and sequencing

Total RNA from each fresh sample was extracted using a CTAB protocol as described by Le Provost et al. [45] (with minor modifications for a subset of the samples). mRNAs were converted to double-stranded cDNA using either the SMARTer PCR cDNA Synthesis Kit (Clontech) or the Mint cDNA synthesis kit (Evrogen) according to the manufacturer’s instructions.

For each species, cDNA libraries from the different organs (leaves, stems and roots) were identified by a specific molecular identifier (MID) tag. Samples from the same organ of different conspecific individuals were pooled for sequencing (MID1 = leaves, MID2 = stems, MID3 = roots). Libraries of the different species were sequenced separately (one run per species) according to a standard Roche-454 protocol [46]. The raw data were submitted to the European Nucleotide Archive (ENA) database (study number: PRJEB3286; http://www.ebi.ac.uk/ena/) and given the accession numbers ERS177107 through ERS177110.

Assembly and functional annotation

The bioinformatic flowchart includes the following steps (Figure 1): for each species, .sff files were extracted into .fasta, .fasta.qual and .fastq files using the ‘sff_extract’ script available at http://bioinf.comav.upv.es/sff_extract/. The extraction was made both with and without clipping of read ends. Adaptor and MID sequences were identified in .fasta files (with unclipped ends) by searching for exact motifs of MID1, MID2 and MID3 in the first twenty bases of each read. The distribution of clipped-end raw read sizes for all species is shown in Additional file 1: Figure S1.
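The MID search described above can be illustrated with a minimal Python sketch. The tag sequences below are placeholders assumed for illustration (they are not taken from this study), and the function name `assign_organ` is hypothetical; the only rule taken from the text is the exact-motif search within the first twenty bases of each unclipped read.

```python
from typing import Dict, Optional

# Placeholder tag sequences, assumed for illustration only; substitute the
# MID tags actually used for library construction.
MID_TAGS: Dict[str, str] = {
    "leaves": "ACGAGTGCGT",  # MID1 (assumed sequence)
    "stems": "ACGCTCGACA",   # MID2 (assumed sequence)
    "roots": "AGACGCACTC",   # MID3 (assumed sequence)
}


def assign_organ(read: str, window: int = 20) -> Optional[str]:
    """Return the organ whose tag occurs exactly in the first `window` bases."""
    head = read[:window].upper()
    hits = [organ for organ, tag in MID_TAGS.items() if tag in head]
    # Untagged reads, or reads matching several tags, are left unassigned.
    return hits[0] if len(hits) == 1 else None
```

Leaving ambiguous reads unassigned is a design choice of this sketch, not a step reported in the text.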

Figure 1 Bioinformatics flowchart.

Clipped-end reads were de novo assembled into contigs using MIRA v.3.4.0. The software is rather flexible, offers a large range of parameter choices [47] and has been used efficiently for transcriptome assemblies [48]. We applied the “accurate” mode (with ‘job’ arguments: ‘de novo, est, accurate’) to limit the assembly of paralogous genes. Singletons (i.e. unassembled reads) were discarded for all subsequent steps.

Because different numbers of reads were obtained from different organs, comparisons in the number of contigs (unigenes) among organs may suffer from ascertainment bias, with libraries containing fewer reads displaying fewer contigs due to more limited sampling. To test for this effect we have applied the RaBoT method [49], which compares observed values of a given statistic (here, number of contigs) in a smaller sample (the ‘empirical’ value) with the value obtained from repeated sub-samples of the same size, drawn from a larger sample (the ‘bootstrapped’ values). The statistic in the larger sample is thus evaluated in the same conditions as in the smaller one, which allows an unbiased comparison and their difference to be tested statistically. RaBoT was applied with N = 100 sub-samples. Because the sub-samples were not independent, only the non-parametric test and P-value (i.e. the fraction of the distribution of ‘bootstrapped’ values that is above the ‘empirical’ value) are reported.
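The rarefaction logic can be sketched in a few lines of Python. This is not the RaBoT implementation of [49]: it is a simplified illustration that assumes each read of the larger library is already labelled with the contig it belongs to, and that the statistic is the number of distinct contigs hit by a sub-sample. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def rarefied_contig_counts(read_contigs: Sequence[str], size: int,
                           n_sub: int = 100, seed: int = 0) -> List[int]:
    """Distinct-contig counts in `n_sub` random sub-samples of `size` reads."""
    rng = random.Random(seed)
    pool = list(read_contigs)  # one contig label per read of the larger library
    return [len(set(rng.sample(pool, size))) for _ in range(n_sub)]


def fraction_above(bootstrapped: Sequence[int], empirical: int) -> float:
    """Fraction of 'bootstrapped' values that lie above the 'empirical' value."""
    return sum(b > empirical for b in bootstrapped) / len(bootstrapped)
```

Here `size` would be the number of reads in the smaller library and `empirical` its observed number of contigs; a value of `fraction_above` close to 1 means the larger library still yields more contigs after rarefaction.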

Assembled contig consensus sequences were submitted to Blast2Go (B2G) analysis (http://www.blast2go.de/b2ghome), which permits large-scale blasting, mapping and annotation of novel sequence data, particularly in non-model species [50]. A BlastX search was performed on species assemblies against the NCBI non-redundant protein database (with BlastX minimum e-value of 10⁻³, Number of Blast Hits = 20). We performed a semi-automated search for contaminants by verifying the organism identity of each blast hit as follows: the NCBI Taxonomy CommonTree Browser (http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) was searched with a non-redundant list of species extracted from B2G. Contigs for which at least one of the ten hits with the lowest e-values (< 10⁻²⁵) identified a sequence from a genus belonging to the "green plant" node of the generated tree were considered as non-contaminants; contigs with no hits to any “green plant” genus were treated as contaminants and excluded. Contigs were then assigned the informative functional annotation with the lowest e-value among plant species hits, provided that this e-value was smaller than 10⁻²⁵.
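The contaminant screen can be summarised by the following Python sketch. It assumes that each contig comes with a list of (e-value, genus) pairs and that the set of "green plant" genera has already been derived from the taxonomy tree; both data structures and the function name are illustrative. The "uncharacterized" label for contigs with no hit below the threshold is an assumption of this sketch.

```python
from typing import Iterable, List, Set, Tuple

Hit = Tuple[float, str]  # (e-value, genus of the hit organism)


def classify_contig(hits: Iterable[Hit], green_plants: Set[str],
                    max_evalue: float = 1e-25, top_n: int = 10) -> str:
    """Label a contig from its ten best hits, following the rule in the text."""
    best: List[Hit] = sorted(hits)[:top_n]  # the ten lowest e-values
    strong = [(e, g) for e, g in best if e < max_evalue]
    if not strong:
        return "uncharacterized"  # assumption: no hit passes the threshold
    if any(genus in green_plants for _, genus in strong):
        return "non-contaminant"
    return "contaminant"
```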

The Gene Ontology annotation analysis (with e-value hit filter = 10⁻⁶, Annotation cutoff = 55, GO weight = 5, Hsp-hit coverage cutoff = 0) allowed the matching of each contig with Molecular functions, Cellular Components and Biological processes under the plant GOslim option. Annotation analyses were performed in all cases at levels 3 and 4, that is, with GO terms being three or four nodes away from the root of the GO term trees [51, 52]. These levels were chosen because they group genes according to processes at intermediate levels of biological integration, which can be readily interpreted in terms of implication in cell-, organ- and organism-level developmental and physiological functions [53]. Across-species sharing of Level 3 GO terms was inspected. Moreover, considering that a contig’s number of reads is a rough estimator of the level of expression of the corresponding gene [28, 54, 55], we used the number of reads belonging to contigs associated to level 4 GO terms to identify processes with organ-specific variations in expression levels. To identify those processes, we used a permutation analysis as follows:

  (i) The contigs associated to each level 4 GO term were identified, and the number $R_{ci}$ of reads obtained for each contig $i$ from each organ was recorded. The following steps were executed separately for each organ;

  (ii) the observed average number of reads across all contigs associated to a given Biological Process, $\overline{R_{ci}}$, was computed; this statistic was considered as an estimator of the average expression level of all genes involved in that Biological Process (contigs with zero counts were excluded);

  (iii) then, the values $R_{ci}$ of reads per contig (within each organ) were permuted over all contigs 1000 times. At each permutation, the average read count of all contigs associated to a given biological process, $\overline{R_{ci,\mathrm{permut}}}$, was computed again, and the difference between the empirical (observed) $\overline{R_{ci}}$ and $\overline{R_{ci,\mathrm{permut}}}$ was recorded;

  (iv) the distribution of these differences indicates how close to average the expression of genes belonging to a given GO term is; i.e., for a Biological Process whose genes exhibit an average level of expression, the distribution of mean differences obtained from permutations overlaps zero; biological processes whose genes have expression levels above average have a distribution of permuted differences above zero, and vice versa for biological processes with genes showing less than average expression levels;

  (v) if, for a given biological process, the observed average read count per contig was larger than 95% of the average values obtained by permutation, then the group of genes associated to that biological process was considered as over-expressed, and consequently the biological process was considered functionally important for that organ.

Because a contig may be associated to different biological processes, steps (ii)-(v) above were performed for each biological process separately. Because all permutation tests were performed within organs, this analysis is not prone to biases in the number of reads per organ (see above and Results and Discussion). Comparisons among organs for variations in expression among processes were done qualitatively.
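As an illustration of steps (ii) to (v), the following Python sketch tests one biological process in one organ. It is a minimal reconstruction under stated assumptions, not the script used in this study: `reads` maps each contig to its read count in the organ, `term_contigs` is the set of contigs annotated with the GO term, and the function name is hypothetical.

```python
import random
from statistics import mean
from typing import Dict, Set


def go_term_overexpressed(reads: Dict[str, int], term_contigs: Set[str],
                          n_perm: int = 1000, level: float = 0.95,
                          seed: int = 0) -> bool:
    """True if the observed mean read count exceeds 95% of permuted means."""
    # Step (ii): contigs with zero reads in this organ are excluded.
    contigs = [c for c, r in reads.items() if r > 0]
    counts = [reads[c] for c in contigs]
    in_term = [i for i, c in enumerate(contigs) if c in term_contigs]
    if not in_term:
        return False
    observed = mean(counts[i] for i in in_term)
    rng = random.Random(seed)
    exceeded = 0
    for _ in range(n_perm):
        rng.shuffle(counts)  # step (iii): permute read counts over all contigs
        permuted = mean(counts[i] for i in in_term)
        exceeded += observed > permuted  # sign of the recorded difference
    # Step (v): observed mean larger than 95% of the permuted means.
    return exceeded / n_perm >= level
```

Calling the function once per GO term and per organ mirrors the fact that the test is performed for each biological process separately and always within organs.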

Mismatch identification

Assemblies were post-processed using both BioPerl scripts from the SeqQual pipeline (Lang et al. in preparation) (Additional file 2) and custom R scripts (Additional file 3), which filtered the data through successive steps integrating a number of quality criteria (see Additional file 4: Table S1 for a description of the programs used). The different steps of the procedure were as follows:

Splitting .ace assembly files and linking to quality

Assembled contig sequence files were extracted from the .ace files given by MIRA and linked to their original base quality scores contained in the .fasta.qual files.

Assembly cleaning

Nucleotide differences were screened in assembled contigs and particular bases were masked according to several criteria:

  • being a singleton

  • being a variant with a frequency lower than 0.1 (see also ‘Computing mismatch statistics and post-filtering’ below)

  • having a quality score lower than 20 for polymorphic sites.

Following this ‘masking step’, a ‘cleaning step’ removed all positions (i.e. corresponding to one base) of the assembled contigs that contained only indels and masked bases. This last step is particularly relevant for 454 data, where false insertions due to homopolymers are very common and drastically affect contig consensus, hampering further re-sequencing and SNP design for genotyping. Consensus sequences (using IUPAC codes) were generated from the cleaned assembled data and used both for estimating the total transcriptome length obtained and for identifying well-supported mismatches.
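The masking and cleaning steps can be pictured with the short Python sketch below. It is not the SeqQual code: it assumes that each aligned position is available as a list of (base, quality) pairs, that gaps ('-') count as a state when computing depth and frequencies, and that masked bases are written as 'N'. All names are illustrative.

```python
from collections import Counter
from typing import List, Tuple

Column = List[Tuple[str, int]]  # one aligned position: (base or '-', quality)


def mask_column(col: Column, min_freq: float = 0.1,
                min_qual: int = 20) -> List[str]:
    """Mask singleton, rare or low-quality bases at a polymorphic position."""
    counts = Counter(base for base, _ in col)  # gaps count as a state (assumed)
    depth = len(col)
    polymorphic = len(counts) > 1
    out = []
    for base, qual in col:
        masked = base != "-" and polymorphic and (
            counts[base] == 1                   # singleton
            or counts[base] / depth < min_freq  # variant frequency below 0.1
            or qual < min_qual                  # quality score below 20
        )
        out.append("N" if masked else base)
    return out


def clean_alignment(columns: List[Column]) -> List[List[str]]:
    """Drop positions that contain only indels and masked bases."""
    masked = [mask_column(c) for c in columns]
    return [m for m in masked if any(b not in "-N" for b in m)]
```

Under these assumptions, a position where a single read carries an extra homopolymer base and all other reads show a gap is first masked and then removed.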

Computing mismatch statistics and post-filtering

All mismatches contained in the cleaned assemblies were used to build a summary statistics table (number of occurrences and frequency of the different variants, depth, mean quality, minor allele frequency (maf)). This table was used to identify the highest-quality mismatches a posteriori (without affecting assembly and consensus). In particular, we chose to avoid:

  • mismatches adjacent to each other, because they are likely to be assembly artefacts [56, 57].

  • mismatches with lower-than-expected frequencies based on the number of gametes sequenced. With two genotypes, four different gametes were sequenced, so that the minimum expected frequency of a variant is 0.25. The same rationale can be applied to any number of gametes $2N$. The number of times the minority variant (with expected frequency in the sequence pool $p = 1/(2N)$) is observed follows a binomial distribution. The probability of observing the variant exactly $t$ times out of $x$ reads is computed as $p_t = \binom{x}{t} p^t (1-p)^{x-t}$, and the probability of observing it $t$ times or fewer is given by $\sum_{i=0}^{t} p_i$. All polymorphisms that were present in a configuration (e.g. 3 variants among 29 reads) with a cumulative probability P < 0.05 were considered as false positives and were discarded. This led to the exclusion of additional variants with frequencies between 0.1 and 0.15 but with probability below 5%.

  • mismatches having a depth lower than 8X, which can be considered a stringent criterion, given the minimum quality score of 20 for each base, a minimum SNP frequency of 2/8 = 0.25 here (since singletons have been previously excluded), and the fact that this configuration has a probability of 0.31 based on the binomial distribution rationale, which is well above the 5% threshold chosen before.
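The binomial rationale used in the filters above can be checked numerically with a few lines of Python. The function names are illustrative; the default of four gametes corresponds to the two genotypes sequenced per species.

```python
from math import comb


def prob_exact(t: int, x: int, n_gametes: int = 4) -> float:
    """Probability of seeing the minority variant exactly t times in x reads."""
    p = 1.0 / n_gametes  # expected frequency of a variant carried by one gamete
    return comb(x, t) * p**t * (1 - p)**(x - t)


def prob_at_most(t: int, x: int, n_gametes: int = 4) -> float:
    """Cumulative probability of seeing the variant t times or fewer."""
    return sum(prob_exact(i, x, n_gametes) for i in range(t + 1))


def is_false_positive(t: int, x: int, n_gametes: int = 4,
                      alpha: float = 0.05) -> bool:
    """Flag configurations whose cumulative probability is below alpha."""
    return prob_at_most(t, x, n_gametes) < alpha
```

With these definitions, 3 variants among 29 reads give a cumulative probability of about 0.046 and are discarded, whereas 2 variants among 8 reads have an exact probability of about 0.31 and are retained.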

Following the filtering steps described above, mismatches were counted and their density per base was computed as the total number of putative variants (including those at contig ends that passed the quality and singleton filters) divided by the total number of bases where the depth was at least 8 reads. Numbers of transitions, transversions, and deletions were also reported.
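For clarity, the summary statistics named above can be computed as in the following Python sketch, which assumes that each bi-allelic mismatch is given as a pair of states (two bases, or a base and '-' for a deletion). The function names are illustrative and this is not the R code of Additional file 3.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Classify a bi-allelic mismatch as transition, transversion or indel."""
    if a == b:
        raise ValueError("the two states of a mismatch must differ")
    if "-" in (a, b):
        return "indel"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"  # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"


def ti_tv(variants: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over a collection of mismatches."""
    kinds = [classify(a, b) for a, b in variants]
    return kinds.count("transition") / kinds.count("transversion")


def density_per_100bp(n_variants: int, n_bases_depth8: int) -> float:
    """Mismatches per 100 bp over positions with a depth of at least 8 reads."""
    return 100.0 * n_variants / n_bases_depth8
```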

Results and discussion

Assembly

Sequence data were obtained from all tissues and species except S. globulifera, for which root cDNA sequencing failed. Between 167140 and 248145 reads were obtained per species; the distribution of clipped-end read lengths is shown in Additional file 1: Figure S1. More reads were associated with roots than with stems or leaves (Table 2). This is in agreement with the higher levels of gene expression found in roots compared to other organs in model species such as Arabidopsis thaliana [58]. Alternatively, this may be due to technical artefacts such as a more efficient RNA extraction and/or cDNA amplification from roots than from other organs, and a lower RNA extraction yield in leaves due to high concentrations of secondary metabolites. Nevertheless, all RNA samples were equally stable, as no sign of degradation was detected after a two-hour incubation at 37 °C (data not shown). Also, technical descriptors of the experiment such as RNA A260/A280 ratio, total amount of RNA used and total cDNA yield did not influence the number of reads, as shown by the non-significant P-values associated to each factor in a Generalised Linear Model (GLM; see Additional file 5: Table S2).

Table 2 Partitioning of reads among different organs (leaves, stems, roots) in each species cDNA library (C. guianensis, E. falcata, S. globulifera and V. surinamensis) with percentages in parentheses

Between 103433 (S. globulifera) and 153551 (C. guianensis) reads were successfully assembled into contigs, and between 17103 and 23390 contigs were obtained, depending on the species (Table 3). These figures are close to the average number of contigs commonly obtained in similar studies [51, 59, 60] and suggest reasonable transcriptome coverage if we assume that the number of contigs slightly overestimates the species’ unigene set (i.e. multiple contigs may come from the same transcript). However, we expect genes with low expression levels to be missing from our catalogue, as the absolute number of reads obtained here prevents assembly of under-represented transcripts. Average contig length varied between 414 bp (E. falcata) and 523 bp (C. guianensis), and N50 values were just above average contig length for all species (Table 3 and Additional file 6: Figure S2); clearly, coverage of individual transcripts and representation of the transcriptome are only partial, and require extension through further sequencing based on higher-throughput methods. The distribution of reads over contigs was quite even, but the coverage was low, with an average between 6 and 7 reads per contig and around 90% of the contigs with 10 reads or fewer (Table 3, Additional file 7: Figure S3). The number of contigs associated to each organ (i.e., the number of contigs including reads from a particular organ or combination of organs) varied widely (Figure 2); to check whether this was an artefact of the absolute number of reads obtained from each organ (Table 2), numbers of contigs obtained from each organ were submitted to RaBoT analyses. In all pairwise comparisons between organs, the number of contigs obtained from the organ with the larger number of reads remained larger after rarefaction (P-value = 1 in all comparisons, with the exception of the stem/leaves pair in C. guianensis, which had P = 0.010, indicating that the difference in number of contigs between these two samples is probably artefactual). Therefore, the larger number of contigs observed in organs with larger numbers of reads cannot generally be explained entirely by sampling bias. A large number of contigs was solely associated to roots for the three species with root data (Figure 2), particularly in E. falcata (61% of contigs from roots only, compared to 29% and 37% for C. guianensis and V. surinamensis). In contrast, contigs exclusive to stems and leaves were in much lower proportions in the three species with root data, varying from 4% to 7% for stems, and 3% to 12% for leaves (Figure 2).

Figure 2 Number of contigs associated with each organ (leaves, stems, roots) (Note: sequencing from S. globulifera roots failed). Carapa = Carapa guianensis; Eperua = Eperua falcata; Symphonia = Symphonia globulifera; Virola = Virola surinamensis. L, S and R indicate contigs specific to Leaf, Stem and Root, respectively; combinations of symbols correspond to contigs occurring in multiple organs.

Table 3 Assembly results: number of assembled reads, number of contigs, total transcriptome coverage, average length per contig, and average number of reads per contig

Functional annotation

Functional annotation based on BlastX and gene ontology analyses allowed classifying contigs into functional groups. A majority of contigs returned a Blast hit result with e-values below 10⁻²⁵ (Additional file 8: Figure S4) for C. guianensis (79%), E. falcata (69%), S. globulifera (74%) and V. surinamensis (70%), but only between 48.1% (E. falcata) and 64.1% (C. guianensis) had functionally informative annotations (Table 4). Less than 3.1% of the characterized contigs were identified as contaminants for any species (1.58%, 3.06%, 2.92% and 0.29% in C. guianensis, E. falcata, V. surinamensis and S. globulifera respectively). After removing contaminants, from 12603 (S. globulifera) to 16912 unigenes (C. guianensis) with e-value < 10⁻²⁵ were retained, covering 4.75 Mbp (in S. globulifera) to 7.75 Mbp (in C. guianensis) (Table 4).

Table 4 BlastX statistics per species, performed on consensus sequences obtained from the MIRA assemblies

Gene Ontology analysis provided the annotation of all contigs with significant Blast hits. Additional file 9: Table S3 and Additional file 10: Table S4 report respectively contig sequences and grouping of contigs by GO term. The most represented GO terms are globally very similar across species at levels 3 and 4 for Cellular Component, Molecular Function and Biological processes (Additional file 11: Figure S5 and Additional file 12: complementary caption to Figure S5). Interestingly, cyclic and heterocyclic compound-binding (including nucleosides) dominate Molecular functions, with more than 40% of the contigs belonging to such terms (4 and 8, Additional file 11: Figure S5); for comparison, Parchman et al. [61] found about 20% ‘nucleotide binding’ plus ‘other binding’ in Pinus contorta; the excess of functions related to aromatic compounds in tropical trees may suggest a major role of secondary metabolites, as indicated by Cottet et al. [62] for S. globulifera. This may be related to the very strong predation pressure exerted by herbivores [63] and pathogens [64] on tropical forest plants. Biological processes (level 4) are dominated by macromolecule metabolism, including again cyclic compound processing, which is consistent with the Molecular function results (‘Response to stress’ ranks fifth in Level 3 Biological processes, with about 4% of the hits, which is compatible with results in [6–10]). Overall, eighty-one biological processes (level 3) were represented across species (77, 73, 75 and 70 for C. guianensis, E. falcata, S. globulifera and V. surinamensis, respectively), of which sixty-six were shared by all species and five were represented in only one species (Figure 3); however, the GO terms appearing in only one species were represented by only a few contigs (Additional file 13: Table S5). The absolute numbers of contigs belonging to a given GO term were highly correlated among species (r > 0.99 for all pairs; Additional file 14: Figure S6). Given the differences noted above with a conifer species, this strong convergence among tropical species belonging to different families may reflect patterns specific to tree species that undergo the same environmental conditions rather than general patterns in plants.

Figure 3 Sharing of GO terms (level 3) across species. Only non-contaminant contigs with an e-value lower than or equal to 10⁻²⁵ were retained for the analysis. Cg: Carapa guianensis; Ef: Eperua falcata; Sg: Symphonia globulifera; Vs: Virola surinamensis.

Permutation analyses allowed us to identify biological processes (level 4) showing a significantly higher occurrence of contigs for a given organ, which could be interpreted as a higher expression of genes belonging to that process in that organ (Figure 4 and Additional file 15: complementary caption to Figure 4). In leaves, between five (V. surinamensis) and ten (C. guianensis) biological processes stood out (Figure 4 left column), and eight of them were identified in more than one species. Not surprisingly, biological processes related to photosynthesis and carbon cycle in leaves appear in this group (‘carbohydrate metabolic process’, ‘carbon fixation’, ‘generation of precursor metabolites and energy’, ‘nitrogen cycle metabolic process’, ‘organic substance biosynthetic process’, ‘oxidation reduction process’, ‘photosynthesis’, ‘response to radiation’).

Figure 4 Box-plot of permuted values of differences between observed and randomised $\overline{R_{ci}}$ for individual GO terms in each organ/species. Only biological processes showing a positive difference (i.e. having a bootstrap interval that does not overlap zero, indicating higher expression levels than average) are shown. For detailed names of the biological processes shown, see Additional file 15. (A) C. guianensis; (B) E. falcata; (C) S. globulifera; (D) V. surinamensis (Note: sequencing from S. globulifera roots failed).

In stems, we detected between eight (S. globulifera) and twenty-five (V. surinamensis) biological processes (Figure 4 middle column) that had significantly higher-than-average expression levels, fifteen of them being shared among different species. At least a subset of these processes (‘cellular biosynthetic process’, ‘cellular component movement’, ‘organic substance biosynthetic process’, ‘organic substance catabolic process’, ‘secondary metabolic process’) are potentially related to cell differentiation events that occur during wood formation.

In roots, between seven (C. guianensis) and twenty-six (E. falcata) biological processes appeared as particularly over-expressed, eleven being shared by different species. They reflect two main functions of roots: water and nutrient uptake (‘response to inorganic substance’, ‘response to organic substance’, ‘transmembrane transport’) and response to stresses caused by soil constraints, which fall in two classes: (a) soil water depletion (e.g. ‘response to osmotic stress’), which frequently occurs in tropical rainforests during the dry season; (b) oxidative stresses caused by soil hypoxia, to which the processes ‘reactive oxygen species metabolic process’, ‘response to oxidative stress’, and ‘response to oxygen containing compound’ are related; flooding-induced hypoxia is particularly frequent in water-logged bottomlands.

rRNA intron-encoded homing endonucleases were very abundant in the E. falcata assembly (581 unigenes against 43, 39 and 17 unigenes in C. guianensis, S. globulifera and V. surinamensis respectively). In E. falcata, these unigenes comprised between two and 920 reads with a mean of 15.3 (s.d. = 69.77). Homing endonucleases from group I introns are self-splicing genetic elements or parasitic genes mostly found in organellar genomes [64–66]. Among contigs that showed BLASTX hits with rRNA-intron-encoded homing endonucleases in E. falcata, 69 were potentially polymorphic and contained from 1 to 18 mismatches with many haplotypes [67]. High transcription levels of such elements, combined with the high numbers of mutations that they have accumulated, suggest a massive but ancient genome invasion event [67, 68] in the E. falcata genome compared to the other three species. The evolutionary implications of transfers of such elements remain poorly understood, because of their ‘super-Mendelian’ inheritance (such elements may be both vertically and horizontally transmitted [69]), and because they have no known function [67].

Mismatch detection

It has been shown that relaxed criteria for in silico mismatch choice from next-generation sequencing data or previous EST databases lead to high failure rates in subsequent SNP design [70, 71]. We have applied a stringent filtering process based on data quality and a probabilistic argument in order to decrease the frequency of artifactual mismatches. Removal of poor-quality bases in the first steps reduced sequencing depth at mismatch positions from ~20–23x to ~16–17x (Additional file 1: Figure S1). Between 4434 (for S. globulifera) and 9076 (for V. surinamensis) potential variants were retained after all the filtering steps had been applied (Table 5). Between 5.5% (E. falcata) and 8.3% (V. surinamensis) of contigs contained at least one potential variant (Table 5). The great majority of mismatches (between 95.7% in C. guianensis and 99% in S. globulifera) were bi-allelic, with a majority of indels (Figure 5). The transition/transversion ratio (Ti/Tv) varied between 1.5 and 1.7, lower than those observed in other exome assemblies [71]. Estimated mismatch density across variable contigs varied between 0.89 per 100 bp (C. guianensis) and 1.05 per 100 bp (V. surinamensis) (Table 5). These estimates of mismatch density are in the same order of magnitude as SNP density estimates observed in other studies: Parchman et al. [72] reported between 0.6 and 1.1 SNPs per 100 bp in Pinus taeda, depending on the stringency of their filtering criteria. This may suggest that our mismatch filtering protocol eliminates large amounts of false variants, which would not be validated at the SNP design step. The validation of these mismatches is beyond the scope of this study, and therefore the variants identified here can only be considered as putative, candidate loci for polymorphism. Nevertheless, we advocate for the introduction of stringent criteria for the identification of these putative variants, as more liberal strategies can lead to large numbers of false positives, which lower the efficiency of large-scale SNP screenings.

Figure 5 Mismatches represented based on their allelic pattern.

Table 5 Mismatch identification

Candidate transcriptome polymorphism and its usefulness in population genetics studies

Next-generation sequencing, allowing massive de novo acquisition of molecular data, provides a range of new potential applications for evolutionary and ecological-genetic studies in non-model species. High-throughput SNP data have indeed shown their potential for inferences about demographic and adaptive processes in natural populations [16, 73–79]; for examples in tree species, see [80, 81]. However, SNP design and validation often have frustratingly low success rates, because candidate variant identification is not stringent enough; in this paper, we have proposed a strategy to filter out false positives based on multiple criteria.

Conclusion

The genomic resources obtained here will open new fields of research on tropical biodiversity. Providing a catalogue of putative functions for genomic regions with a high potential diversity will help identify useful candidate genes for further resequencing or SNP genotyping [12, 82, 83]. These genes belong to a large range of biological processes, including growth, reproduction, light and nutrient acquisition, as well as plant response to biotic and abiotic stresses. Focusing on genes potentially involved in adaptive processes in Neotropical forest tree species will make it possible to test hypotheses about evolutionary processes underlying genome evolution and the build-up of biological diversity in tropical forest ecosystems.

Availability of supporting data

The raw data were submitted to the ENA database (study number: PRJEB3286) and given the accession numbers ERS177107 through ERS177110.