Genome-wide characterization of simple sequence repeats in Palmae genomes.

BACKGROUND
Microsatellites or simple sequence repeats (SSRs) have become the most significant DNA marker technology used in genetic research. The availability of complete draft genomes for a number of Palmae species has made it possible to perform genome-wide analysis of SSRs in these species. Palm trees are tropical and subtropical plants with agricultural and economic importance due to the nutritional value of their fruit cultivars.


OBJECTIVE
This is the first comprehensive study examining and comparing microsatellites in completely-sequenced draft genomes of Palmae species.


METHODS
We identified and compared perfect SSRs with 1-6 bp nucleotide motifs to characterize microsatellites in Palmae species using PERF v0.2.5. We analyzed their relative abundance, relative density, and GC content in five palm species: Phoenix dactylifera, Cocos nucifera, Calamus simplicifolius, Elaeis oleifera, and Elaeis guineensis.


RESULTS
A total of 118241, 328189, 450753, 176608, and 70694 SSRs were identified, respectively. The six repeat types were not evenly distributed across the five genomes. Mono- and dinucleotide SSRs were the most abundant, and GC content was highest in tri- and hexanucleotide SSRs.


CONCLUSION
We expect this analysis to support more in-depth computational, biochemical, and molecular studies on the roles SSRs may play in the genome organization of palm species. The current study contributes a detailed characterization of simple sequence repeats in palm genomes.


Introduction
Plants in the palm family (Arecaceae or Palmae) are important economic crops that are widely cultivated in arid and semi-arid regions of North Africa, the Sahara, the Middle East, and eastward to the Indus Valley. Palmae is a distinct family of monocotyledons with up to 2800 known species distributed over 202 genera (Xiao et al. 2016). Palms are critical ecological and socioeconomic resources for many countries, including Saudi Arabia; they contribute to food security and supply building wood, ornamentals, and industrial materials (Barrow 1998; Aberlenc-Bertossi et al. 2014). The date palm (Phoenix dactylifera), coconut (Cocos nucifera), and African oil palm (Elaeis guineensis) are the most economically important fruit crops in the palm family. There are more than 3000 cultivars of date palm worldwide, of which 60 are considered important in the global market (Moussouni et al. 2017). Despite the increasing number of genomic studies on Palmae trees, little genome-wide characterization has been performed on these plants for the purposes of conservation and genetic assessment.
Assessment of genetic diversity is crucial for the conservation of palm cultivar germplasm. Estimates of the genetic diversity of palm plant germplasm have traditionally been based on morphological information (Elhoumaizi et al. 2002). However, morphological markers do not reliably provide accurate assessments because they are highly affected by environmental factors. Molecular markers are more informative at any developmental stage of the plant. Molecular breeding through marker-assisted selection would also expedite the genetic improvement of palm cultivars (Zhao et al. 2012). Microsatellites or simple sequence repeats (SSRs) are very useful markers for the analysis of plant diversity. In addition, SSR markers can be used for DNA fingerprinting to distinguish among closely-related palm cultivars. SSRs are tandem repeats of one to six base pairs per repeat unit, and are widely distributed in eukaryotic and prokaryotic genomes (Xu et al. 2016; Yang et al. 2003). Rapid expansions and contractions of these repeats due to replication slippage may make them useful for carrying out population genetics studies within a species (Huntley and Golding 2006).
The recent release of draft whole-genome sequences for several palm species provides an opportunity to carry out post-genomic analysis in order to identify and compare the distributions of SSRs across palm genomes. To date, draft genome sequences have been released for five species in the Palmae family: P. dactylifera, C. nucifera, Calamus simplicifolius, E. oleifera, and E. guineensis. This study aimed to screen these five genome sequences for microsatellites, detect SSR motifs, and analyze the frequency and distribution of SSRs.

Genomic quality assessment
Completeness of the genome assemblies was assessed with Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 (Simão et al. 2015) with default settings. BUSCO genes are good candidates for evaluating genome completeness because, from an evolutionary perspective, they are expected to be found in the tested genome (Simão et al. 2015; Waterhouse et al. 2017). The BUSCO tool classified the genes found in each genome assembly as complete BUSCOs, complete and single-copy BUSCOs, complete and duplicated BUSCOs, fragmented BUSCOs, and missing BUSCOs using a plant-specific database (embryophyta_odb9) that consisted of 1440 total BUSCO groups from 30 species.

Identification of microsatellites
Genome-wide SSR mining was performed by scanning each entire genome with the software PERF v0.2.5 (Avvaru et al. 2017). A number of criteria were adopted to identify perfect SSRs. Specifically, repeat sizes of 1 to 6 nucleotides long were searched, and minimum repeat numbers were restricted to 12 repeats for mononucleotides, 7 repeats for dinucleotides, 5 repeats for trinucleotides, and 4 repeats for tetra-, penta- and hexanucleotides, consistent with previous studies (Liu et al. 2017; Qi et al. 2018). The remaining parameters were set as default. Repeats with unit patterns being circular permutations and/or reverse complements were deemed as one type in this study (Jurka and Pethiyagoda 1995; Li et al. 2009). For instance, ACT contains ACT, TAC, CTA, TGA, ATG, and GAT in different reading frames or on the complementary strand. Different types of SSR repeats or motifs were compared in terms of relative frequency (the number of SSRs per Mb) and relative density (the total length of SSRs in bp per Mb). All graphical and statistical analyses were performed in the R programming environment (version 3.4.3) (R Core Team, 2017).
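The search criteria and the two summary metrics described above can be illustrated with a minimal Python sketch. It is not PERF and does not reproduce PERF's algorithm or output format; it is a simple regular-expression scan that assumes an uppercase A/C/G/T sequence, counts whole repeat units only, and ignores edge cases such as overlapping repeats of different motif lengths. The names `MIN_REPEATS`, `find_perfect_ssrs`, `relative_frequency`, and `relative_density` are hypothetical and introduced here for illustration.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds in the text
# (12 mono-, 7 di-, 5 tri-, 4 tetra-/penta-/hexanucleotide repeats).
MIN_REPEATS = {1: 12, 2: 7, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a tandem repeat of a shorter unit
    (e.g. 'AT' is primitive, 'ATAT' and 'AA' are not)."""
    return motif not in (motif + motif)[1:-1]


def find_perfect_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Illustrative regex scan only; an assumption of this sketch, not PERF's method.
    """
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # One motif of length k followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. 'AA' reported as a dinucleotide
                hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


def relative_frequency(hits, genome_bp):
    """Number of SSRs per Mb of analyzed sequence."""
    return len(hits) / (genome_bp / 1e6)


def relative_density(hits, genome_bp):
    """Total SSR length in bp per Mb of analyzed sequence."""
    return sum(end - start for start, end, _, _ in hits) / (genome_bp / 1e6)


if __name__ == "__main__":
    # Toy sequence: (A)12, (AG)7 and (AAT)5 pass the thresholds; (ACGT)3 does not.
    toy = "G" + "A" * 12 + "C" + "AG" * 7 + "C" + "AAT" * 5 + "C" + "ACGT" * 3
    for hit in find_perfect_ssrs(toy):
        print(hit)
```

In practice the genome length passed to the two metric functions would be the assembly size in bp, so that both values are comparable between assemblies of different sizes.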

Assessing the completeness of the genome assemblies
We adopted the BUSCO plant lineage dataset, which consisted of 1440 single-copy orthologs for the Embryophyta lineage, to assess the completeness of each of the five genome assemblies. The C. nucifera genome assembly had the highest BUSCO scores among those surveyed (Fig. 1), with 1311 (91%) complete BUSCOs (1200 complete single-copy and 111 complete duplicated BUSCOs); 3.80% of BUSCOs were fragmented (54 BUSCOs) and 5.20% were considered missing (75 BUSCOs). The BUSCO scores of the C. nucifera, P. dactylifera, and C. simplicifolius genome assemblies were comparable, and higher than those of the two assemblies from the genus Elaeis (E. oleifera and E. guineensis). The E. guineensis genome assembly, however, showed low BUSCO scores relative to all four of the other assemblies (Fig. 1). In the E. guineensis genome assembly, only 60 (4.20%) complete BUSCOs were identified (54 complete single-copy and 6 complete duplicated BUSCOs).
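As a quick arithmetic check, the BUSCO percentages can be recomputed from the category counts. The sketch below is plain Python, not part of the BUSCO tool; the function name `busco_summary` and its dictionary keys are hypothetical. It uses the C. nucifera counts quoted above and assumes that the four categories sum to the 1440 groups in the dataset.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Summarize BUSCO category counts as counts and percentages of the total."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated  # complete = single-copy + duplicated
    return {
        "total": total,
        "complete": complete,
        "complete_pct": 100 * complete / total,
        "fragmented_pct": 100 * fragmented / total,
        "missing_pct": 100 * missing / total,
    }


if __name__ == "__main__":
    # Counts reported in the text for the C. nucifera assembly.
    summary = busco_summary(single=1200, duplicated=111, fragmented=54, missing=75)
    for key, value in summary.items():
        print(key, round(value, 2))
```

With these counts the unrounded values are about 91.04% complete, 3.75% fragmented, and 5.21% missing, which matches the reported figures to one decimal place.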
The number, length, relative frequency, relative density, and percentage of the six types of SSRs are shown in Table 2. The percentage, relative frequencies, and densities of different SSR types were found to vary greatly between the five palm genomes (Fig. 2). Dinucleotide SSRs were the most frequent type in P. dactylifera, E. oleifera, and E. guineensis, with the highest frequencies of 84.15, 53.01, and 57.37 SSR/Mb, accounting for 39.61, 42.11, and 40.50% of SSRs in these genomes, respectively (Fig. 2a, b). Mononucleotide SSRs were the most abundant type in C. nucifera and C. simplicifolius, with the highest frequencies of 77.53 and 111.28 SSR/Mb, accounting for about 43.45 and 48.41% of all SSRs in those genomes, respectively. Mononucleotide SSRs were also the second most frequent in P. dactylifera, E. oleifera, and E. guineensis, while dinucleotide SSRs were the second most abundant type in C. nucifera and C. simplicifolius. Tri- and tetranucleotide SSRs were more frequent than pentanucleotide SSRs in all five genomes. Hexanucleotide SSRs were the least abundant across all five genomes, with frequencies below 2.38 SSR/Mb, and accounted for only 1.12, 1.00, 0.98, 1.01, and 0.07% of all SSRs in these genomes, respectively (Fig. 2b). Dinucleotide SSRs were found to have the highest densities, ranging from 1109.95 to 1901.36 bp/Mb in P. dactylifera, C. simplicifolius, E. oleifera, and E. guineensis, whereas mononucleotide SSRs had the highest density (1360.92 bp/Mb) in C. nucifera (Fig. 2c).

Abundance and repeat numbers for different microsatellite motifs
The microsatellites in palm genomes were determined to be relatively AT-rich. To gain insight into this characteristic, we analyzed SSR motif composition. The most abundant SSR motifs were found to vary with species. After merging circular permutations and reverse complements, the number of distinct repeat motifs was 2, 4, and 10 for mono-, di-, and trinucleotide repeats, respectively; these numbers were identical between species, whereas the numbers of distinct tetranucleotide, pentanucleotide, and hexanucleotide motifs differed between species.
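The values 2, 4, and 10 are the maximum possible numbers of mono-, di-, and trinucleotide motif classes once circular permutations and reverse complements are merged. The following Python sketch enumerates these classes. It assumes the conventional 5'→3' reading of the complementary strand (so the reverse-complement members of the ACT class are written AGT, GTA, and TAG) and excludes motifs that are repeats of a shorter unit; the function names are hypothetical.

```python
from itertools import product

# Translation table for base complementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a motif, read 5'->3' on the opposite strand."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All circular permutations of a motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical(motif):
    """Alphabetically smallest member of the motif's class, where a class holds
    all circular permutations of the motif and of its reverse complement."""
    return min(rotations(motif) + rotations(reverse_complement(motif)))


def is_primitive(motif):
    """True if the motif is not a tandem repeat of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def motif_classes(k):
    """Sorted list of canonical motif classes for motif length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({canonical(m) for m in motifs if is_primitive(m)})


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, len(motif_classes(k)))
    print(motif_classes(2))  # the four dinucleotide classes
```

Under these assumptions the theoretical class counts for motif lengths 1 to 6 are 2, 4, 10, 33, 102, and 350. A genome can contain every mono- to trinucleotide class but only a subset of the longer ones, which is consistent with the between-species differences noted for tetra- to hexanucleotide motifs.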

Mononucleotide repeats
The predominant mononucleotide motif type was (  98.66% of the total number of mononucleotide SSRs in these genomes, respectively (Fig. 3a).

Dinucleotide repeats
The (AG)n motif type was the most predominant dinucleotide SSR in P. dactylifera, with a frequency of 44.02 SSR/Mb and accounting for about 52.31% of all dinucleotide SSRs in this genome (Fig. 4b).

Discussion
The availability of genomic sequences for several palm species provides the opportunity to elucidate and compare the distributions of microsatellites across these genomes. In a previous study, genomic microsatellite loci were screened for two Palmae species (P. dactylifera and E. oleifera) (Xiao et al. 2016). To the best of our knowledge, the present study is the first comprehensive report on the identification of microsatellites with 1-6 bp nucleotide motifs in five Palmae species: P. dactylifera, C. nucifera, C. simplicifolius, E. oleifera, and E. guineensis. Consistent search parameters were used to perform the same analysis for all five palm genomes. Computational approaches were utilized to elucidate and compare the relative frequency, relative density, and GC content of SSRs in these species. Perfect microsatellites were found to comprise 0.23-0.44% of the five palm genomes. The percentages of SSRs in species within the same genus (E. oleifera and E. guineensis) were comparable, and lower than in the other three palm genomes. This variation in the percentage of genome SSR content may arise from differences in computational methods used for SSR identification, the relative completeness of different genome assemblies (as observed for the genome assembly of E. guineensis), or real variation in microsatellite content among these species (Sharma et al. 2007). The six types of SSRs were not equally represented in all five palm genomes. In general, mono- and dinucleotide repeats were found to prevail. More precisely, mononucleotide SSRs were the most frequent repeat type in C. nucifera and C. simplicifolius, consistent with previous findings in monocots and dicots (Sonah et al. 2011) and similar to what has been found for eukaryotic genomes overall (Sharma et al. 2007; Qi et al. 2015). Dinucleotide SSR repeats were the most abundant type in P. dactylifera, E. oleifera, and E. guineensis, which is consistent with prior findings for dicotyledons (Kumpatla and Mukhopadhyay 2005). Tri- and tetranucleotide SSR types were found to have very similar frequencies in the five palm genomes. Hexanucleotide repeats were the least frequent SSR type in all five species, which is similar to what has been seen in previous studies (Subramanian et al. 2002; Liu et al. 2017; Manee et al. 2019).
Previously, microsatellite abundances were found to be similar in species of the same genus (Shi et al. 2014). Here, only E. oleifera and E. guineensis are classified into the same genus, and these species did not have similar profiles overall. Interestingly, the overall frequency and density of SSRs were about the same in P. dactylifera and C. simplicifolius, suggesting potential similarity in the genomic structures of these two palm species. This is further supported by these genomes having similar abundances of SSRs by type, with the exception of mononucleotide SSRs.
Within each type of SSR, microsatellite motifs were found to vary greatly for each of the five palm genomes. Among mononucleotide repeats, the most abundant motif was (A/T)n, accounting for 86.37-99.84% of the total number of mononucleotide SSRs. This observation is consistent with previous results from Volvariella volvacea, Agaricus bisporus, and Coprinus cinereus. Of dinucleotide SSRs, the (AT)n motif was the most frequent in all examined genomes except for P. dactylifera, and this trend was similar in dicots (Sonah et al. 2011), pineapple (Fang et al. 2016), cucumber (Cavagnaro et al. 2010), and sweet orange (Biswas et al. 2014). The most abundant dinucleotide repeat in P. dactylifera was (AG)n, which is consistent with previous findings in Brachypodium distachyon (Sonah et al. 2011), wheat, and garden asparagus. Among trinucleotide SSRs, the (AAT)n motif was the most predominant in C. nucifera, C. simplicifolius, and E. guineensis, consistent with reports from garden asparagus, cucumber (Cavagnaro et al. 2010), pineapple (Fang et al. 2016), and Medicago truncatula and Populus trichocarpa (Sonah et al. 2011). The (AAG)n motif was the dominant trinucleotide motif in P. dactylifera and E. oleifera, similar to previous reports in Arabidopsis thaliana (Sonah et al. 2011) and Brassica species (Shi et al. 2013). The AT-rich motifs (AAAT)n, (AAAG)n, (AATT)n, (AAAAT)n, (AAAAG)n, (AATAT)n, (AAATT)n, (AAAAAT)n, (ACATAT)n, and (AAAAAG)n were the most abundant tetra-, penta- and hexanucleotide SSRs in the five palm genomes. Overall, the overrepresentation of (AT)n motifs in palm genomes can be explained by the fact that strand separation is easier for AT-rich than for GC-rich sequences, raising the possibility of slipped strand mispairing (Zhao et al. 2011). A previous study revealed that the (AAAT)n, (AAAAT)n, (AAAAT)n, and (AAAAAT)n motifs also predominated in Brassica species (Shi et al. 2013).
GC content varies greatly among different genomes because of different selective constraints. It is important to identify the driving force behind GC content diversity in order to understand genome evolution across species. The overall GC content of eukaryotic genomes does not vary widely (Šmarda and Bureš 2012). However, in plants, grass genomes are known to have high GC content compared to other angiosperm families (Barow and Meister 2002; Šmarda and Bureš 2012). This study found that the five palm genomes analyzed had lower GC content (28.12-39.65%) than do grasses (43.57-46.90%) (Singh et al. 2016), a number of Poaceae species, five monocots (43.57-46.14%), and two green algae (55.70 and 63.45%) (Zhao et al. 2014). In addition, GC content was not evenly distributed in three of the species, the exceptions being C. nucifera and E. guineensis (∼32%). Variation in GC content within each SSR type was also observed across the five genomes, with the exception of tetranucleotide SSRs. Tri- and hexanucleotide SSRs were generally found to have the highest GC contents. The results also suggested that (A/T)n motifs are the most predominant in each genome, consistent with findings in previous reports (Sharma et al. 2007; Shi et al. 2013; Li et al. 2016). This can be interpreted as confirming high AT content in the majority of the analyzed SSRs.
SSRs make up a significant proportion of eukaryotic genomes and are highly polymorphic, surpassing coding gene sequences in both respects (Katti et al. 2001). The high mutation rates of SSRs make them highly informative and useful for a wide range of applications such as evolutionary research, population genotyping, and marker-assisted breeding. Recent studies have utilized genome-wide approaches for the development of SSR markers in plants (Shi et al. 2014; Deng et al. 2016; Kumari et al. 2019). Perhaps the main advantage of this strategy is that it produces a large number of SSR markers distributed evenly throughout the genome. The construction of a Palmae SSR database for the scientific community would likely have a significant impact on genetic studies in those species.
Comparative analysis of SSRs in these five palm genomes will provide a better understanding of the nature of these important sequences and will facilitate research on the role of SSRs in genome organization. Such knowledge will serve many useful purposes, including, among many others, the isolation and development of abundant markers for genetic and evolutionary studies mentioned above. In particular, elucidating the most frequent repeats in palm genomes provides an essential starting point for the library-based selection of markers that will be informative in distinguishing populations and cultivars within a species, or even for cross-species applications. This further provides an important foundation for characterizing genetic diversity in palm germplasm and for performing selection on valuable or undesired attributes while also maintaining and/or improving diversity.