Introduction

Forests hold in excess of 75 percent of the Earth’s biodiversity (Cvetković et al. 2019). Among them, tropical forests are a key category of luxuriant land ecosystems that encompass the world’s most diverse habitat types and support many dominant tree species. In tropical Asia, particularly the Indian subcontinent, Shorea robusta Gaertn. F. (Vern: Sal), a diploid (2n = 2x = 14) outcrossing species belonging to the family Dipterocarpaceae, has fundamental ecological and evolutionary significance besides being utilized for commercial timber production worldwide (Gautam et al. 2007). Further, the species is used for medicine, fodder and fuelwood by local and forest-dwelling communities (Adhikari et al. 2017). Owing to substantial overexploitation and habitat fragmentation of the tropical forests, the species’ range is in an alarming state, threatening the long-term maintenance of its genetic diversity and its survival (Gautam and Devoe 2006). Recently, a SWOT analysis of the century-old regeneration problem of S. robusta revealed the need for molecular marker development and genetic diversity assessment in this vital species (Mishra et al. 2020). By and large, genetic knowledge of tropical forest species is more limited than that of temperate or boreal forest species (Finkeldey and Hattemer 2007). Hence, a comprehensive analysis of population genetics in S. robusta is needed to reveal the present status of gene flow, genetic diversity and population structure. For such genetic analysis, suitable molecular marker techniques are vital. Schulman (2007) stated, “since before the beginning of molecular markers, the use of traits in plants as markers for their genetic relationship predates genetics itself”, which illustrates the long-standing importance of markers for genetics-based studies.

During the last three decades, the world has witnessed a rapid increase in knowledge about plant genomic sequences and the physiological and molecular roles of various plant genes, which has revolutionized population genetics and its application in species improvement programmes (Nadeem et al. 2018). Yet, to date, very few studies have explored the genetic diversity and population structure of natural populations of S. robusta. Previous studies analyzed the genetic diversity of S. robusta using isozyme and Inter Simple Sequence Repeat (ISSR) markers (Suoheimo et al. 1999; Surabhi et al. 2017). However, an overview of the genetic and population structure is presently not available for this premier timber resource of the subcontinent, and only limited information could be extracted owing to the scarcity of markers. Therefore, a robust marker system is needed for population genetic analysis. Simple Sequence Repeats (SSRs) are the markers of primary choice for this purpose owing to several desirable features, such as codominance, high variability, reproducibility, wide genomic coverage, extensive information, and accessibility (Powell et al. 1996; Nybom 2004; López-Gartner et al. 2009; Wang et al. 2019). Despite recent advances in molecular markers, such as Single-Nucleotide Polymorphisms (SNPs) and DNA array-based markers, SSRs hold promise as breeder-friendly markers involving limited technical or operating difficulties.

Considering the above facts, the present work reports the first low-depth genomic sequence data of S. robusta, with the main objectives to: (1) provide high-quality sequence data and enrich the current knowledge of the genomic background of S. robusta; (2) identify and develop novel SSR markers based on the sequence-specific information; (3) functionally annotate the designed SSRs using public databases; and finally (4) validate the polymorphic SSR markers in S. robusta populations to authenticate the markers discovered.

Material and methods

Plant materials and DNA isolation

Based on a wide-ranging field survey, forty-eight indigenous accessions of S. robusta were collected from three different geographical locations in the state of Uttarakhand (India), with their geospatial features (viz. longitude, latitude and altitude) shown in Supplementary Table 1. Individuals growing close to each other may be closely related and hence show less variation. Thus, leaves were collected randomly from trees representing different size classes (DBH) and spaced 300 m apart, with populations distributed evenly over a wide area to capture as much diversity as possible. Samples were immediately dried with silica gel, brought to the laboratory of the Genetics and Tree Improvement Division, Forest Research Institute, Dehradun, and stored at −80 °C. Genomic DNA was isolated from leaf tissues using the Doyle and Doyle (1990) protocol with minor modifications.

Illumina sequencing, library construction and genome assembly preparation

Genomic DNA was sequenced using Illumina dye sequencing. The sequencing was performed by M/s Clevergene Biocorp Private Limited (Bengaluru, Karnataka) on a HiSeq X System (Illumina, San Diego, California, USA). Stringent filtering criteria were applied to eliminate low-quality reads and adapter sequences using fastp (Chen et al. 2018), a data pre-processing tool for quality control, adapter trimming, quality filtering, and read pruning, to obtain high-quality clean reads. The sequence reads were then subjected to quality testing using FastQC and MultiQC (Ewels et al. 2016), which allowed the analysis of parameters including base call quality distribution, % bases above Q20 and Q30, % GC, and adapter sequence contamination. The processed reads were assembled using the assembler Megahit v1.1.3 (Li et al. 2015). The k-mer size range was set from 21 to 141 with an increment of 28 using the k-min, k-max, and k-step parameters. Contigs shorter than 200 bp were removed from the assembly. Processed reads were mapped back to the assembled genome using the aligner bowtie2 with default parameters (Langmead et al. 2012). The appropriate k-mer assembly was selected for SSR mining on the basis of quality parameters. Subsequently, genome coverage was evaluated using the formula (https://genohub.com; https://www.illumina.com):

$$\text{Genome coverage (GC)}=\frac{\text{number of reads}\times \text{read length}}{\text{assembly size}}$$

Finally, using the software Repeatmasker (https://www.repeatmasker.org/faq.html#faq3), the repeat sequences were masked.
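The coverage formula above can be checked with a few lines of Python. This is a minimal sketch rather than the pipeline used in the study: the function name `genome_coverage` is hypothetical, the read count and assembly size are the figures reported in this paper, and the 150 bp read length is an assumption, since the text does not state it.

```python
def genome_coverage(n_reads: float, read_length_bp: float, assembly_size_bp: float) -> float:
    """Average depth = (number of reads * read length) / assembly size."""
    if assembly_size_bp <= 0:
        raise ValueError("assembly size must be positive")
    return n_reads * read_length_bp / assembly_size_bp


# Figures reported in this paper: 69.88 million reads, 357.11 Mb assembly.
# ASSUMPTION: 150 bp reads (the read length is not given in the text).
coverage = genome_coverage(69.88e6, 150, 357.11e6)
print(f"Estimated coverage: {coverage:.1f}x")
```

Under the assumed read length, the estimate rounds to about 29 ×, in line with the coverage reported for the assembly.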

The SSR motif detection, primer designing and bioinformatics analysis

The program MIcroSAtellite (MISA) was used to detect and locate SSRs in the genomic DNA (Beier et al. 2017). The repeats in the assembled genome occurred at varied frequencies of di-, tri-, tetra-, penta-, and hexa-nucleotides. The program identified and located perfect as well as compound microsatellites. Further, primer pairs flanking the SSR regions were designed using the program PRIMER3 (https://bioinfo.ut.ee/primer3). Only SSRs with at least 100 bp of flanking sequence on both ends were retained for primer design.
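To illustrate what detecting perfect microsatellites involves, the following Python sketch scans a sequence for tandemly repeated di- to hexa-nucleotide motifs with a regular expression. It is not MISA itself. The function name `find_ssrs` is hypothetical, and the minimum repeat counts are assumed MISA-style thresholds, since the thresholds used in the study are not stated in the text.

```python
import re

# ASSUMPTION: minimum number of repeats per motif length (MISA-style values;
# the thresholds actually used in the study are not given in the text).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) for each perfect SSR found."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least (n - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy sequence (hypothetical): an (AT)8 repeat and an (AAT)5 repeat.
toy = "GGC" + "AT" * 8 + "CCG" + "AAT" * 5 + "GC"
for start, end, motif, count in find_ssrs(toy):
    print(start, end, motif, count)
```

Coordinates are 0-based with an exclusive end. Compound microsatellites, which MISA also reports, would need an extra step that merges neighbouring hits within a chosen distance.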

Using NCBI BLASTX (https://blast.ncbi.nlm.nih.gov), the polymorphic SSR markers were compared against the non-redundant protein database to assess their putative functions. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database was used to obtain systematic functional data on genes and their roles in biological systems (Kanehisa 2000). Further, KEGG BRITE (KB) was utilized to resolve the functional hierarchy, while KEGG PATHWAY (KP) maps were used to illustrate molecular interactions and reactions. In addition, the KEGG Orthology (KO) numbers obtained from the KEGG server were used to summarize gene names, gene orthologs, functional definitions of the orthologs, and functional pathways through a stand-alone tool, Gene Annotation Easy Viewer (GAEV) (Huynh and Xu 2018). The Linux-based Krait tool was used to infer the relative abundance (loci Mb−1) and density (bp Mb−1) of the detected SSRs (Du et al. 2018). Lastly, functional enrichment analysis was done through g:Profiler (Raudvere et al. 2019).

SSR validation, PCR amplification and data analysis

A subset of 60 primer pairs, comprising 20 each of tri-, tetra-, and penta-nucleotide repeats, was synthesized for validation based on stringent parameters, such as product size 150–250 bp, GC content 40–60% and temperature 50–60 °C. The primers were tested for amplification in a polymerase chain reaction (PCR) thermal cycler (Eppendorf Mastercycler Nexus). The annealing temperature of the primers was screened and optimized by gradient PCR (Tm gradient range of ± 3 °C). The amplification was performed in a 15-µl PCR reaction mixture containing 30 ng of template DNA, 7.5 µl of Taq mix, 0.1–1 µg of both forward and reverse primers and nuclease-free sterile water. The PCR conditions were as follows: initial denaturation at 94 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at the primer-specific temperature for 30 s, and extension at 72 °C for 45 s; and a final extension at 72 °C for 3 min. The PCR products were separated by electrophoresis on a 2% agarose gel buffered with 1 × TBE (Tris/borate/EDTA) along with a 100 bp DNA ladder. The gel was stained with ethidium bromide (0.5 μg ml−1) and visualized in a gel documentation system. After PCR amplification in 15 random genotypes representing 3 different populations of S. robusta, positively amplified PCR products were resolved in 3% high-resolution agarose (Sigma-Aldrich) to check for polymorphism. Finally, polymorphic primers were identified as those amplifying alleles of various sizes across the genotypes. The band profile produced by each SSR was scored manually by giving each band an estimated allele size, which was then adjusted in accordance with the repeat motifs of the primers using the allele binning tool TANDEM v1.07 (Matschiner and Salzburger 2009). Scoring errors and an excess of homozygotes at each locus, indicating the presence of null alleles, were identified with the program Microchecker v2.2 (Van Oosterhout et al. 2004).
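The primer selection criteria stated above (product size 150–250 bp, GC 40–60%, temperature 50–60 °C) can be expressed as a simple filter. The sketch below is illustrative only: the function names, the example primer sequences and the temperatures are hypothetical, and the temperatures are taken as given inputs (for example from the Primer3 output) rather than computed.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def passes_filters(forward, reverse, product_size, tm_forward, tm_reverse):
    """Apply the three thresholds stated in the text to one primer pair."""
    size_ok = 150 <= product_size <= 250
    gc_ok = all(40.0 <= gc_percent(p) <= 60.0 for p in (forward, reverse))
    # ASSUMPTION: the 50-60 degree criterion applies to each primer's Tm.
    tm_ok = all(50.0 <= t <= 60.0 for t in (tm_forward, tm_reverse))
    return size_ok and gc_ok and tm_ok


# Hypothetical primer pair, for illustration only.
fwd = "ATGCATGCATGCATGCATGC"
rev = "GGCCAATTGGCCAATTGGCC"
print(passes_filters(fwd, rev, 200, 55.0, 57.0))
```

A pair is kept only if all three conditions hold, so a primer pair with a 300 bp product would be rejected even when its GC content and temperatures are within range.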
Afterwards, the allelic data were used to characterize the primers and estimate the informativeness of the SSR markers developed by calculating parameters such as the number of different alleles per locus (Na), number of effective alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), and polymorphism information content (PIC), using the programs PowerMarker v3.25 (Liu and Muse 2005) and GenAlEx v6.5 (Peakall and Smouse 2012). Further, an analysis of molecular variance (AMOVA) was performed to partition the variation among populations and among genotypes within each population, and genetic differentiation (FST) and the inbreeding coefficient (FIS) were calculated in GenAlEx. The population structure of the 48 genotypes was assessed with 9 SSRs using STRUCTURE v2.2 under the admixture model with correlated allele frequencies to determine the number of subpopulations (K) (Pritchard et al. 2000; Evanno et al. 2005). Genetic similarity between genotypes was determined with the Jaccard similarity coefficient, and a dendrogram was generated by the unweighted pair group method with arithmetic mean (UPGMA) using the SAHN clustering tool of NTSYS-pc v2.10 (Rohlf 1998).
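The per-locus statistics named above follow standard definitions: with allele frequencies p_i, He = 1 − Σp_i², Ne = 1/Σp_i², Ho is the proportion of heterozygous genotypes, and PIC = 1 − Σp_i² − ΣΣ 2p_i²p_j² over pairs i < j. The Python sketch below computes them for one locus. It is not the PowerMarker or GenAlEx implementation, the function name `locus_stats` is hypothetical, and the genotypes are made-up allele sizes used purely for illustration.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes):
    """Na, Ne, Ho, He and PIC for one codominant locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    He is the uncorrected form 1 - sum(p_i^2); no small-sample correction is applied.
    """
    alleles = [allele for genotype in genotypes for allele in genotype]
    counts = Counter(alleles)
    freqs = [c / len(alleles) for c in counts.values()]
    sum_sq = sum(p * p for p in freqs)
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    he = 1.0 - sum_sq
    ne = 1.0 / sum_sq
    pic = he - sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return {"Na": len(counts), "Ne": ne, "Ho": ho, "He": he, "PIC": pic}


# Hypothetical genotypes (allele sizes in bp) for four individuals at one locus.
toy_genotypes = [(150, 150), (150, 156), (156, 156), (150, 156)]
stats = locus_stats(toy_genotypes)
print(stats)
```

For this toy locus with two equally frequent alleles, the function gives Na = 2, Ne = 2, Ho = He = 0.5 and PIC = 0.375, which shows that PIC is always somewhat lower than He for the same allele frequencies.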

Results

Illumina sequencing, assembly, SSR identification and primer design

A total of ~ 10 Gb of data, represented by 69.88 million raw reads, was obtained from a low-depth high-throughput genome sequencing approach (Table 1). The quality of the sequence data was reflected in the calculated parameters, viz. GC content (33.69%) and bases above Q20 (98.615%) and Q30 (91.23%), which were suitable for further processing. After quality filtration, cleaned paired reads were de novo assembled into 197,489 contigs (29 × coverage) with an L50 of 16,369, L75 of 49,235, N50 of 5062, and N75 of 1536. Assemblies generated at different k-mer sizes were compared on parameters such as contig size, percentage of reads aligned, L50, L75, N50 and N75, and the assembly with the highest percentage of aligned reads (93.2%) was selected for SSR prediction. The raw sequencing data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) with accession number PRJNA639024.

Table 1 Summary statistics of shallow genome sequenced data

The genome sequence data were utilized for the identification of microsatellite repeats and the development of SSR markers in S. robusta by scanning the contigs with the Perl script MISA, which identified a total of 57,702 microsatellite repeats, from which 35,049 primer pairs were successfully designed. These SSRs were prefixed with ‘SRGMS’, which stands for ‘Shorea robusta Genomic Microsatellite’ marker. Additionally, repeats were analyzed for their frequency and distribution in the genome, where AT/AT repeats were most plentiful among di-nucleotides (60.60%), AAT/ATT among tri- (30.55%) and AAAT/ATTT among tetra-nucleotides (6.48%). Lastly, AAAAT/ATTTT and AAAAAT/ATTTTT occurred at very low frequency among penta- (1.75%) and hexa-nucleotides (0.61%), respectively (Fig. 1a–d). The relative abundance and density of each repeat type were also determined, with di-nucleotides having the highest relative abundance (75.00 loci Mb−1) and density (1514.04 bp Mb−1), followed by tri- (49.08 loci Mb−1; 993.27 bp Mb−1), tetra- (32.06 loci Mb−1; 574.07 bp Mb−1), penta- (11.61 loci Mb−1; 252.00 bp Mb−1), and hexa-nucleotides (4.67 loci Mb−1; 119.80 bp Mb−1). Across genomic regions, di-nucleotides were the most abundant type, except in the CoDing Sequences (CDS) and exon regions, where tri-nucleotides were most abundant (Fig. 1e).

Fig. 1

The frequency distribution of simple sequence repeat (SSR) types: (a) radar plot of all types of SSR motifs; (b–d) the most predominant repeat motifs, i.e., di-, tri-, and tetra-nucleotides; and (e) the distribution of identified microsatellite motifs in different genomic regions of S. robusta

For validation, a total of 60 primer pairs (20 each) from the tri-, tetra-, and penta-nucleotide classes were synthesized. Of these, 50 amplified successfully, and 24 of the 50 (48%) revealed a polymorphic banding pattern (Table 2; Fig. 2a, b; and Supplementary Fig. 1a–v), whereas the remaining ten did not amplify. Further, errors in fragment separation and allele scoring were eliminated by binning, and the data were checked for null alleles. This revealed that 15 of the 24 primer pairs showed an excess of homozygotes. Assessing parameters such as Na, Ho and He for the full dataset and for the dataset excluding the null alleles revealed a significant disparity between the estimates. The full datasets and gel images for all 24 polymorphic primers are provided in Supplementary Table 2 and Supplementary Fig. 1. Consequently, for the authentication of the primers, nine SSR loci were evaluated across the 48 genotypes representing three S. robusta populations (Table 3).
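The counts in the validation step can be tied together with a short calculation. The sketch below uses only figures reported in this paper; the variable names are hypothetical, and reading the nine evaluated loci as the 24 polymorphic pairs minus the 15 with an excess of homozygotes is an interpretation of the text rather than something it states outright.

```python
# Counts reported in the text.
synthesized = 60          # primer pairs synthesized
amplified = 50            # pairs giving clear amplification
polymorphic = 24          # amplified pairs with polymorphic banding
excess_homozygotes = 15   # polymorphic pairs flagged for excess homozygotes

amplification_rate = 100 * amplified / synthesized   # percent of synthesized pairs
polymorphic_rate = 100 * polymorphic / amplified     # percent of amplified pairs
# ASSUMPTION: the loci carried forward are those not flagged for excess homozygotes.
retained = polymorphic - excess_homozygotes

print(f"{amplification_rate:.2f}% amplified, {polymorphic_rate:.0f}% polymorphic, {retained} retained")
```

The amplification rate uses the 60 synthesized pairs as its denominator, while the polymorphism rate uses the 50 amplified pairs.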

Table 2 Characteristics and putative functions of 24 polymorphic SSRs (SRGMS) with E-value of S. robusta
Fig. 2

Representative sample* through SSRs showing polymorphic banding pattern in S. robusta: (a) SRGMS 211, and (b) SRGMS 1370. *Where, M: 100 bp DNA ladder; 1–48 representing 16 genotypes each from 3 populations

Table 3 Genetic polymorphism of 9 SSR loci evaluated in three S. robusta populations

Functional annotation

The putative functions of all the polymorphic SSRs were obtained using an NCBI BLASTX sequence similarity search against the non-redundant protein database in order to highlight their functional relevance (Table 2). Out of 197,489 contigs, 15,914 were successfully mapped to 431 pathways. The pathways with the largest numbers of contigs were pathways of neurodegeneration - multiple diseases (260 contigs); amyotrophic lateral sclerosis (231 contigs); Alzheimer disease (219 contigs); prion disease (162 contigs); Salmonella infection (145 contigs); thermogenesis (144 contigs); human papillomavirus infection (144 contigs); Coronavirus disease - COVID-19 (135 contigs); endocytosis (134 contigs); chemical carcinogenesis - reactive oxygen species (132 contigs), etc. (Supplementary Table 3).

Functional hierarchies were obtained through KB and characterized into three categories of protein families, namely (i) metabolism, (ii) genetic information processing, and (iii) signaling and cellular processes (Fig. 3). The KO numbers obtained through KEGG were annotated using GAEV and further characterized by g:Profiler into molecular function, cellular component, and biological process, with their GO IDs and p-values. The highest number of GO terms was involved in biological process (382), followed by cellular component (91) and molecular function (61), as revealed by the Manhattan-like plot. The table below the plot lists the data source, GO ID, term name, and p-value for each term that the plot highlights when hovering over a circle. For example, circle No. 1 marks the enrichment of the term GO:0009987 (cellular process), circle No. 2 the term GO:0008152 (metabolic process), and so on (Fig. 4). Detailed results for 100 terms of the biological process, cellular component, and molecular function categories are illustrated in Supplementary Fig. 2.

Fig. 3

Functional hierarchies obtained through KEGG BRITE

Fig. 4

The hierarchical clustering of the genes assigned to a particular process in GO

Polymorphic potential of novel marker loci and their efficacy in population genetic analysis

Polymorphic primers were utilized for the estimation of key diversity measures in 48 genotypes belonging to three distantly located populations of S. robusta in Uttarakhand (Table 3). In total, 22 alleles were generated with nine SSRs across the genotypes, with an average of 2.44 alleles per locus. The PIC of each SSR primer pair ranged from 0.020 to 0.554, with a mean of 0.252 ± 0.06. Across all the populations, Ho of the primers ranged from 0.021 to 1.000 with a mean of 0.324 ± 0.10, while He ranged from 0.020 to 0.596 with a mean of 0.277 ± 0.07. Further, AMOVA revealed that most of the genetic variation (97%) resided within populations; thus, very low genetic differentiation (FST = 0.029) was observed among the populations. This is also supported by the high mean value of gene flow across the primers (Nm = 17.90). The inbreeding coefficient (FIS) observed among the sampled populations ranged from −0.679 to 0.206 with a mean value of −0.109 ± 0.08.
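Gene flow is commonly derived from FST through Wright's island-model approximation, Nm = (1/FST − 1)/4. The sketch below applies this standard formula. The function name and the per-locus FST values are hypothetical, and whether the reported Nm was obtained exactly this way is an assumption. Because the relationship is nonlinear, a mean of per-locus Nm values (such as the 17.90 reported across primers) need not equal the Nm computed from the overall FST.

```python
def nm_from_fst(fst: float) -> float:
    """Wright's island-model approximation: Nm = (1/FST - 1) / 4."""
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


# Overall FST reported in this paper.
nm_overall = nm_from_fst(0.029)
print(f"Nm from overall FST: {nm_overall:.2f}")

# HYPOTHETICAL per-locus FST values, to show the effect of averaging.
per_locus_fst = [0.01, 0.05]
mean_of_nm = sum(nm_from_fst(f) for f in per_locus_fst) / len(per_locus_fst)
nm_of_mean = nm_from_fst(sum(per_locus_fst) / len(per_locus_fst))
print(f"mean of per-locus Nm: {mean_of_nm:.2f}; Nm of mean FST: {nm_of_mean:.2f}")
```

In the hypothetical two-locus case the mean of per-locus Nm exceeds the Nm of the mean FST, which is the direction of the difference seen between the reported mean Nm and the value implied by the overall FST.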

Moreover, structure analysis suggested an optimal K value of 2 [Supplementary Fig. 3a(i-iii)], which is too low to infer any meaningful subdivision. As a result, no clear-cut structure was apparent in the investigated populations of S. robusta. A PCoA plot (Supplementary Fig. 3b) and a UPGMA dendrogram (Fig. 5) were produced from the intra-specific genetic diversity analysis using SSR markers. These results showed that the 48 genotypes split at a similarity coefficient of 0.79 into two distinct groups (Gp) of S. robusta, with GpI and GpII consisting of 44 and 4 genotypes, respectively. Notably, GpI was separated at a similarity coefficient of 0.800 into two subgroups (SbGp), namely SbGpIa (43 genotypes) and SbGpIb (1 genotype), while GpII was separated at a similarity coefficient of 0.885 into SbGpIIa (2 genotypes) and SbGpIIb (2 genotypes).
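The similarity coefficients quoted above are Jaccard coefficients, which for two band profiles equal the number of bands present in both divided by the number present in at least one. The following sketch shows the calculation on binary presence/absence vectors. It is not the NTSYS-pc implementation, the function name is hypothetical, and the two band profiles are invented for illustration.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard coefficient for two binary band profiles of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of band positions")
    shared = sum(1 for x, y in zip(profile_a, profile_b) if x and y)
    either = sum(1 for x, y in zip(profile_a, profile_b) if x or y)
    # ASSUMPTION: two profiles with no bands at all are assigned similarity 0.0.
    return shared / either if either else 0.0


# Hypothetical presence (1) / absence (0) scores at four band positions.
genotype_1 = [1, 1, 0, 1]
genotype_2 = [1, 0, 0, 1]
print(round(jaccard_similarity(genotype_1, genotype_2), 3))
```

Positions where both genotypes lack a band do not contribute to the coefficient, which is the usual reason for preferring it over simple matching when scoring gel bands. A pairwise matrix of such values is the input to UPGMA clustering.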

Fig. 5

The UPGMA dendrogram based on unbiased measures of genetic distance among 48 genotypes representing 3 populations of S. robusta

Discussion

Outcrossing species generally have a great potential for gene flow, which helps them maintain high levels of genetic diversity within populations (Hamrick 1983; Hamrick and Godt 1989; Tam et al. 2014). Given the persistence of genetic diversity, tree species typically acclimatize to long-term environmental change (Hedrick 2004). Compared with anonymous markers, SSR markers yield more precise estimates of genetic diversity (Feng et al. 2016). Recently developed next-generation sequencing (NGS) platforms, such as Roche’s 454 GS FLX, Illumina’s Genome Analyzer (GA) and ABI’s SOLiD, offer opportunities for high-throughput, cost-effective genome sequencing and rapid marker development (Li et al. 2018). Compared with the traditional library-based and in silico methods, DNA-Seq via Illumina is quicker and cheaper, and depends less on existing genetic resources of the target plant species for sequence-based marker development. This should also advance molecular marker-based studies on plants that lack a genomic database (Bosamia et al. 2015). For instance, high-throughput transcriptome sequencing has been successfully employed for identifying SSRs in trees, such as Hevea brasiliensis, Carapa guianensis, Eperua falcata, and Symphonia globulifera (Brousseau et al. 2014; Sae-Lim et al. 2019); and in angiosperms, such as rose, peony, and olive (Gao et al. 2013; Yan et al. 2015; Mariotti et al. 2016). Further, numerous SSR markers have been developed for Shorea curtisii (Ujino et al. 1998; Obayashi et al. 2002; Ho et al. 2006) and Shorea leprosula (Lee et al. 2000; Lee et al. 2004; Cao et al. 2006), but only a limited number of microsatellite markers have been reported for S. robusta (Pandey and Geburek 2009, 2010, 2011), since the species lacks the genomic sequence data that are essential for SSR mining. SSRs mined from such data could thus play an important role in genetic diversity analysis, gene flow pattern, DNA fingerprinting, marker assisted selection (MAS), etc.

The present study reports the discovery of 35,049 novel SSRs (24 of the 60 tested were validated and found polymorphic) in S. robusta, for which ~ 10 Gb of raw sequence data were generated and assembled into 197,489 contigs representing a genome size of 357.11 Mb with a coverage of 29 ×. In the recent past, this approach has been used to develop microsatellite markers in various species, viz. S. leprosula (Ng et al. 2009; Ng et al. 2021), Grevillea thelemanniana (Hevroy et al. 2013), Macadamia integrifolia (Nock et al. 2016), Populus pruinosa (Yang et al. 2017), G. juniperina (Damerval et al. 2019), Exbucklandia tonkinensis (Huang et al. 2019), etc., signifying the potential of this technology for the identification and development of novel SSRs in S. robusta, a species devoid of genome sequence information.

Genome annotation to obtain genome-wide information has become far easier since the advent of NGS. Nevertheless, annotation-related tasks remain challenging and rely on the available tools and procedures to decipher the information contained in the sequenced genome. For the 24 polymorphic SSRs, the top-hit species in the putative function search were Theobroma cacao, Erythranthe guttata, Glycine max, H. brasiliensis, Ricinus communis, Citrus unshiu, Gossypium raimondii, Durio zibethinus, Gossypium arboreum, Vernicia fordii, Corchorus capsularis, Brassica napus, Cucurbita maxima, Gossypium hirsutum, and Cephalotus follicularis. The KP aims to organize and computerize current knowledge of molecular and genetic pathways from an experimental viewpoint, supporting the understanding of molecular interaction and reaction networks. Here, the KEGG database was used to perform functional annotation, delivering specifics of organismal genes and pathways besides establishing an association between them. Likewise, systematic identification of Expressed Sequence Tag (EST)-based SSRs (EST-SSRs) was carried out in Pinus taeda in California (Liewlaksaneeyanawin et al. 2004); S. leprosula in Indonesia (Ohtani et al. 2012); V. fordii and Vernicia montana in southwestern China and northern Laos (Xu et al. 2012); Pinus dabeshanensis in China (Xiang et al. 2015); H. brasiliensis in (Danzhou) China (Hou et al. 2017); and Dalbergia odorifera in China (Liu et al. 2019); whereas putative functional SNP markers were detected for Shorea parvifolia in (Kuching Sarawak) Malaysia (Seng et al. 2011) and Juniperus phoenicea subsp. turbinata in Spain (Garcia et al. 2018).

KB brings together a variety of interactions, such as those between genes and proteins, elements and reactions, medications and illnesses, and organisms and cells (Kanehisa et al. 2019). Parallel studies were reported in Vatica mangachapoi (Tang et al. 2022), Hopea hainanensis (Huang et al. 2022), Neesia altissima (Pratiwi et al. 2022), C. capsularis (Satya et al. 2017), Hibiscus hamabo Siebold & Zuccarini (Wang et al. 2021), Abelmoschus esculentus (Nieuwenhuis et al. 2021), Helicoverpa armigera (de la Paz Celorio-Mancera et al. 2011), Gasterophilus nasalis (Zhang et al. 2021), and Operculina turpethum (Biswal et al. 2021). Additionally, GAEV was used to annotate the KO numbers, as in earlier studies (Iacobas et al. 2019; Emami-Khoyi et al. 2020; Nand et al. 2020; Shah et al. 2021). The biological process category (382 terms), which showed the highest level of involvement in the functional enrichment analysis, is highlighted here by GO ID and p-value. Lately, this kind of characterization and annotation of genes has been used to predict common functions of 12,886 whole-genome duplication (WGD) in S. leprosula (Ng et al. 2021), examine differentially expressed genes (Yamasaki et al. 2017), validate immune genes (Karthikeyan et al. 2021), identify a novel prognostic biomarker (Xu et al. 2020), analyze Integrated Gene Expression Profiling Data (IGEPA) (You et al. 2020), identify blood-based signature molecules and drug targets of patients with COVID-19 (Hasan et al. 2022), and annotate protein–protein interactions (Ieremie et al. 2022).

In the present study, a total of 57,702 SSRs were identified. Di-nucleotides were the most numerous (34,969) and also showed the maximum relative abundance and density (75.00 loci Mb−1; 1514.04 bp Mb−1), followed by tri- (17,630), tetra- (3741), penta- (1011), and hexa-nucleotides (351). The SSR repeat analysis revealed that the most frequent motifs were AT/AT and AAT/ATT, similar to the study conducted on arid-zone S. oleoides (Bhandari et al. 2020). In other species, such as S. curtisii, simple CT and compound repeats of CT, CA, AT, and CTCA were observed (Ujino et al. 1998), whereas in Drepanostachyum falcatum, AG/CT and CCG/CGG were observed in maximum number (Meena et al. 2021).
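The class counts listed above can be checked against the total number of microsatellite repeats and the percentages given in the Results. The short sketch below does this arithmetic using only counts reported in this paper; the variable names are hypothetical.

```python
# SSR counts per repeat class, as reported in the text.
counts = {"di": 34969, "tri": 17630, "tetra": 3741, "penta": 1011, "hexa": 351}

total = sum(counts.values())
# Share of each class among all detected SSRs, in percent.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```

The counts sum to 57,702, the total number of microsatellite repeats identified by MISA, and the resulting shares match the percentages quoted for the five repeat classes in the Results.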

The characterization of genetic diversity patterns at intra- and inter-population levels is a fundamental requirement for the establishment of forest genetic resources conservation and tree improvement programmes (Stojnic et al. 2019). Molecular tools play an important role in the efficient management and utilization of genetic assets, and SSR-based molecular markers are therefore increasingly used to reveal the genetic diversity among the populations of a species. Notably, the standardization of the DNA isolation protocol, the quality of the markers, and the accuracy of the genotyping data determine the effectiveness and success of SSRs (Liu et al. 2017). In this research, a total of 50 out of 60 primer pairs yielded clear bands across three different populations of S. robusta. The amplification rate (83.33%) was higher than that reported for S. curtisii (23.07%) (Ujino et al. 1998), Liquidambar formosana (72%) (Chen et al. 2020) and Parashorea malaanonan (82%) (Abasolo et al. 2009), probably because the markers are species-specific.

Additionally, among the 48 accessions of S. robusta, 24 out of 60 markers exhibited moderate levels of polymorphism. Here, a total of 22 alleles with an average of 2.44 alleles per locus were generated, which is considerably lower than in another member of the family Dipterocarpaceae, Hopea hainanensis, which revealed a total of 229 alleles with an average of 11.45 alleles per locus using 20 microsatellite loci (Wang et al. 2020). In another study, 41 alleles, ranging from 2.8 to 4.2 alleles per locus, were generated with six microsatellite loci in H. brasiliensis (Yu et al. 2011). In Diospyros kaki Thunb. (Family: Ebenaceae), the number of alleles detected ranged from 2 to 17 with an average of 8.54 using 13 SSRs (Wang et al. 2021). Further, in the neighboring genus, more than 242 samples across eight populations of both Dipterocarpus costatus and Dipterocarpus alatus were genotyped at 9 loci, where an overall 26 and 28 alleles were detected with an average of 2.9 and 3.1 alleles per primer, respectively (Vu et al. 2019). These studies suggest that the more microsatellites are used in a genotyping-based study, the more polymorphic bands are detected. The current study revealed PIC values ranging from 0.020 to 0.554 with a mean value of 0.252 for S. robusta (Table 3), which is low when compared to tropical and subtropical species, such as Pinus cineraria (PIC = 0.49 to 0.78) (Rai et al. 2017), D. costatus (PIC = 0.317) (Vu et al. 2019), S. persica (PIC = 0.630) (Monfared et al. 2018), and D. kaki (PIC = 0.7306) (Wang et al. 2021), but higher than D. alatus (PIC = 0.216) (Vu et al. 2019).

Population genetics and diversity studies are mainly based on estimating allele and genotype frequencies and the changes caused by evolutionary forces, namely gene flow, mutation, genetic drift, and natural selection (Eriksson et al. 2001). It is necessary to assess the levels of genetic variation within and among populations to understand the evolutionary biology of a species and its potential for tree improvement (Escuderoa et al. 2003). The key measures of genetic diversity are observed (Ho) and expected heterozygosity (He) (Sherif and Alemayehu 2018), where He is considered the most suitable measure for characterizing marker loci among the different genotypes of a species (Monfared et al. 2018; Xue et al. 2018). To date, genetic analyses of S. robusta have been conducted using isozymes and ISSR markers in Nepal and India, respectively (Suoheimo et al. 1999; Surabhi et al. 2017). Moreover, a few microsatellite studies have also been conducted on this species (Pandey and Geburek 2009, 2010, 2011). Our estimates of heterozygosity and number of alleles (Ho = 0.021–1.000, He = 0.020–0.596, and Na = 2.44) are comparable with the range found in S. robusta (Ho = 0.49–0.77; He = 0.52 to 0.89, and Na = 11.80) in Nepal (Pandey and Geburek 2009) and Shorea guiso (Ho = 0.20–0.90; He = 0.66–0.87, and Na = 15.67) in the Philippines (Tinio et al. 2014). These measures were also compared with those of members of the same family, such as S. curtisii (He = 0.64, Na = 7.9) (Ujino et al. 1998), Neobalanocarpus heimii (Ho = 0.67, He = 0.78, and Na = 8.8) (Konuma et al. 2000), Dryobalanops aromatica (Ho = 0.49, He = 0.71, and Na = 5.1) (Lim et al. 2002), and S. leprosula (Ho = 0.64, He = 0.70, and Na = 11.4) (Ng et al. 2004). Diversity measures (Ho = 0–0.755; He = 0.255–0.757) were also estimated with microsatellite markers in the neighboring genus, for H. hainanensis in China (Wang et al. 2020), and for D. alatus (gene diversity (H) = 0.223) and D. costatus (gene diversity (H) = 0.152) in Vietnam (Vu et al. 2019), which are close to the measures estimated in the current study on S. robusta.

It has been suggested that an FST value below 0.05 indicates little genetic differentiation (Wright 1978; De Vicente et al. 2004), which implies very low genetic differentiation (FST = 0.029) among the S. robusta populations. Given the negative inbreeding coefficient (FIS = −0.109) and low FST, population structuring and inbreeding depression were virtually not observed. The lack of significant pair-wise FST indicates pronounced gene flow among populations, owing to the absence of prominent physical barriers such as mountain ridges (Pandey and Geburek 2009) in the sampled area. A high rate of gene flow homogenizes the genetic differences among populations, even in the presence of intense selection (Zucchi et al. 2005). Besides, this area is characterized by continuous forests and a gregarious distribution of S. robusta, assisted by cross-pollination, which supports high gene flow. Similarly, low FST (0.024) and low FIS (0.09) indicated little genetic divergence among 15 continuous and disjunct populations of this species in Nepal (Pandey and Geburek 2009). The outcomes of the genetic diversity study were also supported by the structure analysis, which showed a low K value (K = 2, default generated in case of low structuring; Supplementary Fig. 3(a-iii)), as populations are not clearly defined by any single cluster. This indicates that a single or a maximum of two ancestral gene pools may have resulted in significant genetic admixture throughout the geographical areas. Likewise, PCoA and UPGMA cluster analysis revealed similar grouping, which tends to corroborate the low value of FST.

Conclusions

The study demonstrates that the SSR marker technique is a powerful tool for evaluating genetic diversity and relationships among the natural populations of S. robusta. The findings also reveal the utility of microsatellite markers for estimating the genetic diversity of this species. The novel set of genomic SSR markers for S. robusta, reported here for the first time, may serve as a useful tool for the conservation and management of Dipterocarpaceae. For conservation purposes, future molecular studies should cover the entire distribution range of the species, where the SSRs developed here might play a profound role in ascertaining biodiversity hotspots.