Background

Evaluation of genetic diversity and population structure has significant implications for genetic improvement in plant breeding. It is well established that the genetic basis of biological organisms is encoded in the genome sequence, and that base-pair substitutions, insertions, deletions, and other alterations give rise to genetic diversity; the diversity of organisms is manifested through phenotypic, chromosomal and proteomic differences. DNA molecular markers, which offer stable performance, high polymorphism and other useful properties, are increasingly employed in taxonomic, evolutionary, breeding, and cloning studies. Different molecular markers, and different primers for the same marker, may amplify distinct regions of the genome. In theory, the more polymorphic markers are used, the more of the genome the amplified regions cover and the more accurate the results.

EST-SSR (expressed sequence tag-simple sequence repeat) molecular markers have been widely used in many species and for many applications, such as genetic linkage mapping, comparative mapping, and evaluation of genetic diversity [1,2,3,4,5]. SRAP (sequence-related amplified polymorphism) was first used on Brassica in 2001 by Li G [6]. Analyses of the genetic diversity and population structure of Camellia sinensis using SRAP [7,8,9,10,11,12] have already been reported. The SCoT (start codon targeted polymorphism) marker is based on the Kozak sequence pattern and was developed after the sequences flanking the initiation codon ATG (+1, +2, +3) were found to be conserved: positions +4, +7, +8, and +9 are occupied by nucleotides G, A, C, and C, respectively. These seven nucleotides are generally conserved, and G is the usual nucleotide at positions −3, −6, and −9. Primers can therefore be designed from the conserved initiation sequence, and the SCoT marker allows single-primer amplification of the region between two genes. Bertrand et al. first applied this marker to Oryza sativa [13]. More recently, SCoT markers have been used to assess the genetic diversity of plant species such as Saccharum spontaneum L. [14], Dactylis glomerata [15], Mangifera indica [16], Arachis hypogaea [17], Saccharum officinarum [18], Podocarpus macrophyllus [19] and Paeonia suffruticosa [20]. Nevertheless, no similar study has been conducted on Camellia sinensis. The tea plant is an allogamous species; in theory, after prolonged spontaneous hybridization, its genetic background should have become increasingly complex.

China is one of the main sources of tea germplasm. Currently, the tea planting area covers 1,100,000 ha, with different regions growing different types and varieties of tea according to their topographic, soil, and climatic characteristics. Xinan, Huanan, Jiangnan, and Jiangbei are the four main tea-planting districts in China. The Qinba area belongs to the Jiangbei district. In this research, 50 tea varieties, including those collected from different districts, common tea plant species, and local species of the Qinba area, were genotyped with EST-SSR, SRAP, and SCoT markers. We constructed three types of molecular marker datasets, which have important applications in diversity analysis, marker efficiency analysis, and correlation analysis among these marker systems. Our study established the population structure of these varieties and provides significant insights into the selection of molecular markers for tea plant breeding.

Results and discussion

Marker efficiency analysis

In this study, three types of molecular markers were used to differentiate tea plant accessions. A total of 1072 bands were produced using 118 primers or primer pairs. Based on a screen of six genotypes, 38 SCoT, 40 SRAP and 40 EST-SSR primers were selected for further study according to the percentage of polymorphic bands (PPB), the polymorphism information content (PIC) and band clarity (Table 1). A total of 414, 338, and 320 bands were obtained from the 50 test materials using SCoT, SRAP and EST-SSR markers, respectively; these included 398, 302, and 301 polymorphic bands, giving PPBs of 96.13%, 89.35%, and 94.06%. Comparisons of the three types of markers are shown in Table 2. SCoT markers had the highest marker efficiency and are excellent for the appraisal of polymorphic loci, although their polymorphism information content was lower than that of EST-SSR.
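The PPB values above follow directly from the band counts. The short Python sketch below recomputes them as a sanity check; the dictionary layout and function name are illustrative assumptions, and only the band counts come from the text.

```python
# Band counts reported in the text: (total bands, polymorphic bands).
band_counts = {
    "SCoT": (414, 398),
    "SRAP": (338, 302),
    "EST-SSR": (320, 301),
}


def ppb(total_bands, polymorphic_bands):
    """Percentage of polymorphic bands (PPB)."""
    return 100.0 * polymorphic_bands / total_bands


# PPB per marker type, left unrounded so rounding can be inspected.
ppb_by_marker = {name: ppb(total, poly)
                 for name, (total, poly) in band_counts.items()}

# Totals across the three marker systems.
total_bands = sum(total for total, _ in band_counts.values())
total_polymorphic = sum(poly for _, poly in band_counts.values())

for name, value in ppb_by_marker.items():
    print(f"{name}: {value:.3f}%")
print("total bands:", total_bands, "polymorphic:", total_polymorphic)
```

The recomputed percentages agree with the reported ones to within rounding of the second decimal place, and the three band totals sum to the 1072 bands used in the later analyses.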

Table 1 Amplification results of EST-SSR, SRAP, and SCoT primers
Table 2 Comparison of the efficiency of EST-SSR, SRAP, and SCoT primers

Correlation analysis among genetic distance matrices from the three types of marker datasets

Mantel tests [21] were used to measure the correlation between the genetic distance matrices generated by SCoT, SRAP and EST-SSR molecular markers. Correlation was classed as significant (r ≥ 0.9), moderate (0.8 ≤ r < 0.9), weak (0.7 ≤ r < 0.8), or absent (r < 0.7). In the present study, the correlation coefficients (r) between the genetic distance matrices of SCoT and EST-SSR markers, SCoT and SRAP markers, and SRAP and EST-SSR markers were 0.19, 0.17, and 0.01, respectively (Fig. 1), all well below 0.7 and therefore indicating no correlation. Different molecular markers, and different primers of the same marker, all yielded distinct amplification products that reflect the polymorphism of different genomic regions; hence, different marker design strategies produce different results. In theory, the validity of the results should improve as the number of markers and the coverage of the genome increase. Therefore, we combined the three types of molecular markers to generate 1072 bands for the genetic constitution analyses.

Fig. 1 Correlation between the genetic distance matrices, assessed using Mantel tests

Genetic constitution analysis

Analysis using STRUCTURE

One thousand seventy-two polymorphic bands with MAF (minor allele frequency) < 5% were used to elucidate the population structure of the entire pool of tea germplasms. STRUCTURE 2.3.4, which applies a Bayesian clustering algorithm, was used to simulate the population genetic structure under the assumption that the 1072 loci were independent. Using a membership probability threshold of 0.60, K values from 1 to 10 were simulated with 20 iterations for each K, each with a burn-in of 10,000 iterations followed by 10,000 Markov chain Monte Carlo (MCMC) iterations, in order to estimate the most probable number of populations. Delta K was plotted against K; the best number of clusters was determined following the method proposed by Evanno et al. [22] and obtained via the Structure Harvester platform (http://taylor0.biology.ucla.edu/structureHarvester/). Delta K reached its maximum at K = 2, suggesting that the 50 tea germplasms were best divided into two subgroups (Fig. 2).
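To illustrate how Delta K singles out a value of K, the Python sketch below computes it from replicate log-likelihoods. It assumes the common formulation in which Delta K is the absolute second-order difference of the mean ln P(D) divided by the standard deviation of ln P(D) at K. The log-likelihood values are hypothetical and are not taken from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: Delta K} from {K: [ln P(D) of each replicate run]}.

    Delta K is only defined for K values that have both neighbours
    (K - 1 and K + 1) and a non-zero standard deviation.
    """
    avg = {k: mean(runs) for k, runs in lnp_by_k.items()}
    sd = {k: stdev(runs) for k, runs in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in avg and k + 1 in avg and sd[k] > 0:
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            result[k] = second_diff / sd[k]
    return result


# Hypothetical ln P(D) values, three replicate runs per K (illustration only).
example_lnp = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4450.0, -4460.0, -4440.0],
    4: [-4430.0, -4445.0, -4415.0],
}

dk = delta_k(example_lnp)
best_k = max(dk, key=dk.get)
print(dk, "best K:", best_k)
```

In this made-up example the likelihood rises sharply from K = 1 to K = 2 and flattens afterwards, so Delta K peaks at K = 2. Note that Delta K cannot be evaluated at the smallest or largest K tested.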

Fig. 2 STRUCTURE analysis of the number of populations (K). The number of subpopulations (K) was identified based on maximum likelihood and K values. The most likely value of K identified by STRUCTURE was K = 2. Note: green bands, Group 1; red bands, Group 2. The proportion of each color reflects the probability that each of the test materials (numbered from 1 to 50) belongs to the corresponding group

UPGMA clustering

A dendrogram was constructed by cluster analysis using the unweighted pair-group method with arithmetic means (UPGMA), which showed that the 50 genotypes could be clearly divided into two groups (Fig. 3). Group I included 27 varieties, and group II contained 23 varieties. The average similarity coefficient was 0.74. The two most closely related materials were 15 and 16, which are sister lines with a genetic similarity coefficient of 0.93.
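As a rough illustration of how UPGMA builds such a dendrogram, the Python sketch below repeatedly merges the two closest clusters and averages distances weighted by cluster size. It is a generic textbook-style implementation, not the NTSYS-pc routine used in the study. The four-item distance matrix (distance taken as 1 − similarity) is hypothetical.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Agglomerative UPGMA clustering on a symmetric distance matrix.

    Returns a list of merges (cluster_a, cluster_b, distance), where each
    cluster is a tuple of the original labels it contains.
    """
    clusters = {i: (labels[i],) for i in range(len(labels))}
    dist = {(i, j): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # Size-weighted average distance from the new cluster to the others.
        new_dist = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        merges.append((clusters[a], clusters[b], d_ab))
        dist = {pair: v for pair, v in dist.items()
                if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        next_id += 1
    return merges


# Hypothetical distances (1 - similarity) among four accessions.
example_labels = ["A", "B", "C", "D"]
example_matrix = [
    [0.00, 0.07, 0.30, 0.34],
    [0.07, 0.00, 0.32, 0.36],
    [0.30, 0.32, 0.00, 0.20],
    [0.34, 0.36, 0.20, 0.00],
]

example_merges = upgma(example_labels, example_matrix)
for left, right, d in example_merges:
    print(left, "+", right, "at distance", round(d, 3))
```

Here A and B, with a similarity of 0.93, join first, C and D join next, and the two pairs are merged last, which is the kind of two-group topology described above.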

Fig. 3 Cluster dendrogram of the 50 tea genotypes constructed by UPGMA from EST-SSR, SRAP and SCoT data

Principal components analysis

The top three principal components (PCs) were used to analyze population structure. Principal component analysis (PCA) was conducted in NTSYS-pc2.10e [23]. The three PCs had contribution rates of 15.97%, 8.50% and 6.17%. PCA separated the 50 genotypes into two major groups (Fig. 4), consistent with the STRUCTURE and UPGMA results. Group I consisted of 18 genotypes (Fig. 4, left), and the other 32 genotypes belonged to group II (Fig. 4, right).

Fig. 4 PCA plots based on the first three components

The analyses performed using STRUCTURE, UPGMA and PCA yielded similar results, clustering the 50 genotypes into two sub-populations. Notably, the PCA results were highly consistent with those from STRUCTURE. The UPGMA results differed slightly from those of STRUCTURE and PCA (Table 3); bold numbers in group 1 by UPGMA mark the differences between the STRUCTURE and PCA results and the UPGMA results.

Table 3 Comparison of the clustering by STRUCTURE, PCA and UPGMA

Conclusions

This is the first report of the use of SCoT markers to analyze the genetic diversity of tea germplasms. The results showed that SCoT markers revealed high genetic diversity among tea resources. In the future, we plan to select core SCoT markers. Different kinds of molecular markers can reveal different and complementary information about the same genome. Thus, we highly recommend using more marker types for a comprehensive evaluation of genetic diversity and structure. The 50 accessions were clustered into two sub-populations based on STRUCTURE, UPGMA and PCA; there were no obvious differences between imported and local germplasms. The genes of exotic varieties have been constantly integrated into the gene pool of Qinba tea through long-term (20–25 years) tea breeding and production activities. Breeding has emphasized the selection of varieties with economic traits, resulting in the loss of some tea resources and a decrease in genetic diversity; thus, it is necessary to introduce new tea resources in order to broaden the genetic diversity.

Methods

Plant materials

A total of 50 tea plant genotypes, representing most tea germplasm of the Qinba area in China, were collected from the tea experimental farm of the Hanzhong Institute of Agricultural Sciences during the 2016 growing season (Table 4).

Table 4 The 50 tea plant samples used for marker (EST-SSR, SRAP and SCoT) genotyping

DNA extraction and marker genotyping

Genomic DNA was extracted from fresh leaves of each individual using a modified CTAB method and checked by 0.8% agarose gel electrophoresis. Each PCR contained 2 × Taq Master Mix (7.5 μL), forward and reverse primers (1 μL each, 2 μL for SCoT primers), RNase-free water (3.5 μL), and tea genomic DNA (2 μL). To improve amplification, a two-stage annealing temperature program was used: initial denaturation at 94.0 °C for 5 min; 10 cycles of denaturation at 94.0 °C for 1 min, annealing at 60.0 °C for 1 min, and extension at 72.0 °C for 1 min; then 35 cycles of denaturation at 94.0 °C for 30 s, annealing at 35 °C for 30 s, and extension at 72.0 °C for 1 min. A final extension of 10 min was followed by storage at 4.0 °C. The selected primers were synthesized by Shanghai Sangon Biological Engineering Technology and Service Company (Shanghai, China). Initially, six germplasms (LongJing, ShanCha1, ChunBoLu, BeiBa11–6, Ning13–6, ZaoBaiJian) were used to screen markers for high polymorphism. Then, 40 primer pairs each for EST-SSR and SRAP and 38 SCoT primers, all giving clear and highly polymorphic bands, were selected from 154 EST-SSR pairs, 154 SRAP pairs, and 125 SCoT primers. Electrophoresis was performed on 8% non-denaturing polyacrylamide gels at 160 V; the bands were visualized by silver staining.

Genetic variation and marker efficiency analysis

Following electrophoresis, each amplified band was taken to correspond to a primer hybridization locus and was treated as a valid molecular marker. Each polymorphic band detected by the same primer represented an allelic mutation. To generate the molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1 (presence of a fragment), 0 (absence of a fragment), or 9 (complete absence of bands). Excel was used to compute the marker index (MI) for each of the three marker types, and the marker frequencies of the three types were compared. MI values were obtained from the average band informativeness (Ibav) and the effective multiplex ratio (EMR); EMR represents the number of polymorphic loci and Ibav is given by the following formula:

$$ {Ib}_{av}=\frac{1}{n}\sum \limits_{i=1}^n\left(1-\left(2\left|0.5-{P}_i\right|\right)\right), $$

where Pi represents the proportion of samples in which the ith amplified locus (band) is present and n represents the total number of amplified loci. Using the method reported by Smith et al. [24], the value of the polymorphism information content (PIC) was calculated with the formula:

$$ PIC=1-\sum \limits_{i=1}^n{P_i}^2-\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^n2{P_i}^2{P_j}^2, $$

where Pi and Pj represent the frequencies of the ith and jth alleles at a given locus and n is the number of alleles at that locus. The value of PIC varies from 0 to 1, with 0 indicating an absence of polymorphism at a given locus and values approaching 1 reflecting multiple alleles at a given locus. The level of polymorphism of each marker was assessed by the polymorphism information content (Botstein et al. [25]), which measures the extent of genetic variation: PIC values smaller than 0.25 indicate a low level of polymorphism at a locus, PIC values between 0.25 and 0.5 imply a moderate level of polymorphism, and PIC values greater than 0.5 indicate a high level of polymorphism.
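The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Excel workflow. It assumes that a score of 9 is excluded when computing band frequencies and that MI is the product of Ibav and EMR; the text states only that MI was obtained from these two quantities. All function names and example values are hypothetical.

```python
MISSING = 9  # score used in the data matrix for "complete absence of bands"


def band_frequency(scores, missing=MISSING):
    """Fraction of scored samples (1/0) in which the band is present."""
    scored = [s for s in scores if s != missing]
    return sum(scored) / len(scored)


def average_band_informativeness(band_frequencies):
    """Ibav = (1/n) * sum(1 - 2 * |0.5 - P_i|) over the n bands."""
    values = [1.0 - 2.0 * abs(0.5 - p) for p in band_frequencies]
    return sum(values) / len(values)


def marker_index(ib_av, emr):
    """MI taken here as Ibav * EMR (an assumption, see lead-in)."""
    return ib_av * emr


def pic(allele_frequencies):
    """PIC = 1 - sum(P_i^2) - sum over i<j of 2 * P_i^2 * P_j^2."""
    sum_sq = sum(p * p for p in allele_frequencies)
    cross = 0.0
    for i, p_i in enumerate(allele_frequencies):
        for p_j in allele_frequencies[i + 1:]:
            cross += 2.0 * (p_i * p_i) * (p_j * p_j)
    return 1.0 - sum_sq - cross


def pic_level(value):
    """Classify a PIC value using the thresholds given in the text."""
    if value < 0.25:
        return "low"
    if value <= 0.5:
        return "moderate"
    return "high"


# Hypothetical example values.
example_freq = band_frequency([1, 0, 1, 9, 1])  # 3 of 4 scored samples
example_ibav = average_band_informativeness([0.5, 0.9, 0.1, 0.7])
example_mi = marker_index(example_ibav, emr=10)
example_pic = pic([0.5, 0.5])
print(example_freq, example_ibav, example_mi,
      example_pic, pic_level(example_pic))
```

A band present in exactly half of the samples contributes the maximum informativeness of 1, whereas bands that are nearly always present or nearly always absent contribute little. A locus with two equally frequent alleles has a PIC of 0.375, which is a moderate level of polymorphism under the thresholds above.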

Correlation analysis among genetic distance matrices from the three types of marker datasets

The Mantel test was carried out with the batch file of the NTSYS-pc2.10e software.
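For readers unfamiliar with the Mantel test, the Python sketch below shows its general logic: the off-diagonal entries of two distance matrices are correlated, and significance is assessed by jointly permuting the rows and columns of one matrix. It is a generic illustration rather than the NTSYS-pc implementation. The matrices, the number of permutations and the random seed are hypothetical.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mantel(m1, m2, permutations=999, seed=0):
    """Return (r, one-sided p-value) for two symmetric distance matrices."""
    n = len(m1)
    pairs = list(combinations(range(n), 2))
    x = [m1[i][j] for i, j in pairs]
    observed = pearson(x, [m2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        # Permute the labels of the second matrix (rows and columns together).
        rng.shuffle(order)
        y = [m2[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    p_value = (hits + 1) / (permutations + 1)
    return observed, p_value


# Hypothetical matrices: the second is the first scaled by 2, so r is 1.
example_m1 = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.6],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.6, 0.2, 0.0],
]
example_m2 = [[2.0 * v for v in row] for row in example_m1]

example_r, example_p = mantel(example_m1, example_m2, permutations=199)
print(round(example_r, 3), example_p)
```

With real marker data, each matrix would hold the pairwise genetic distances among the 50 accessions for one marker system, and r would be compared against the thresholds given in the Results.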

Genetic constitution analysis

STRUCTURE v2.3.4 was used to assess the population structure of the 50 tea genotypes with 1072 loci. The number of sub-populations (K) was set from 1 to 10 under the admixture model with correlated band frequencies. Genetic similarity coefficients were computed using the SM (simple matching) option of the NTSYS-pc2.10e software, cluster analysis was conducted using the UPGMA method, and principal component analysis was performed using the batch file in NTSYS-pc2.10e.
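The simple matching coefficient counts the proportion of bands on which two accessions agree, whether both show the band or both lack it. The Python sketch below computes it for binary band profiles. Skipping positions scored as 9 is an assumption made for illustration and may differ from how NTSYS-pc treats missing data. The band profiles are hypothetical.

```python
MISSING = 9  # score used for "complete absence of bands"


def simple_matching(a, b, missing=MISSING):
    """SM = matching positions / compared positions, skipping missing data."""
    compared = [(x, y) for x, y in zip(a, b)
                if x != missing and y != missing]
    if not compared:
        raise ValueError("no comparable bands between the two profiles")
    matches = sum(1 for x, y in compared if x == y)
    return matches / len(compared)


def similarity_matrix(profiles):
    """Pairwise SM coefficients for a list of band profiles."""
    n = len(profiles)
    return [[simple_matching(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical band profiles for three accessions (1 present, 0 absent).
example_profiles = [
    [1, 0, 1, 1, 0, 9],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
]

example_sm = similarity_matrix(example_profiles)
for row in example_sm:
    print([round(v, 2) for v in row])
```

A similarity matrix of this kind, or the corresponding distances 1 − SM, is the input to the UPGMA clustering described above.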