Association Mapping of Thousand Grain Weight using SSR and SNP Markers in Rice (Oryza sativa L.) Across Six Environments

Thousand grain weight (TGW) is an important determinant of rice yield, and correlates with grain size, plumpness and grain number per panicle. In rice, there are fewer association mapping studies relating grain weight traits using both SSR and SNP markers. In this study, in order to find robust SSR markers associated with TGW trait and mine elite accessions in rice, we investigated the TGW trait across six environments using a natural population consisted of 462 accessions, and then performed association mapping using both SSR and SNP markers. Using the six datasets from the six environments and their best linear unbiased estimator, we identified eight TGW associated SSR markers, with three environmentally stable and one newly found, on five chromosomes. The associated markers have genetic effect from 3.44% to 20.84%, and two of them carry stable elite allele with positive effect across different environments. Candidate interval association mapping using re-sequencing derived SNP/InDel markers further confirms the TGW-SSR association, and also suggests that 3 TGW-SSR associations were high confident in intervals of size from 176 to 603 kb. These results not only shed more lights on the genetics of TGW trait, but also suggest that the multi-allelic SSR markers should be used as an alternative power tool in gene or QTL mapping.


Introduction
Current food production is becoming limited because of shortages of cropland, water, and shortages of fertilizers that depend on fossil energy (Pimentel 2012). Today, used as the main food for more than half of the world's population, rice is expected to give higher yield to meet the needs of the increasing population (Fageria 2007).
Crop yield is a quantitative trait and has complex genetic background. For rice, the panicle number, grain number, and grain weight are three main components of the yield. Beyond the panicle number per unit area and grain number per panicle, the improvement of grain weight is the major way to further increase the yield (Xing and Zhang 2010).
Grain weight is a synthetic reflect of grain length, width, thickness and plumpness. All these traits have been comprehensively studied using bi-parental populations through QTL mapping Li et al. 2004Li et al. , 2020Weng et al. 2008;Wang et al. 2011;Tang et al. 2013;Xu et al. 2015;Dong et al. 2018;Bazrkar-Khatibani et al. 2019), and hundreds of QTLs related to these traits were reported and stored in the Gramene QTL Database (Gupta et al. 2016).
Besides QTL mapping, linkage disequilibrium (LD) based association mapping was used to investigate the genetics of grain weight-related traits in rice, and SNP markers derived from gene chips or genome sequencing were preferred (Huang et al. 2010;Huang et al. 2011;Zhao et al. 2011;Yang et al. 2014;Begum et al. 2015;Liu et al. 2019). However, SSR markers have many advantages comparing with SNP markers, such as highly polymorphic nature, ease of development, and easy-using, and thus is essential for marker assistant selection (MAS) and crop improvement (Collard and MacKill 2008;Miah et al. 2013).
In rice, maybe due to a low automatic level and laborintensive, there is relatively fewer association mapping studies relating grain weight trait using SSR in large population. In order to find robust SSR markers associated with TGW trait and mine elite accessions in rice, we performed a TGW-SSR association analysis cross six environments, and further confirmed the result using SNP markers. Our study is featured by 1) very high-quality phenotypes of TGW trait, 2) genetic distinct germplasm lines, and 3) super polymorphic SSR markers. The result not only gives invaluable SSR marker which can be used in MAS, but also suggests that combining SSR and SNP markers should be a lower-cost way in finding trait-marker associations.

The Phenotype Evaluation of TGW
All the 462 rice accessions, consisted of 340 lines from China, 1 line from Japan, and 121 lines from Vietnam, were planted in six environments, i.e., three locations in two years (Table S1). In each environment, the lines were planted in four replications and phenotyped separately for each replication. As a result, 24 datasets (3 locations × 2 years × 4 replications) were obtained for TGW (Table S2). To get a good proxy of all the information in the 24 datasets, the BLUE (best linear unbiased estimator) of TGW was calculated (Table S2).
In each environment, the four replications agree well with each other, and the correlation coefficient varies from 0.92 to 0.99 between replications (Table S3). Thus, a total of 7 datasets, including the mean values of the replications in each of the six environments and the BLUE, were used for the following analysis (Table 1; Table S2). The minimum, maximum, mean and median of the 7 datasets are in range of 15. 62-17.26, 31.87-35.50, 23.74-24.37 and 23.54-24.37 g, respectively (Table 1).
To analyze the stability of TGW under different environments, the correlation coefficient among the six environments was calculated. The result showed the correlation coefficient of TGW has a range of 0.58-0.78 among the six environments, and all the environments, except for NJ14, have a correlation coefficient ≧0.83 with the BLUE (Table 1). Meanwhile, the distribution of TGW trait in the natural population consisted of 462 rice accessions fitted the normal distribution by normal distribution test (P > 0.01) ( Table 1, Fig. 1), indicating that the trait is controlled by polygenes. The broad-sense heritability of TGW in our population was 80.44% calculated by using the datasets from the six environments.

The Marker Polymorphism and LD Analysis
To analyze the genetic diversity of the above 462 accessions, a total of 264 evenly distributed SSR markers were genotyped (Tables S4, S5). Of these, 261 markers, with 17 to 29 markers per chromosome, showed polymorphism among the accessions, and a total of 3361 alleles, ranging from 2 to 47 with an average of 12.88 per locus, were found (Table 2).
For the polymorphic markers, the statistical parameters were similar among the 12 chromosomes (Table S4). The major allele frequency has a range of 0.11-0.98, with the median and mean of 0.28 and 0.35 respectively while the minor allele frequency has a range of 0.01-0.41, with the median and mean of 0.14 and 0.16 respectively. These markers were very informative, 219 markers (83.91%) has a PIC (polymorphic information content) value > 0.5, only 6 markers were slightly informative (PIC < 0.2), and three marker (RM129, RM1108 and RM512) do not show polymorphism (PIC = 0) ( Table S4). The PIC value was negatively correlated with the major allele frequency (r = -0.98, P < 0.001) and positively correlated with allele number (r = 0.76, P < 0.001).
To evaluate the linkage relationship of the markers, allelelevel pairwise linkage disequilibrium for the 261 polymorphic markers in the 462 accessions were analyzed. As a result, 7,521,191 allele pairs were found. Of these, 766,571 (10.19%) pairs, including 64,071 pair within the same chromosome, showed strong LD (P < 0.001) ( Table S6). For the allele pairs in strong LD, the minimum, maximum, median and mean of r 2 were 0.018, 1.00, 0.06 and 0.13, whereas the allele frequency is 0.001, 0.99, 0.08 and 0.11.

Population Structure Analysis
Before conducting the marker and trait association analysis, we estimated the population structure using the polymorphic markers. The most likely number of subpopulations (K) in the 462 accessions was estimated to be 3 based on the delta K value (Fig. 2a). However, there was also a signal on K = 5 (Fig. 2a). To further confirm the population structure of the accessions, a NJ (neighbor joining) tree was constructed using the SSR markers. Four main branches (I, II, III and IV), as well as sub-branches in each main branch, were evident in the tree (Fig. 2b). Since more than 3 branches were observed in the NJ tree, we selected K = 5 for the population structure, in which 172, 48, 98, 96 and 48 accessions were grouped into population 1 to 5 respectively (Fig. 2c). Vietnam accessions were mainly grouped into subpopulation 1, 3 and 4. A striking feature of materials was that their admixture level (the fraction of genome from different subpopulations) was quite low, 446 (92.53%) accessions inherited more than 90% fraction of their genome from their belonging subpopulation (Fig. 2c). Comparing the results of the Structure program and the NJ tree, subpopulation 2, 3 and 5 from the Structure result were mainly in branch I, IV and III, and subpopulation 1 and 4 were scattered in branch I & II, and III & IV respectively (Fig. 2b).

Association Mapping of SSR Marker and TGW
To set the proper cut-off value (significance threshold) for the association mapping, 2000 times permutation tests were performed by reshuffled the original phenotype data and then association mapping. As a result, the 1% and 5% cut-off levels were set for the above 7 phenotype datasets respectively for the real marker-phenotype association (Table 3).  In total, 19 marker-TGW association pairs, with marker R 2 ranged from 3.44% -20.84%, were found, and 8 markers distributed in 5 chromosomes were involved (Table 4, Fig. 3). Of these association pairs, 5 were significant at the 1% level, with the rest at the 5% level. Of the 7 datasets, no marker-TGW association pairs was found in dataset the NJ14, while 1 to 5 pairs in the other datasets. This result was consistent with the phenotype analysis result that the dataset NJ14 has a lower quality suggested by the correlation analysis. Of the 8 associated markers, both RM566 and RM3600 were identified in 4 datasets, and thus should be stable; and RM259 had the largest effect with R 2 ranged from 12.38% in dataset YY14 to 20.84% in XY13 (Table 4).
To mine elite alleles for TGW trait, the allele effect of the above associated markers was predicted (Table S7). As a result, 108 negative values were detected, comparing 42 positive values, suggesting that most alleles give a negative effect. Interestingly, there were 2 alleles, i.e. 237/237 bp (base pair) in marker M566, and 150/150 bp in marker RM3600, give stable positive effect in different environments, and there are 102 and 29 accessions carrying the 2 alleles respectively. The average TGW of the 102 and 29 accessions were 24.87 (calculated from the datasets XY13, XY14 and YY14) and 24.61 (calculated from the datasets XY13, YY13 and YY14) respectively, and both were significantly larger than the global average value 24.01 (student t-test, P < 1E-03).

Candidate Interval Based Association Mapping of TGW Trait using SNP Markers
To further confirm the association mapping result above and give a higher resolution of the associated loci, the association mapping between the SNP/InDel variants flanking the target markers and TGW trait was performed using 57 accessions (Table 5). Of the 8 associated markers, 7 (except RM244) were properly aligned to the reference genome, and the sequences with length ± 500 kb (kilo bases) flanking the markers were retrieved. Thus, a total of 7 MB genome segment were analyzed, and total of 98,599 variants, were identified (Table 5, Table S8). Based on cut-off values from permutations (Table 3), a total of 1,313 loci were found to significantly associated with TGW in at least one of the 7 datasets at the significance 1% or 5% (Table 5, Table S9, Fig S1). The association mapping result showed that significant association signals exist in each of the intervals flanking the 7 markers, but most loci were located in the interval flanking the marker RM486, RM566 and RM255, while less were found in the interval flanking the rest of the markers ( Table 5, Table S9). For the marker RM486, RM225 and RM566, most significant loci were located in the interval 34,840-35,443 (603 kb in length), 3,009-3,292 (283 kb), and 14,602-14,778 kb (176 kb), respectively (Table S9).

Discussion
Despite of the controversy over the origins of rice (Fuller et al. 2007), the Asian-cultivated rice accessions occupy an important position, and is grown worldwide and comprises the staple food for half of the global population. Of the 462 accessions used in this study, part was from Vietnam (latitudes below 17°N), and part was from northeastern China and Japan, (latitudes above 45°N) (Table S1). We observed little genetic admixture among the accessions (Fig. 1), and this situation agrees well with previous studies Dang et al. 2016;Edzesi et al. 2016;Liu et al. 2017). This result may partly because that a fewer marker used in inferring the population structure, and some admixture may be missed out. However, the distinct genetic background is a true reflection of the feature of our materials. The 2 parts of our materials belongs to the subspecies indica (accessions from Vietnam) and japonica (accessions from northeastern China and Japan) respectively. The two subspecies were rarely hybridized due to different flowering time (Liu et al. 1996). Therefore, these germplasm lines are invaluable for genetic studies and rice breeding. The high polymorphic feature of our materials was also reflected by our SSR result. We found 2 to 47 alleles with an average of 12.88 per locus, and the average PIC for all markers was 0.72. The polymorphic level of our population is higher than that from Garris and Dang et al. (2015). TGW is an important trait for many crops, such as maize, rice, soybean and wheat, and is quantitative trait which can be easily affected by environmental factors. In order to find robust SSR markers associated with TGW trait and mine elite lines in rice, we used 462 rice accessions and phenotyped them in 6 environments. A total of 24 datasets (including replications) were obtained, and an incredible consistency among these datasets indicating by high correlate coefficient was observed (Table 1, Tables S2, S3). There were some studies that studied TGW gene mapping, but usually used datasets from 3 to 4 environments (Dang et al. 2015;Qiu et al. 2015;Feng et al. 2016). Given the development of the high-throughput and low-cost genotyping technology, phenotyping will become more and more important. In our association mapping, 7 datasets were used, and all the datasets fit a normal distribution well (Table 1, Fig. 1). Comparing with other studies, our high quality phenotype datasets would lay a good foundation for the following QTL mapping. Robust marker-trait association is essential in MAS. In this study, we found 8 SSR markers associating with TGW trait (Table 4). Of the 8 markers, RM3600 was a newly found; while the other 7 markers, i.e., RM259, RM486, RM218, RM225, RM314, RM566 and RM244, were reported to associate with TGW by previous studies (Moncada et al. 2001;Cui et al. 2002;Hua et al. 2002;Zhuang et al. 2002;Gao et al. 2004;Jiang et al. 2004. However, none of these previous studies covered all the markers in one studies. Interestingly, these markers were also reported to associated with grain number (marker RM259) (Hua et al. 2003), grain yield (marker RM486 and RM314) (Moncada et al. 2001), and biomass yield (marker RM225) (Lian et al. 2005). Therefore, this result genetically explained why these trait correlates with each other.
Of the 462 accessions, there are 2 accessions, i.e. Z7238 and Z5164, have higher TGW values, and are slightly outlied of others (Fig. 1, Fig S2). The TGW phenotype of the 2 outliers is stable cross different environments (Fig S2), and thus should be controlled by genetical factors. In order to test whether our association result differ with or without the 2 outliers, we performed the SSR-TGW association after removing the 2 outliers. As a result, the result was very similar: 6 of the 8 markers, except RM314 and RM244, were detected (Table S10). This may partly because that an inappropriate cut-off (i.e., the cut-off listed in Table 3, which were derived from the permutation of all the samples) was used. To sum up, our result is robust to a few outliers, because there are at least 10 accessions for each allele.
Of the 8 TGW associated markers, 3 markers, i.e. RM259, RM566 and RM3600, were identified in 3 out of the 6 environments, and thus could be regarded as environmentally stable. Luckily, based on the stable markers, we identified 2 alleles, i.e. 237/237 bp in marker M566 and 150/150 bp in marker RM3600, give stable positive effect in different environments, and 102 and 29 elite accessions carrying the 2 alleles were found respectively (Table S7).
The re-sequencing data of 57 accessions was used for validating the SSR association mapping result and getting a higher resolution of the associated loci. The size of ± 500 kb was used because it is a safe size for linkage disequilibrium is rice (Mather et al. 2007). We performed a permutation in testing SNP/InDel-TGW association using the whole genome-wide SNPs (11.54 Million) in our  re-sequenced population (57 samples), few significant signals were detected (data not shown). However, when using the SNP/InDel variants in our candidate intervals, significant SNP-TGW associations were detected (Table 5). Due to LD of variants in association studies, the variants close to the true associated loci usually give strong association signals (Mather et al. 2007). Therefore, for our SNP/InDel-TGW association result, we proposed that the number of significant TGW-associated loci in the target interval was a good proxy to validate the SSR-TGW association result. We found 974, 123 and 189 unique loci in the intervals flanking the marker RM486, RM225 and RM566 respectively ( Table 5). This result suggests that true TGW-associated loci should exist in the intervals. Furthermore, this result narrowed down the candidate intervals to Chr1:34,443,Chr1:3,292,and Chr9:14,778 kb for marker RM486, RM225 and RM566 respectively (Table S9). Whereas the marker RM259, RM218 and RM3600, few associated loci were found in their intervals (Table 5), and the probable explanations could be 1) there is no true TGWassociated loci in the intervals, 2) the re-sequencing population is not large enough to get a significant result, and 3) the true TGW-associated loci is outside of the intervals.
In this study, we also planned to test that multi-allelic SSR markers have a higher discrimination power than biallelic SNP markers in association mapping. However, we tested the intervals flanking the 7 markers, and got significant association signals in every interval when using SNP markers (Table 5). It seems that the low level polymorphism limitation of SNP markers was compensated by their abundance in chromosomes. On the other hand, if in strong LD block, SNP markers indeed have lower discrimination power, which is a common and annoying problem in gene or QTL mapping (Tsykun et al. 2017). Of the 8 TGW associated markers we found, RM566 was in strong LD with 7 markers (i.e., RM3600, RM3533, RM6570, RM201, OSR28, RM1013, and RM3912-2), and RM244 was in strong LD with 9 markers (i.e., RM216, RM311, RM1125, RM258, RM269, RM6160, RM590, RM333 and RM6646) (Table S6). We successfully ruled out the LD markers from the truly associated ones in our result (Table 4). This result may suggest that the multi-allelic SSR markers were an alternative tool to conquer LD in gene or QTL mapping. However, further comprehensive simulations should be performed to prove this.

Plant Materials and Phenotyping of TGW
A total of 462 rice accessions, stored and supplied by State Key Laboratory of Crop Genetics and Germplasm Enhancement (Nanjing Agricultural University), were used as the plant materials. Of these, 121 were from Vietnam, and 340 were from China, 1 was from Japan (Table S1). These rice lines were representative germplasm resources of Asia and have been widely using in breeding research in China.
To evaluate the TGW phenotype, all the accessions were planted in 3 location for 2 years, i.e., Nanjing (the Experiment Farm, Nanjing Agricultural University, 32.01°N, 118.64°S) (short for NJ13 and NJ14), Xinyang (32.12°N, 114.067°S) (short for XY13 and XY14) and Yuanyang (35.06°N, 113.94°S) (short for YY13 and YY14) in 2013 and 2014. In each of the field experiments, 4 replications in randomized block design were used, and every line was planted in 5 rows, with 8 plants each row, 20 cm between rows and 17 cm between each plant. The plants were grown in normal rice growing season condition (from May to October each year) until totally mature. All the plants of each accession were harvested together, and the seed were dried for 4 days under sun. For the TGW phenotyping, 1000 plump grains were random selected and weighted. This phenotyping process was repeated for 3 times, and the average value was used for further analysis.

SSR Marker Genotyping and Variants Detection
The genomic DNA of each accession was extracted from young leaf tissue using DNeasy Plant Mini Kits (QIAGEN, Hilden, Germany) according to the kit manual instruction. A total of 264 SSR markers were selected from the Gramene database (https ://archi ve.grame ne.org/) according to their distribution on the rice 12 chromosomes (Tello-Ruiz et al. 2018) (Table S4). Primers were synthesized in Generay Biotech (Shanghai) Co., Ltd. (Shanghai, China). A 10 μl PCR reaction system was used: 10 mM tris-HCl (pH 9.0), 50 mM KCl, 1.5 mM MgCl 2 , 0.1% triton X-100, 0.5 nM dNTPs, 0.14 pM primers, 0.5U Taq polymerase, and 20 ng template DNA. The amplification was performed using a PTC-100™ Peltier thermal cycler (MJ research™ Incorporated, The USA) based on a standard PCR condition. The PCR products were run on 8% polyacrylamide gel at 150 V for 1-1.5 h, and visualized using silver staining. The size of PCR product was calculated by the software Quantity One.
The genome re-sequencing was performed on an Illumina Hiseq2000 platform (Illumina, San Diego, USA) according to its standard sequencing protocol, and pair-end library with insert length of 300 bp was used. The rice genome assembly IRGSP-1.0 (c.v. Nipponbare) was downloaded from NCBI (https ://www.ncbi.nlm.nih.gov/assem bly/GCF_00143 3935.1/) and used as the reference genome. Sequencing reads of the individuals were first filtered using Trimmomatic 0.33 (Bolger et al. 2014) and then mapped to the reference using BWA (Burrows Wheeler Aligner) 0.7.17 (Li and Durbin 2009). In both programs, default settings were used. After mapping, duplicates marking, base quality re-calibrating, SNP/InDel joint calling were performed according to GATK (Genome Analysis Toolkit) best practice in GATK 3.8 ( Van der Auwera et al. 2013). The resulting variants were filtered in Tassel5 (Bradbury et al. 2007) with the following parameter: minimum sample counts cut-off 20%, minor allele frequency cutoff 0.05. The location of the SSR markers was determined based on alignment of their primer sequences and the reference genome, in which the tool short-blastn in BLAST 2.7 (Camacho et al. 2008) was used.

Phenotype Statistical Analysis
All the description statistical analyses were performed in R language. The broad sense heritability was computed using the lme4 package (Douglas et al. 2014) using the formula H = V g / (V g + V ge /L + V e /RL), where Vg is the genetic variance, V ge is variance of genetic by environment, V e is the error variance, and L is number of environment and R is the number of repeats in each environment. The BLUE value of each accession was calculated in the lme4 package based on the year × location × replication matrix, with the accession as fix effect and year, location as random effect (Douglas et al. 2014).

Marker Diversity and Population Structure Analysis
The diversity of the SSR markers was analyzed in Pow-erMarker 3.25. PIC was calculated for each marker using the following formula: PIC = 1-Σ (i from 1 to n) P i 2 -Σ (i from 1 to n-1) Σ (j from i+1 to n) 2 × P i 2 P j 2 , where P i and P j are the allele frequencies at alleles i and j, respectively, and n is the number of alleles (Botstein et al. 1980). A higher PIC indicates that a marker is more useful for distinguishing individuals and understanding relationships among them. The major and minor allele frequencies refer to the frequency at which the first and second most common allele occurs in a given population, respectively (Florez 2013). Due to lacking of polymorphism, three markers, i.e., RM129, RM1108 and RM512, were discarded in further analysis. For measuring SNP/InDel variants diversity, nucleotide diversity, defined by Nei and Li (1979), was calculated in VCFTOOLS 0.1.17.
The linkage disequilibrium between each pair of alleles from each pair of markers and their significance levels were analyzed in Arlequin 3.5.2 (Excoffier et al. 2005). The r 2 (r 2 = D A × D B /p A (1-p A )p B (1-p B ), whereas D = p AB -p A × p B ) value was used to measure the level of LD between markers and the significance level (P-value) was calculated using an extension of Fisher exact probability test on contingency tables (Excoffier et al. 2005). The allele pairs with P-value < 0.0001 were recorded as strong LD.
In the following analysis that using SSR markers, all the alleles that exist in with less than 10 accessions were set to missing. The number of subpopulations (K) of the materials was determined in Structure 2.3.4 (Pritchard et al. 2000). The admixture model was used, and the length of burn-in period and the number of MCMC (Markov Chain Monte Carlo) replications after burn-in were set to 100,000 and 500,000 respectively. K equals 2 to 10 was tested and 5 independent runs were made for each value of K. The log-likelihood mean value of the 5 runs at each K was used. The structure result was submitted to Structure Harvester (http://taylo r0.biolo gy. ucla.edu/struc tureH arves ter/), and the optimal K value was determined based on the ΔK (ΔK = mean(|L''K|)/ sd[L(K)]) due to the mean log-likelihood value increased over increased K (Evanno et al. 2005). The population structure matrix (Q) was generated based on the optimal K. The SSR marker based phylogenetic neighbor joining tree was calculated in Tassel 3 (Bradbury et al. 2007) and FigTree 1.4.3 was used to output the figure.

Association Mapping of TGW Trait
The associations between trait and SSR markers, and the effect of marker on the phenotype were calculated in Tassel 3 following the method described by Emon et al (2015) with slightly modified, and the MLM (mixed linear model) was used. The heterozygous loci were removed in the calculation. The population structure Q matrix was from the calculation above, and the Kinship (K) matrix was calculated in Tassel 3 Two thousands of permutation tests were used to help define the cut-off value (significance threshold) (Churchill and Doerge 1994). In the permutation, the original phenotype data was reshuffled and then performed association analysis in Tassel 3. Since no real associations between the SNPs and the 'simulated' phenotypes were expected, a threshold could be set based on these fake association mapping results.
In order to validate the SSR association mapping result and give a higher resolution of the associated loci, 57 of the 462 accessions were randomly selected and re-sequenced (Table S8), and their SNP/InDel variants were called. The associations between the trait and SNP/InDel variants were calculated in Tassel 5 (Bradbury et al. 2007), and the GLM (general linear model) was used. In the SNP/InDel association mapping, we only focused on the region near the significant SSR markers identified in the SSR association analysis. An interval size of ± 500 kb flanking the target SSR markers was used. Similarly to the SSR-TGW association mapping above, the cut-off values of the SNP/InDel variants and TGW trait association were done using 2000 times permutation in Tassel 5.