GBS is a low-coverage sequencing technology that produces a very high missing rate; for example, the missing rate is above 50 % in both the 955,120 and 362,008 SNP datasets. High levels of missing data can be a problem for downstream analyses such as association mapping. Generally, however, imputation of missing data is not necessary for genetic diversity analysis. Imputation uses information from other haplotypes to fill the gaps left by missing data, so in effect it uses sample relatedness to create the new dataset. Using imputed data to analyze sample relatedness could therefore bias the results, showing close haplotypes to be closer than they really are and vice versa. Additionally, the SNP calling and imputation processes can add further layers of error to the genotyping data, biasing SNPs toward regions of the genome that are more likely to be sequenced (i.e., duplicated regions) and toward regions that are more likely to be imputed (i.e., low-divergence regions). It has been shown that, even with a high missing rate, GBS data generated without imputation can be used for genetic diversity and population structure analysis in the temperate maize diversity panel (Romay et al. 2013); our study is the first to show that un-imputed GBS data are also appropriate for genetic diversity analysis in a tropical maize panel.
A MAF threshold of 5 % rather than 1 % was applied in this study, for several reasons. (1) The genotyping error of GBS in temperate maize has been reported as 0.42 % based on data from the NAM (nested association mapping) populations, in which few tropical materials were included (Glaubitz et al. 2014). The genotyping error of GBS is expected to be considerably higher in tropical maize, but we did not find any report of the GBS genotyping error rate estimated in a complete tropical maize panel. In this study, the reference genome used for SNP calling was B73, a temperate maize inbred line. When a temperate maize inbred line is used as the reference to call SNPs for tropical maize inbred lines (i.e., the CML panel tested in this study), the ascertainment bias and SNP calling error of GBS increase. A higher MAF threshold (i.e., 5 %) might help reduce ascertainment bias and select more reliable SNPs. (2) GBS is a low-coverage sequencing technology, which causes a very high missing rate and a large number of SNPs with very low frequency among the genotyped samples, especially in a panel with broad genetic diversity. For this reason, the higher MAF threshold (i.e., 5 %) was applied for SNP filtering in the present panel. However, a much lower MAF threshold (i.e., 1 %) is recommended for filtering SNPs in a panel with lower diversity, where it may greatly increase the power of further genetic diversity analysis. (3) Additionally, a very high marker density, i.e., 362,008 filtered SNPs with an average marker density of 1 SNP/6 kb, was used in this study compared with previous studies. Marker density is one of the strengths of this study, as it increases the power of further analyses. Compared with a MAF threshold of 1 %, the 5 % threshold decreased the number of filtered SNPs from 499,297 to 362,008, and the higher marker density that would be obtained by filtering at the lower MAF threshold may not affect the overall results.
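The effect of the two MAF thresholds discussed above can be illustrated with a minimal sketch. It assumes that each inbred line carries a single homozygous call per SNP, coded 0 or 1, with `None` marking a missing call; the function names and the toy data are hypothetical and do not reproduce the filtering pipeline actually used in this study.

```python
from typing import Dict, List, Optional

# Assumption: one homozygous call per inbred line (0 or 1); None = missing call.
Call = Optional[int]


def missing_rate(calls: List[Call]) -> float:
    """Fraction of lines with no genotype call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Call]) -> Optional[float]:
    """MAF over the non-missing calls only; None if every call is missing."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    freq_of_allele_1 = sum(observed) / len(observed)
    return min(freq_of_allele_1, 1.0 - freq_of_allele_1)


def filter_by_maf(snps: Dict[str, List[Call]], threshold: float) -> Dict[str, List[Call]]:
    """Keep only SNPs whose MAF is at or above the threshold."""
    kept = {}
    for name, calls in snps.items():
        maf = minor_allele_frequency(calls)
        if maf is not None and maf >= threshold:
            kept[name] = calls
    return kept


# Hypothetical toy data: a common SNP, a rare SNP (1 of 40 lines), a monomorphic SNP.
toy_snps: Dict[str, List[Call]] = {
    "snp_common": [0, 1, None, 1, 0, None, 1, 0, 0, 1],
    "snp_rare": [1] + [0] * 39,
    "snp_mono": [0, None, 0, 0],
}

print(sorted(filter_by_maf(toy_snps, 0.05)))  # SNPs retained at MAF >= 5 %
print(sorted(filter_by_maf(toy_snps, 0.01)))  # SNPs retained at MAF >= 1 %
```

In this toy example the rare SNP passes the 1 % threshold but is removed at 5 %, which is the trade-off between marker number and marker reliability described in the text.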
The results of the kinship analysis performed in this study showed that kinship coefficients of 64 % of the paired lines were equal to 0 and only 2 % of them were above 0.05. This information reflected the uniqueness of most inbred lines in the current CML collection, since most of the CMLs were either not related or distantly related to each other. Our kinship coefficient results are similar to those of Wen et al. (2011), who reported that about 60 % of the pairwise kinship coefficients among 359 inbred maize lines were close to zero. However, they are much lower than those of Semagn et al. (2012), who reported that 79 % of the kinship coefficients among 450 inbred maize lines ranged from 0.05 to 0.50; these authors used a maize collection with narrow genetic divergence, as all the lines were developed and released mainly by CIMMYT’s eastern and southern Africa maize breeding programs.
Previous studies have measured LD decay distance in different germplasm collections with various kinds of low-to-medium density genotyping platforms. Compared with previous studies, LD decayed more rapidly with physical distance in this study, and the average LD decay distance was smaller, at 3.76 kb in the entire panel. Yan et al. (2009) reported an average LD decay distance of 5–10 kb in a global maize collection of 632 lines, and Wu et al. (2014) reported an average LD decay distance of 391 kb in a collection of 367 inbred lines widely used in maize breeding in China. The LD decay distance measured in this study was much smaller than that reported in the temperate maize collection, because tropical and subtropical lines are more diverse and contain more rare alleles.
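LD decay distances of this kind are usually derived from pairwise LD values between SNPs plotted against physical distance. As a minimal sketch, the commonly used r² statistic for two bi-allelic loci can be computed as below. The choice of r² is an assumption, since this section does not name the statistic, and so are the 0/1 coding of homozygous inbred calls and the function name.

```python
from typing import List, Optional

Call = Optional[int]  # assumption: 0 or 1 per inbred line; None = missing call


def r_squared(locus_a: List[Call], locus_b: List[Call]) -> Optional[float]:
    """Squared allelic correlation between two bi-allelic loci.

    Only lines called at both loci are used. Returns None when no line is
    called at both loci or when either locus is monomorphic among them.
    """
    pairs = [(a, b) for a, b in zip(locus_a, locus_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None
    p_a = sum(a for a, _ in pairs) / n              # frequency of allele 1 at locus A
    p_b = sum(b for _, b in pairs) / n              # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # frequency of haplotype 1-1
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return None
    d = p_ab - p_a * p_b                            # LD coefficient D
    return d * d / denominator


# Hypothetical toy loci: two in complete LD, two statistically independent.
print(r_squared([0, 1, 0, 1], [0, 1, 0, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

An LD decay distance would then be read from the curve of r² against physical distance, for example as the distance at which the fitted curve falls below a chosen cutoff; that cutoff is a further analysis choice not specified here.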
In this study, the current CML collection was molecularly characterized by performing population structure, principal component and neighbor-joining cluster analyses. The results of these analyses revealed the population structure and clear genetic divergence between temperate and tropical inbred lines, which was in agreement with previous studies (Lu et al. 2009; Wen et al. 2011, 2012). Our results also showed a clear separation by environmental adaptation. Inbred lines from the three major environmental adaptations (i.e., Lowland Tropical, Subtropical/Mid-altitude and Highland Tropical) formed clear clusters. Most lines that are related by pedigree tended to cluster into the same group, which was basically consistent with CIMMYT maize breeding history. However, our results were different from those of several previous studies, where the authors reported a lack of clear clustering patterns in the CIMMYT germplasm based on environmental adaptation or mega-environment (Semagn et al. 2012; Xia et al. 2004, 2005). These differences could be explained by our use of high-density GBS SNPs, which may have increased the resolution of the genetic characterization analysis. In this study, 362,008 filtered SNPs with an average marker density of 1 SNP/6 kb or 11 SNPs/gene were finalized and used in further genetic characterization analyses, assuming that the maize genome is about 2400 Mb and there are approximately 32,000 genes in the maize genome (Yan et al. 2009).
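The marker density figures quoted above follow from simple arithmetic on the stated assumptions (a genome of about 2400 Mb and approximately 32,000 genes). A short check is sketched below; the variable names are illustrative only.

```python
# Values taken from the text; the genome size and gene number are the
# approximations stated there (Yan et al. 2009).
GENOME_SIZE_BP = 2_400_000_000  # about 2400 Mb
GENE_COUNT = 32_000             # approximate number of maize genes
FILTERED_SNPS = 362_008         # SNPs retained after filtering

kb_per_snp = GENOME_SIZE_BP / FILTERED_SNPS / 1_000
snps_per_gene = FILTERED_SNPS / GENE_COUNT

print(f"average spacing: 1 SNP per {kb_per_snp:.2f} kb")
print(f"average coverage: {snps_per_gene:.2f} SNPs per gene")
```

The spacing works out to between 6 and 7 kb per SNP and the coverage to slightly more than 11 SNPs per gene, consistent with the rounded values given in the text.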
Gene diversity values of the three Tropical subgroups were similar and higher than those of the Temperate subgroup, and the average genetic distance between the Temperate subgroup and each of the Tropical subgroups was greater than that between the Tropical subgroups. The greatest genetic distance was observed between the Lowland Tropical and Temperate subgroups, and the smallest genetic distance was observed between the Lowland Tropical and Subtropical/Mid-altitude subgroups. These observations are consistent with the current germplasm exchange patterns, in which there is a constant flow of germplasm from the tropical program into the subtropical program, with little to no exchange with the highland program. The genetic distance between the Temperate subgroup and the Highland Tropical subgroup was smaller than that between the Temperate subgroup and the other two subgroups. However, germplasm exchange rarely occurs between the Temperate and Highland Tropical breeding programs, because the Highland Tropical subgroup has narrow adaptation. The genetic distances estimated between the Temperate subgroup and the other subgroups are not very accurate, especially for the Highland Tropical subgroup, because the small sample sizes of these two subgroups affected the accuracy of genetic distance estimation. In practice, Temperate breeding programs exchange germplasm more frequently with the Subtropical/Mid-altitude breeding program than with the other breeding programs. These analyses revealed ample natural genetic diversity in tropical maize germplasm, suggesting that the current CML collection could be an important resource to help drive future genetic gains in maize breeding programs worldwide.
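For reference, gene diversity at a bi-allelic locus is commonly computed as one minus the sum of squared allele frequencies, and then averaged across loci within a subgroup. The sketch below illustrates that standard definition under the assumption of 0/1 coded homozygous calls with `None` for missing data; it is not necessarily the exact estimator or software used in this study, and the function names are hypothetical.

```python
from typing import List, Optional

Call = Optional[int]  # assumption: 0 or 1 per inbred line; None = missing call


def gene_diversity(calls: List[Call]) -> Optional[float]:
    """Gene diversity at one bi-allelic locus: 1 - (p^2 + q^2), missing calls ignored."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return None
    p = sum(observed) / len(observed)
    q = 1.0 - p
    return 1.0 - (p * p + q * q)


def mean_gene_diversity(loci: List[List[Call]]) -> Optional[float]:
    """Average gene diversity over all loci that have at least one call."""
    values = [v for v in (gene_diversity(locus) for locus in loci) if v is not None]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical toy subgroup with one balanced locus and one monomorphic locus.
toy_subgroup = [
    [0, 1, 0, 1],
    [0, 0, None, 0],
]
print(mean_gene_diversity(toy_subgroup))
```

Because the value depends only on allele frequencies within a subgroup, small subgroups give noisier estimates, which is in line with the caution about sample size expressed in the text.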
It has been shown that SSRs provide higher resolution in genetic diversity analyses, given that the power of one SSR is similar to that of ten SNPs for estimating population structure and relative kinship (Lu et al. 2009; Yan et al. 2009). This is because the information content of array-based SNPs, as measured by allele sharing, is lower than that of the more polymorphic SSRs, and the maximum number of alleles per locus is restricted to two for bi-allelic SNPs. However, a given number of SNP alleles may perform better than the same number of SSR alleles, since many more SNP loci with two alleles provide better genome coverage than a smaller number of SSR loci with more alleles per locus. More weight should thus be given in genetic diversity analysis to the number of loci than to the number of alleles (Lu et al. 2009). Larger numbers of SNPs are required to replace the highly polymorphic SSRs. Several previous studies have shown the efficiency and power of SNP markers in genetic diversity analyses (Lu et al. 2009, 2013; Romay et al. 2013; Semagn et al. 2012; Wu et al. 2014). However, the use of SNP genotyping chips may cause ascertainment bias, which means that markers developed to be polymorphic in one set of germplasm are likely to provide a biased estimate of diversity in another set of germplasm (Lu et al. 2009). The only way to fully remove the bias is to perform de novo sequencing on all the samples; however, the cost of this is still prohibitive. Next-generation sequencing technologies, such as GBS, make low-cost, low-coverage, whole-genome sequencing widely available, which could reduce ascertainment bias to some extent in maize molecular characterization studies. Romay et al. (2013) genotyped 2815 maize inbred accessions from the USA national maize inbred seed bank with GBS, and 681,257 SNPs were developed and successfully used for analyzing the genetic diversity and population structure of this publicly available maize collection and for performing genome-wide association studies on simple and complex inherited traits. In this study, a large and diverse collection of 539 CMLs was genotyped with 955,690 GBS SNPs; these SNPs were called using haplotype information from a collection of more than 60,000 maize samples (the AllZeaGBSv2.7 Production Build) including temperate and tropical germplasm. The reference genome is from B73, a temperate maize inbred line. Of the 955,690 SNPs, 62 % are rare in the collection, which is a little higher than the proportion found by Romay et al. (2013), who reported that more than half the SNPs are rare.
In the current collection, only about half the CMLs (i.e., 243 of 538) have heterotic information, estimated from pedigree information and from combining ability tests through diallel and line-by-tester analyses. Since only a limited number of lines can be included in each combining ability test experiment, it is not possible to estimate the heterotic group and genetic relatedness of all maize lines in the current CML collection via one general combining ability test. Molecular marker analyses provide an alternative approach for large-scale genetic diversity characterization within a given germplasm collection. However, it has been reported several times that the heterotic patterns in tropical maize collections are still not clear compared with those in temperate maize, and that the heterotic patterns estimated from molecular markers are not fully consistent with those estimated from combining ability tests and pedigree information (Lu et al. 2009; Semagn et al. 2012; Wen et al. 2012). In this study, we measured the genetic relatedness among all CMLs and our results confirmed this conclusion: the GBS SNPs were unable to separate heterotic groups A and B, which were established based on combining ability tests. The difficulties of assigning lines to different heterotic groups are due to the diverse origins of the lines and to incomplete pedigree information, regardless of the marker system used. The shorter hybrid-oriented breeding history of tropical germplasm and the use of different testers across breeding programs are probably the other important reasons. The same line can be assigned to heterotic group A or B depending on the tester used, which may result in the mixing up of heterotic groups.
The creation of heterotic groups in maize is based on long-term selection. The development of heterotic groups in temperate maize started around 100 years ago, but heterotic group development work in tropical maize at CIMMYT began only three decades ago, in the mid-1980s. Most CMLs were derived from broad germplasm pools, populations and open-pollinated varieties; only more recently released CMLs were developed from bi-parental crosses using the pedigree breeding method. So it is easy to understand why heterotic patterns in tropical maize are still not clear. But we also found that most of the lines from heterotic group A and heterotic group B tend to cluster together in the Lowland Tropical and Subtropical/Mid-altitude subgroups, respectively. This suggests that short-term selection for hybrid performance has contributed to classifying tropical maize heterotic patterns at CIMMYT. Therefore, combining the current heterotic information based on combining ability tests and the genetic relationships inferred from molecular marker analyses may be the best strategy to define heterotic groups for future tropical maize improvement. The results of this research will also help breeders to understand how to utilize all the CMLs in other breeding activities, such as selecting parental lines, replacing appropriate testers, assigning heterotic groups and creating a core set of germplasm.