Introduction

Sorghum (Sorghum bicolor (L.) Moench) is the fifth leading cereal crop worldwide in both subsistence and commercial agriculture systems, particularly in semiarid regions such as sub-Saharan Africa and South Asia (National Research Council 1996; Mundia et al. 2019). Sorghum bicolor is divided into four groups according to their production characteristics and usage, including grain, forage, sweet sorghum, and broom. Compared to grain sorghum, course forage sorghum, often used for green chop, hay, silage, and pasture (Shoemaker and Bransby 2010), grows quickly, is morphologically taller and leafier, and matures later with a higher biomass and a juicy stem, but relatively lower seed yield (Harlan and De Wet 1972; Getachew et al. 2016). Recently, it has been identified as a potential target for sugar as well as the production of lignocellulosic biofuels (Mathur et al. 2017). Efforts to develop multipurpose forage sorghum cultivars or hybrids with superior stem quality, standability, high sugar levels, high dry matter production, and low prussic acid levels have been initiated using both classical and biotechnological approaches.

Forage sorghum typically grows over 150 cm tall (Getachew et al. 2016), primarily due to lignin, which improves strength and rigidity in cell walls, as well as mechanical support for leaf blades and stems (Frei 2013). Lignin is a phenylpropanoid polymer which possesses complex structure and composition, as well as a high molecular weight (Frei 2013; Liu et al. 2018). It mainly deposits in fibers, such as parenchyma bundle-sheath cells (Frei 2013; Liu et al. 2018). As a result of this trait, together with its tolerance to stresses and relatively high nutrient uptake efficiency, forage sorghum is globally cultivated in diverse geographical areas, even barren regions unsuitable for the production of food and feed (National Research Council 1996; Vermerris 2011; Liu et al. 2018; Mundia et al. 2019). However, lignin affects the digestibility of forage in animals and is considered both an unfavorable component of forages and an antinutritive in animals (Frei 2013). Reducing lignin while increasing sugar and cellulose content are current priorities in breeding, in order to improve digestibility by livestock. Sorghum bicolor is highly adaptable across an extensive geographic range; sources for the improvement of traits may be found in traditional varieties from Africa and Asia (National Research Council 1996; Casa et al. 2008; Upadhyaya et al. 2009; Vermerris 2011; Morris et al. 2013; Mathur et al. 2017; Mundia et al. 2019). However, the complex genetics for these remarkable traits of forage sorghum have not been well elucidated.

In-depth sequencing of whole genomes of diverse Sorghum bicolor populations is increasingly being used to identify genes and loci of underlying complex traits (Casa et al. 2008; Upadhyaya et al. 2009; Vermerris 2011; Morris et al. 2013; Mathur et al. 2017) at an unprecedented rate (The International HapMap Consortium 2007; The Wellcome Trust Case Control Consortium 2007). Genome-wide single nucleotide diversity scans (Casa et al. 2006; Zheng et al. 2011; Bouchet et al. 2012) and association studies (Morris et al. 2013) have been performed to investigate genome-wide and intraspecific variation for dissecting the genetic basis of important traits, and for tailor-designed breeding in sorghum. However, compared to maize (Jiao et al. 2012) and rice (Huang et al. 2010), less is known. Brown midrib (bmr) mutants in Sorghum bicolor (Porter et al. 1978) possess a reduced concentration and altered composition of lignin, as well as improved digestibility of cell walls, and are demonstrated to have desirable qualities for livestock forage development as well as the rapidly growing lignocellulosic biofuel industry (Cherney et al. 1991; Oliver et al. 2004). Several bmr mutants are involved in specific enzymatic steps of the monolignol biosynthetic pathway, such as cinnamyl alcohol dehydrogenase (CAD) (Pillonel et al. 1991; Saballos et al. 2012; Scully et al. 2016), among which a CAD allele (SbCAD2) with an 8-bp deletion in its 5′-untranslated region (UTR) was mapped and cloned in the sorghum germplasm line PI 595743 (Li and Huang 2017). Barcode multiplexing and high-throughput sequencing (Elshire et al. 2011) were applied to develop thousands of markers for rapid genotyping of four classical dwarfing loci (Quinby 1975) in sorghum at a low cost. Recently, Zheng et al. (2011) sequenced the inbred lines of two sweet and one grain sorghum and identified an arrangement of almost 1500 genes differentiating the two sorghums. Morris et al. (2013) analyzed the diversity of sorghum and mapped several classical loci and genes for plant height and inflorescence architecture with genome-wide association analysis (GWAS) by characterizing 971 sorghum accessions at 265,487 SNPs with genotyping-by-sequencing (GBS).

The genetic diversity among forage sorghum accessions has been studied with simple sequence repeats (SSR) (Ping et al. 2018) and sequence-related amplified polymorphism (SRAP) (Zhang et al. 2018). Zhang et al. (2018) revealed the narrow SRAP genetic diversity of 21 introduced Sorghum bicolor bmr sterile lines, whereas Ping et al. (2018) reported relatively abundant genetic diversity among 48 forage sorghum germplasms by 109 SSRs. To better understand the genetic diversity of forage sorghum, we characterized 206 lines collected worldwide by genome sequencing a much larger sample population than those used by previous studies (Ping et al. 2018; Zhang et al. 2018). GWAS was applied to identify genes underlying lignin content in forage sorghum.

Materials and methods

Plant materials

A set of 206 diverse forage sorghum accessions were collected worldwide, including 30 male sterile lines, 30 maintainer lines, and 146 fertility restorer lines, representing various geographical origins and states of improvement.

All 206 sorghum accessions were grown for phenotypic evaluation of agronomic traits and quality in the experimental station of Jinzhong City, Shanxi Province, China (112° 42′ N and 37° 36′ E) from 2016 to 2018 cropping seasons. Using a randomized complete block design with three replications, seeds were planted in 1236-row plots. Plots were 5 m in length with a row spacing of 0.5 m. Sowing density was approximately 20 seeds/row. At the end of August 2018, in order to measure quality-related components (including lignin), 206 forage sorghum accessions were sampled in batches according to their heading dates. The fresh samples were ground using a pulverizer into particles 1–2 cm in size, and evenly divided into 500 g in Kraft paper bags. The samples were first dried at 65 °C for 72 h after inactivation at 105 °C for 1 h in a programmable oven. The dried samples were then analyzed for quality-related components by the Beijing Quality Testing Center, China.

Determination of lignin content

The 100-g samples per accession were dried to less than 10% moisture and milled using a cyclone mill to a 1-mm particle size. The sieved powder was collected for spectral scanning with FOSS 5000 near infrared (NIR) analyzer (FOSS, Hillerød, Denmark). Thirty-two scans were performed for each sample using the wavelength range of 1100 to 2500 with 2-nm data spacing. Spectral data was collected using Vision software (version 3.5.0.0). Each sample was loaded and scanned 2 times, and the average spectrum was regarded as the sample spectrum used for NIR analysis.

DNA isolation and genome sequencing

Following 3 weeks of growth, genomic DNA was extracted from the young leaves of one sorghum plant from each accession. DNA extraction was performed using the cetyltrimethylammonium bromide (CTAB) method (Murray and Thompson 1980) for genome sequencing on the Illumina HiSeq system (PersonalBio, Shanghai, China). Briefly, following quality assessment, genomic DNA was randomly fragmented using sonification and size-fractionated through electrophoresis, and DNA fragments of the desired length (400 bp) were gel purified. Using an insert size of approximately 400 bp, Illumina paired-end sequencing libraries were constructed, following the manufacturer’s instructions (Illumina, San Diego, USA). Libraries were sequenced as 2X150 bp paired reads.

Read alignment and variation calling

First, raw data (raw reads) in fastq format were cleaned using in-house perl scripts to remove reads containing adapter, ploy-N, and low-quality reads. Meanwhile, high-quality clean data was calculated for Q20, Q40, and GC contents, and all downstream analyses were performed. Clean reads were mapped to the reference genome BTx623 (v3.1) (McCormick et al. 2018) (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/Sorghum_bicolor/latest_assembly_versions/GCA_000003195.3_Sorghum_bicolor_NCBIv3) with Burrows-Wheeler Alignment (BWA) software (Li and Durbin 2009) (version 0.7.12-r1039) using the default parameters. SNP and INDEL were called by the Genome Analysis Toolkit (McKenna et al. 2010) (GATK, version: 3.8). All raw SNPs were filtrated with a Fisher test of Strand Bias (FS) ≤ 60, HaplotypeScore ≤ 13.0, Mapping Quality (MQ) ≥ 40, Quality Depth (QD) ≥ 2, ReadPosRankSum ≥ -8.0, MQRankSum > -12.5, alternative allele called in ≥4 reads, and removal of SNPcluster sites. Indels were filtrated with a Fisher Test of Strand Bias (FS) ≤ 200, Quality Depth (QD) ≥ 2, and ReadPosRankSum ≥ -20. Annotations of SNP and INDEL were performed based on the gene model set v2.0 from Phytozome v11.0 using ANNOVAR (Wang et al. 2010) (version: 2015-03-22).

Population genetics analysis and GWAS

In the population genetics analysis and GWAS, only SNPs with minor allele frequency (MAF) ≥ 0.05 were used. Hierarchical population structure was developed using the MEGA 6.0 software (http://www.megasoftware.net/) with bootstrap replications of 1000. Principal component analysis (PCA) of the population was performed using R package. Genetic relatedness (additive genetic effect matrix) was determined using the genome-wide SNP data (Endelman and Jannink 2012). Whole-genome linkage disequilibrium (LD) was separately calculated (r2) for all accessions and each subgroup using TASSEL 3.0 (Bradbury et al. 2007). GWAS was performed using the Efficient Mixed-Model Association eXpedited (EMMAX, beta version) (Kang et al. 2010) software package, and in order to determine the threshold, distribution of the most significant P values across the 1000 replicates was used. −log10(P value) > 6 was used as the threshold for all binary-like quantitative traits. The matrix of pairwise genetic distances, which were derived from the simple matching coefficients, as the variance-covariance matrix of the random effects, was also calculated by EMMAX.

Results

Phylogenetic relationships of 206 diverse sorghum accessions

For the 206 accessions, whole-genome resequencing (WGRS) data was generated. This includes 89 Chinese core collection accessions, 4 landraces, and 113 other accessions from different sorghum-producing nations and areas, representing the genetic, geographic, and morphological diversity of forage sorghum. In total, 7.31 billion paired-end reads were generated with sequencing depths ranging from 3.56× to 14.06× and genome coverage of approximately 90.01%. Following mapping against the reference genome and the SNP calling, a total of 14,570,430 SNPs with MAF ≥ 0.05 and 1,967,033 indels were identified. In the PCA, based on all collected SNPs, the first and second principal coordinate (PC) explained 17.87% and 5.16% of the molecular variance for SNPs, respectively (Fig. 1a). Genetic relationships among 206 accessions reflected a more diverse spread of sorghum accessions, as well as breeding lines from Asia and America, Asia × America, Asia × Australia, and a limited genetic diversity of European accessions (Fig. 1b).

Fig. 1
figure 1

Genetic structure of 206 worldwide sorghum accessions using SNPs detected in whole-genome resequencing data. a PCA, b genetic relatedness (additive genetic effect matrix) using SNPs detected in whole-genome resequencing data, c neighbor joining tree constructed from simple matching distance of all SNPs, and d genome-wide average LD decay

Analyses based on pairwise dissimilarity of all the SNPs in the collection using neighbor joining revealed two distinct groups (Fig. 1c). Group I contained 60 American accessions and one Asia × America, whereas group II included all breeding lines and accessions. Accessions from Europe, America, and Asia, and breeding lines of Asia × America and Asia × Australia were predominantly but not exclusively divided into several subgroups or minor groups within group II, showing the prevalence of genetic mixtures of breeding lines and landraces in sorghum breeding.

Whole-genome patterns of linkage disequilibrium

Whole-genome patterns of linkage disequilibrium (LD) were analyzed separately for Asia and America, Asia × America, Asia × Australia, and European accessions using the SNP data. The LD decay rate was measured as the chromosomal distance at which the average pairwise correlation coefficient (r2) dropped to half its maximum value. On average, LD decays to 50% of its initial value by 1 kb and to background levels (r2 < 0.1) within 150 kb. A rapid decline in LD was observed with increasing physical distance on subgroups of Asia and America, Asia × America, Asia × Australia, and European accessions (Fig. 1d). The decay rate varied widely across the genomes of Asia and America, Asia × America, Asia × Australia, and European subgroups, which will presumably lead to differential resolutions of association mapping at different genomic regions. European accessions had the lowest LD decay rate, which might be a result of self-fertilization coupled with a relatively small effective population size.

Genome-wide associations with agronomic traits

In total, 19 agronomic traits and 18 feedable quality traits were phenotyped in the population of 206 accessions in three different experiments from 2016 to 2018 cropping seasons using a randomized complete block design with three replications. Using the pooled phenotyping data, a total of 14,570,430 SNP markers with MAF ≥ 0.05 from 206 accessions were analyzed for 19 agronomic traits and 18 feedable quality traits. EMMAX was used to estimate marker-trait associations (MTAs). As shown in Fig. 2a, the lignin content values across all 206 sorghum accessions showed continuous frequency distribution with two peaks, suggesting polygenic and quantitative inheritance of lignin content. A total of 9 candidate regions (−log(P) > 6) spanning 100 kb were significantly associated with lignin content on chromosomes 1, 4, 6, 7, and 9 (Fig. 2b and c), supported by the strong LD among associated SNPs.

Fig. 2
figure 2

GWAS of the sorghum lignin content. a Distribution of the lignin content values across 206 sorghum accessions. b Manhattan plot for lignin content. The −log10 P values from a genome-wide scan are plotted against the position on each of the 10 chromosomes. The horizontal dashed line indicates the genome-wide significance threshold (P = 5 × 10−8). c Quantile plot for lignin content. The horizontal axis shows −log10 transformed expected P values, and the vertical axis indicates −log10 transformed observed P values

Geographic distribution of QTL alleles and lignin content variation between haplotypes

The geographic distribution for SNPs of 9 QTLs in 206 accessions was characterized in order to acquire increased knowledge of haplotype spread and origin linked to lignin content (Fig. 3). Allelic distribution for C1_21143564 showed that European accessions carried haplotypes CT and TT only, whereas the America × Asia breeding lines contained haplotypes CC and TT. The lignin content in haplotype CC was significantly lower than those found in both haplotypes TT and CT (Fig. 3a). For C1_S24250037, European sorghum only carried haplotype TT, whereas the America × Asia breeding lines did not contain haplotype TC. Haplotype CC was significantly associated with lower lignin content compared with haplotypes TC and TT (Fig. 3b). For C4_S4831311, no European accession carried haplotype TT that was significantly related to lower lignin content (Fig. 3c). European accessions carried haplotype AA only, whereas the America × Asia breeding lines did not contain haplotype AG for C4_S5402582 (Fig. 3d). The lignin content in haplotype GG was significantly lower than that in both haplotypes AG and AA. For C6_S19087442, only haplotypes TC and TT existed in all accessions with no significant effect on lignin content (Fig. 3e). For C7_S4285582, European accessions carried haplotype GG only, whereas America × Asia breeding lines did not contain haplotype AG. Haplotype AA was significantly associated with lower lignin content compared to haplotypes AG and GG (Fig. 3f). The America × Asia breeding lines did not contain haplotypes AA and TA, whereas the European accession did not carry haplotype AA for C9_S57393274. The lignin content in haplotype TT was significantly lower than that of haplotype TA but insignificantly different from haplotype AA (Fig. 3g).

Fig. 3
figure 3figure 3figure 3

The geographic distribution of alleles and lignin content variation among different haplotypes in each QTL. a C1_S21143564. b C1_S24250037. c C4_S4831311. d C4_S5402582. e C6_S19087442. f C7_S4285582. g C9_S57393274

Genetic collinearity analysis

Nine candidate regions for lignin content revealed 255,689 and 253,400 SNPs in the regions on chromosomes 1, 433, 000, and 435, 667 SNPs within the regions on chromosome 4, and 200,000 SNPs in the regions on chromosomes 6, 7, and 9 (Table 1), thereby covering 184 genes. Collinearity analysis showed a high degree of collinearity in 13 of 184 sorghum lignin-related loci with previously reported gene families in grasses, including corn and rice (Table 2). SORBI_3001G222700 is highly collinear with Wda1 (Os10g0471100) in rice, encoding a protein similar to those involved in the production of wax.

Table 1 Candidate QTL regions significantly associated with lignin content
Table 2 Gene collinearity of lignin content between sorghum and crops (corn and rice)

Discussion

The improvement of crop productivity has been greatly facilitated by pyramiding several advantageous traits into a singular type. However, the availability of genomic resources for the genetic dissection of complex traits in Sorghum bicolor is limited in comparison to those in rice (Huang et al. 2010) and maize (Jiao et al. 2012). Targeted efforts to exploit the tremendous genetic diversity in Sorghum bicolor germplasm collections desirable traits are underway (Zheng et al. 2011; Morris et al. 2013; Ordonia et al. 2016). Systematic assessment and application of available germplasm is important for the improvement of forage sorghum, through access to allelic variations that affect essential agronomic traits such as biomass, and feedable quality traits such as lignin content. In the present study, we characterized 14,570,430 SNPs and 1,967,033 indels, creating a valuable resource for both future gene-phenotype studies and molecular breeding of Sorghum bicolor, in addition to related species. The population structure of 206 Sorghum bicolor germplasm was characterized using genome-wide SNPs, which are a priority in decisions regarding selection strategies and plant breeding options for elite forage sorghum varieties and hybrids. Genetic relationships among 206 accessions revealed a more diverse spread of sorghum accessions and breeding lines from Asia and America and their genetically improved variety, and a limited genetic diversity in European accessions, adding to the knowledge of historical crop diffusion processes, within and across the agroclimatic zones of America, Asia, and Europe.

The quantity of genomic data increases understanding of the allelic variation in the genetic resource collections; it will also encourage breeders to propose an efficient path for improvement of varieties by design. Morris et al. (2013) traced the independent spread of several haplotypes containing alleles for short stature and long inflorescence branches. In total, we identified 9 QTLs for lignin content and observed evidence of the independent spread of multiple alleles controlling lignin content (Fig. 3a–g). In general, favorable alleles for lower lignin content are observed in Asian and American accessions and their genetically improved variety, whereas unfavorable alleles were found in the European sorghum population. Interestingly, none of the alleles for lower lignin content at loci C1_S21143564, C1_S24250037, C4_S5402582, and C7_S4285582 were restricted to European accessions (Fig. 3a–g). This suggests that it is not QTL that is responsible for the lower lignin content phenotype in European accessions. The alleles for lignin content aid in reductions to molecular diversity and are found in specific genomic regions, such as on C1_S21143564, C1_S24250037, C4_S5402582, and C7_S4285582; specifically, those that have experienced selective sweeps. The lignin-related gene sets were significantly enriched in the lignin-associated biological process of plant-type cell wall organization (P < 0.01). In addition to the identification of 9 QTLs for lignin content, we revealed a high collinearity in 13 of 184 sorghum lignin-related loci with previously reported genes in crops such as maize and rice (Table 2). SORBI_3001G222700 might act like Wda1 (Os10g0471100) (Jung et al. 2006) in rice to encode proteins highly similar to those involved in the production of wax. This information will guide breeders in developing a better strategy to improve forage sorghum.

Sorghum bicolor has demonstrated qualities such as high biomass yield, superior drought tolerance, and efficient use of nutrients, and serves as one of the leading biomass feedstocks for livestock. Lignin imparts strength and rigidity to plant cell walls as well as mechanical support for stems and leaf blades (Buxton and Redfearn 1997; Frei 2013; Liu et al. 2018). Research has demonstrated that lignin and lignin-like compounds, alongside other constituents of cell walls, provide resistance to disease, insects, cold temperatures, and other biotic and abiotic stresses (Rest et al. 2006; Tripathi et al. 2003; Ithal et al. 2007; Shadle et al. 2007). Lignin is an important component of the secondary cell wall, but is also a major impediment to biomass feedable utilization for feedstocks (Frei 2013), due to interference with microbial degradation of fiber polysaccharides. It acts as a physical barrier and is cross-linked to polysaccharides by ferulate bridges (Buxton and Redfearn 1997). Thus, forage sorghum breeding practices for improving digestibility without negative effects on growth and survival in field environments are challenging, due to the level of reduction available for lignin and other cell wall components. Selection for low lignin concentration in alfalfa resulted in less growth vigor and poorer field survival (Buxton and Casler 1993). Therefore, an improvement strategy for lignin that minimizes developmental defects in lignin-modified plants involves a set of genes that coordinate composition of lignin synthesis, deposition, and integration into the cell wall network. MTAs and accessions containing desirable alleles, as identified in this study, will help speed up the rate of improvement of forage sorghum varieties and hybrids, with a desirable range of lignin content using genomics-assisted breeding with sequence variation. Further studies are needed to define the causal genes for these and other MTAs, in order to improve understanding of the underlying regulatory mechanisms of lignin production. This in turn will aid in the long-term goal of manipulation of cell wall structure in biomass feedstocks.

In conclusion, genomic data for the worldwide collection of forage sorghum, inferences drawn from their analysis, and MTAs for agronomically relevant traits collectively provide valuable resources to accelerate genetic gains in programs dedicated to the improvement of forage sorghum. The candidate genes participating in lignin content lay the foundation for studying the potential use of functional genomics methodologies to validate the function and effect of candidate genes in regulating lignin content.