Introduction

(-)-Epicatechin (EC) is a flavan-3-ol monomer that belongs to a subclass of flavonoids (Abdulkhaleq et al. 2017). Flavan-3-ols, including catechin and epicatechin, are polyphenolic compounds that polymerize into oligomers or polymers to form proanthocyanidins (PAs), also known as condensed tannins, as the end products of the central phenylpropanoid pathway (Ha et al. 2018; Rauf et al. 2019). Studies on PA digestion indicate that only monomers, dimers, and trimers of flavan-3-ols are absorbed by the gut (Beecher 2009; Cires et al. 2017). Higher-degree PA oligomers and polymers accumulate in the gut lumen and are metabolized in the colon by the microbiota, leading to the production of low molecular weight compounds (Cires et al. 2017). EC also contributes to the flavor and astringency of food (Liu et al. 2013, 2015). Consumption of EC provides diverse benefits to human health due to its antioxidant, anti-inflammatory, antitumor, and cardioprotective properties, which help reduce the risk of onset or progression of cerebrovascular damage, age-related cognitive decline, Alzheimer’s disease, and depression (Gutierrez-Salmean et al. 2014; Bonetti et al. 2017; Prakash et al. 2019). EC and PAs provide 20-fold greater and more stable protection against free radicals than vitamin C and 50-fold better protection than vitamin E (Zhong et al. 2018). EC also contributes to brown pigmentation across a wide range of species, including soybean, Arabidopsis, and tea (David et al. 2014; Ha et al. 2018; Jun et al. 2018; Tan et al. 2020). Even in the absence of active oxidative enzymes, EC can oxidize spontaneously to a brown form (Tan et al. 2020) that is used as a natural coloring agent by the food industry (Koh et al. 2014). EC shows promise as a supplement for sarcopenia patients due to its apparent ability to mitigate the loss of muscle mass and the reduction in skeletal muscle quality (Gutierrez-Salmean et al. 2014; Zbinden-Foncea et al. 2022).

Soybean (Glycine max (L.) Merr.) is one of the most widely grown crops in the world, providing protein and oil for human food and animal feed. Traditional soybean breeding programs have enhanced agronomic traits, including yield, disease and insect resistance, and grain quality (Sudaric et al. 2006; Flavio and Coelho 2013; Yang et al. 2020; Zhang et al. 2022). As soybean seeds are rich in flavonoid metabolites, however, there is growing interest in the development of varieties rich in one or more phytochemical fractions, as such varieties will contribute to improvements in human health and the development of new market opportunities (Palmer and Kilen 1987; Koh et al. 2014; Yang et al. 2016; Miladinović et al. 2019; Wu et al. 2020). Many cultivated soybean varieties have yellow-colored seed coats and hilum, and contain no EC or PAs; however, EC and PAs are present in varieties with brown or black seed coats and hilum at levels that vary by variety (Ha et al. 2018; Lim et al. 2021; Lu et al. 2021). In colored soybean seeds, the PA component consists almost exclusively of EC (Jun et al. 2018). The absence of pigmentation in the seed coat and hilum results from a naturally occurring siRNA mechanism that acts via clusters of inverted gene repeats to target transcription of the CHALCONE SYNTHASE (CHS) gene family at the I (inhibitor) locus on chromosome 8 (Cho et al. 2019, 2000). Many isogenic recessive mutations in the 27 kb inverted repeat clusters of CHS give rise to black or brown pigmentation, either in the hilum alone or in both the hilum and seed coat. When dominant I or ii alleles are present, CHS siRNAs accumulate only in the seed coat, not in the cotyledons, roots, leaves, or stems (Tuteja et al. 2009; Cho et al. 2013). The concentration of EC, however, cannot be determined solely from observations of seed coat and hilum color, and thus the factors determining the levels of EC in colored soybean seed coats remain to be revealed.

This study was designed to identify quantitative trait loci (QTLs) associated with EC and hilum color in a population of soybean recombinant inbred lines (RILs) derived from crosses between the mapping parents, Jinpung and IT109098 (Fig. 1a). Jinpung is a Korean cultivar that has yellow seed coats with a tawny-colored hilum; IT109098 is a Korean landrace with greenish-brown seed coats and a dark brown hilum. A dense linkage map constructed with single nucleotide polymorphism (SNP) markers was used to identify QTLs in the RILs. We used sequence variation analysis of the mapping parents and expression level changes in specific seed growth stages in three soybean genotypes (Williams82, IT109098, and Hwangkeum) to identify candidate genes responsible for variation in EC levels. In addition, we performed allelic stacking studies to reveal which combinations of alleles were critical in determining the EC content of soybean seeds.

Fig. 1

Seed characteristics of the parental lines Jinpung and IT109098 and a mapping population of recombinant inbred lines (RILs) derived from Jinpung × IT109098. a Left: Jinpung has a yellow seed coat color; right: IT109098 has a green–brown seed coat color. b RILs with varying hilum colors from 0 to 4. Scale bar: 10 mm. (Color figure online)

Materials and methods

Plant materials and determination of EC content

A mapping population was constructed on the Seoul National University Experimental Farm (Suwon, Republic of Korea, 37.3° N, 127.0° E) in 2016 by crossing two soybean varieties, Jinpung and IT109098. This population contains 235 F7 and F8 RILs produced via single-seed descent.

The experiment was carried out over two consecutive years, 2020 and 2021. In each year, we harvested three fully grown soybean plants of each genotype and determined the EC content of their seeds (three biological replications). Seeds were ground in a WSG-9100 coffee grinder (Wiswell, Seoul, Korea), and 3 g of each sample was collected, dissolved in 30 mL of 80% methanol (Fisher Scientific, Hampton, NH, USA), and mixed by vortexing for 1 min. Samples were sonicated for 1 h using a Powersonic 420 sonicator (Hwasin Technology, Daegu, Korea), with 2–3 ice packs added to hold the temperature at 23–30 °C and limit heat degradation. Subsequently, they were incubated at 22 °C for 24 h in an HB-201SF shaking incubator (Hanbaek Science, Bucheon, Korea). For EC analysis, 1 mL of supernatant was filtered through a 0.45 μm filter (Toyo Roshi Kaisha Ltd., Tokyo, Japan) and analyzed by high-performance liquid chromatography (HPLC) using a Dionex UltiMate 3000 system (Thermo Fisher, Waltham, MA, USA; pump, autosampler, oven) and a Shodex RI-101 refractive index detector (Shodex, Tokyo, Japan). The EC standard (> 98% purity by HPLC) was purchased from Sigma-Aldrich (St. Louis, MO, USA). Quantitative analysis of epicatechin was conducted using an Inno C-18 column (YoungjinBiochrom, Sungnam, Korea; 5 μm, 4.6 × 250 mm) under sub-ambient temperature conditions. The mobile phases consisted of 0.3% trifluoroacetic acid in deionized water (solvent A) and acetonitrile (solvent B) (JT Baker, Center Valley, PA, USA). The gradient profile was set as follows: from 0 to 1 min, 10% B; from 1 to 20 min, 10–25% B; from 20 to 30 min, 25–100% B; and from 30 to 35 min, returning to 10% B. The flow rate was maintained at 1.0 mL/min, and the injection volume was 10 μL. UV absorption was monitored at 280 nm.
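The gradient profile described above can be written as a small piecewise-linear function, which makes it easy to check the solvent composition at any time point. This is only a sketch: the name `percent_b` is hypothetical, and it assumes that every segment changes linearly, including the final return to 10% B, which the description does not specify.

```python
# Breakpoints (time in min, % solvent B) taken from the gradient described above.
# Assumption: composition changes linearly between breakpoints, including the
# 30 to 35 min return to 10% B (the text does not say whether this is a ramp or a step).
GRADIENT = [(0.0, 10.0), (1.0, 10.0), (20.0, 25.0), (30.0, 100.0), (35.0, 10.0)]


def percent_b(t_min: float) -> float:
    """Return % solvent B (acetonitrile) at time t_min by linear interpolation."""
    if t_min < GRADIENT[0][0] or t_min > GRADIENT[-1][0]:
        raise ValueError("time outside the 0 to 35 min run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole run")
```

For example, `percent_b(25.0)` falls halfway through the 25–100% B segment; solvent A makes up the remainder of the mobile phase at each time point.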

Phenotypic evaluation

The hilum color of each genotype (three biological replications) was scored on a scale of 0–4 (Fig. 1b) via observations of 100 seeds after harvest. Statistical analyses of all phenotypic traits, including normality tests, t-tests, and correlations, were conducted using R v4.2.1 software (r-project.org). The broad-sense heritability (H²) for each trait was calculated using the ‘lme4’ package in R, following the formula (Wyman and Baker 1991):

$$ H^{2} = \frac{{V_{G} }}{{V_{G} + V_{E} }} = \frac{{\sigma_{G}^{2} }}{{\left( {\sigma_{G}^{2} + \frac{{\sigma_{GE}^{2} }}{n} + \frac{{\sigma_{E}^{2} }}{nr}} \right)}} $$

where VG and \({\sigma }_{G}^{2}\) represent the genetic variance, VE represents the environmental variance, \({\sigma }_{GE}^{2}\) refers to the genotype-environment interaction variance, and \({\sigma }_{E}^{2}\) refers to the error variance. The term “n” refers to the number of environments, while “r” refers to the number of replicates per environment.
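As a worked illustration of the formula above, the sketch below computes H² from variance components that have already been estimated (for example, from a mixed model). The function name and the numerical inputs are hypothetical and are not values from this study.

```python
def broad_sense_heritability(var_g: float, var_ge: float, var_e: float,
                             n_env: int, n_rep: int) -> float:
    """H^2 = var_g / (var_g + var_ge / n + var_e / (n * r))."""
    if n_env <= 0 or n_rep <= 0:
        raise ValueError("n_env and n_rep must be positive")
    denominator = var_g + var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / denominator


# Illustrative variance components only (not estimates from this study):
# 2 environments (years) and 3 replicates per environment.
h2 = broad_sense_heritability(var_g=8.0, var_ge=2.0, var_e=6.0, n_env=2, n_rep=3)
print(f"H^2 = {h2:.2f}")
```

With these inputs the denominator is 8 + 2/2 + 6/6 = 10, so H² = 0.8; larger numbers of environments or replicates shrink the non-genetic terms and raise H².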

Genetic map construction and QTL analysis

A genetic map was constructed for the Jinpung × IT109098 RIL population following Elshire’s genotyping-by-sequencing (GBS) pipeline (Elshire et al. 2011). Seeds of each genotype (Jinpung, IT109098, and 235 RILs) were grown to soybean growth stage V4 and fresh young leaves were collected. Genomic DNA was extracted using a GeneAll® Exgene™ Plant SV Kit (GeneAll®, Seoul, Korea). Resequencing of Jinpung and IT109098 was performed with paired-end reads at sequencing depths of 10× and 26×, respectively. DNA sequencing of RILs was performed with single-end reads using an Illumina platform (Illumina Inc., San Diego, CA, USA). The Glycine max reference genome (Wm82.a2.v1) was obtained from the Phytozome website (phytozome.jgi.doe.gov). FASTQ files of raw reads were processed with Trimmomatic v0.39 software (Bolger et al. 2014); sequences were mapped to the reference genome using Burrows-Wheeler Aligner software (Li 2013). Sequence variants were called with BCFtools mpileup. Only variants with quality score > 30 and depth > 5 were retained; small insertions and deletions (InDels) were removed using VCFtools to identify the segregating SNPs used in further analysis. SNPs were further filtered using a missing-data cutoff of 0.7. The genetic map was constructed using JoinMap® v4.1 (Ooijen 2011). The QTL analysis was performed using QTL IciMapping v4.2.53 software (Meng et al. 2015), with inclusive composite interval mapping of additive effects selected. The mapping step was set to 1.0 centimorgan (cM) and the LOD threshold was determined from 1,000 permutations at a Type I error rate of 0.05.
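The variant-retention rule described above (SNPs only, quality score > 30, depth > 5) can be sketched as a simple predicate. This is an illustration and not the BCFtools/VCFtools commands used in the study: the `Variant` record and function names are hypothetical, and multi-allelic sites are treated as non-SNPs for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal variant record (a subset of VCF fields)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    depth: int


def is_snp(v: Variant) -> bool:
    # Single-base REF and ALT; InDels and multi-allelic ALT strings are rejected.
    return len(v.ref) == 1 and len(v.alt) == 1


def passes_map_filter(v: Variant, min_qual: float = 30.0, min_depth: int = 5) -> bool:
    """Keep SNPs with quality strictly above min_qual and depth strictly above min_depth."""
    return is_snp(v) and v.qual > min_qual and v.depth > min_depth
```

The thresholds are parameters so that the same predicate can express the looser depth cutoff (> 3) used for the survey of sequence variation in the QTL regions.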

Survey of sequence variance in QTL regions

Sequence variations in protein-coding genes (including ~ 2 kb upstream and downstream flanking regions) between the mapping parents were analyzed to identify candidates responsible for the QTL effects. The resequencing data from the mapping parents were mapped to the reference genome (Wm82.a2.v1) using Burrows-Wheeler Aligner software (Li 2013), and variants were called using BCFtools. Variants with quality score > 30 and depth > 3 (a lower depth threshold than that used in genetic map construction) were retained using VCFtools. SnpEff software (Cingolani et al. 2012) was used to predict the changes in, and impacts of, variant SNPs and InDels within the QTL regions. Annotations of orthologous regions of the Arabidopsis thaliana genome were obtained from TAIR (Araport11, arabidopsis.org) and used to identify putative functions of soybean genes within the QTL regions.

Phenotypic variance via allele combinations

Variant SNPs called and filtered during the construction of the genetic map were used to identify alleles derived from either of the mapping parents in two QTLs, qEC06 (Chr06: 18,555,043–18,864,382) and qEC08 (Chr08: 8,771,172–9,214,678); the latter QTL lay near the I locus (Chr08: 8,002,534–8,771,172). Only homozygous alleles present in these regions were retained. Heterozygous alleles and alleles whose zygosity could not be determined due to missing values at marker positions within the regions were removed. Allele combinations shared by three or more RILs within these regions were then used in allele-stacking studies to determine differences in EC content or hilum color among allele combinations.
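The grouping logic described above can be sketched as follows. The genotype coding ("A" for Jinpung, "B" for IT109098, "H" for heterozygous, `None` for missing) and the function names are hypothetical. The sketch also assumes that a region carrying a mixture of parental alleles is discarded, which the text does not state.

```python
from collections import defaultdict


def region_allele(calls):
    """Return 'A' or 'B' if every marker call in the region agrees, else None.

    Heterozygous ('H') or missing (None) calls make the region undetermined.
    Assumption: regions mixing 'A' and 'B' calls are also treated as undetermined.
    """
    if not calls or any(c is None or c == "H" for c in calls):
        return None
    alleles = set(calls)
    return alleles.pop() if len(alleles) == 1 else None


def stack_alleles(rils, min_lines=3):
    """Group RILs by allele combination across regions (sorted by region name).

    Only combinations carried by at least `min_lines` RILs are returned.
    """
    groups = defaultdict(list)
    for name, regions in rils.items():
        combo = tuple(region_allele(regions[r]) for r in sorted(regions))
        if None not in combo:
            groups[combo].append(name)
    return {combo: names for combo, names in groups.items() if len(names) >= min_lines}
```

Each returned key is a tuple of parental alleles, one per region, so trait values such as EC content can then be compared between groups.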

Identification of differentially expressed genes within qEC08

Sequence read archive (SRA) files of RNA sequencing data obtained previously (Ha et al. 2018) from seeds of soybean variety Hwangkeum at the R5 (SRA5838836) and R7 (SRA5838835) growth stages, and from seeds of IT109098 at the R5 (SRA5838834) and R7 (SRA5838833) growth stages, were converted into FASTQ format using the fastq-dump option of the SRA Toolkit software (Leinonen et al. 2011); raw reads were processed with Trimmomatic v0.39 software (Bolger et al. 2014), followed by mapping to the reference genome via Hisat2 (Kim et al. 2019). FeatureCounts (Liao et al. 2014) was used to obtain raw read counts for each transcript, and edgeR (Robinson et al. 2010) was used to retain read counts > 10 and apply trimmed mean of M-values (TMM) normalization, giving an estimate of relative RNA production levels in seeds of Hwangkeum and IT109098 at the R5 and R7 growth stages. Expression levels of genes within qEC08 in each genotype at these two stages were used to identify differentially expressed genes (DEGs) and are illustrated in a heatmap (Fig. S3).

Results

Seed characteristics of mapping population and RILs

Jinpung, a Korean cultivar with a yellow seed coat and a tawny hilum, was crossed with IT109098, a Korean landrace with a greenish-brown seed coat and a brown hilum (Fig. 1a). Seeds of F7 and F8 RILs were used in a QTL analysis of seed coat and hilum color. The EC content of each genotype (Jinpung, IT109098, and RILs) was determined by HPLC. Jinpung seeds did not contain EC (EC content: 0.00 ± 0.00 μg/g in samples collected in both 2020 and 2021). EC was present in IT109098 seeds, and its level was lower in 2021 (EC content: 834.3 ± 285.5 μg/g) than in 2020 (EC content: 1538.9 ± 78.5 μg/g). Similarly, EC levels in the RILs were lower in 2021, with a mean EC content of 61.7 ± 181.0 μg/g (range 0.00 ± 0.00 to 906.36 μg/g), than in 2020, with a mean of 175.5 ± 441.8 μg/g (range 0.00 ± 0.00 to 1965.33 μg/g) (Table 1; Fig. 2a). The heritability (H²) of EC content remained high across both study years (2020: H² = 0.9763; 2021: H² = 0.9974; Table 1).

Table 1 (-)-Epicatechin content and hilum grading in the parental lines and RILs derived from IT109098 × Jinpung
Fig. 2

Distribution of phenotypic traits in RILs derived from Jinpung × IT109098. a (-)-Epicatechin (EC) content. b Hilum color. Histograms with different colors indicate phenotypes measured in two consecutive years (pink: 2020, green: 2021); dashed lines indicate parental phenotypes, ***P ≤ 0.0001. (Color figure online)

Hilum color of each genotype was assessed on a scale of 0 (no color) to 4 (dark brown). Seeds of F7 and F8 RILs segregated, showing a variety of hilum colors. The mean color scores of the RILs were 2.3 ± 1.2 in 2020 and 2.7 ± 1.1 in 2021 (Table 1; Fig. 1b). The parental lines exhibited significant differences in EC content and hilum color, and both traits showed bimodal distributions in their progeny (Table 1; Fig. 2).

EC content and hilum color, measured in two consecutive years, were significantly correlated, both within and between traits (P ≤ 0.001). Across the 2 years of the study, EC content showed a positive correlation of 0.88 and hilum color of 0.87. The two traits were also positively correlated with each other: 2021 EC content showed a correlation of 0.26 with hilum color in both 2020 and 2021, whereas 2020 EC content showed correlations of 0.30 and 0.31 with 2020 and 2021 hilum color, respectively (Fig. 3). Since the correlations within traits across the two consecutive years were high and significant, the data were combined for further analyses.

Fig. 3

Correlation between (-)-epicatechin content (EC) and hilum color. ***P ≤ 0.001

Genetic map construction

The Jinpung and IT109098 resequencing reads were aligned with the soybean reference genome. The mapping depths of Jinpung and IT109098 were 10× and 26×, and the reads covered 96.51% and 96.72% of the genome, respectively (Table S1). Reads from the mapping parents and population-wise GBS were used to identify 4,662,400 sequence variants. These were filtered to obtain a list of 6655 SNPs, of which 4819 differed between the parental lines (Table S2). After preprocessing and imputation, 4066 SNPs were used to construct a genetic map containing 20 linkage groups spanning 4351 cM across the soybean genome (Table S2; Fig. S1). The mean number of SNP markers on each chromosome was 203.3 and the mean distance between adjacent markers was 1.1 cM. The mean chromosome length was 217.55 cM (Table S3).
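The summary statistics of the map follow directly from the totals reported above, as the short check below shows. The variable names are illustrative. The mean marker spacing is approximated here as total map length divided by marker number, which is an assumption about how the reported value was derived.

```python
# Totals reported for the genetic map.
n_markers = 4066          # SNPs placed on the map
n_linkage_groups = 20     # one per soybean chromosome
map_length_cm = 4351.0    # total map length in cM

markers_per_chromosome = n_markers / n_linkage_groups        # mean markers per chromosome
mean_chromosome_length_cm = map_length_cm / n_linkage_groups  # mean length per chromosome
# Approximation: total length / number of markers (ignores the one-fewer
# interval per linkage group, which changes the result only slightly).
mean_spacing_cm = map_length_cm / n_markers

print(round(markers_per_chromosome, 1),
      round(mean_chromosome_length_cm, 2),
      round(mean_spacing_cm, 1))
```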

QTL analyses of EC content and hilum color

In total, five QTLs associated with EC content and hilum color that exceeded the LOD threshold of 3.37 were identified across Chr06, Chr08, and Chr19 (Table 2; Fig. 4a). Two QTLs for EC content located on Chr06 and Chr08 were named qEC06 and qEC08, respectively. qEC06 was flanked by the markers Chr06_18555043 and Chr06_18864382, and had a LOD score of 8.63 with a percentage variance explained (PVE) value of 10.50%. qEC08 was flanked by the markers Chr08_8771172 and Chr08_9214678. It had a LOD score of 12.24 and a PVE value of 15.16%. Both QTLs showed negative allelic additive effects, suggesting that the beneficial allele originated in the parental line IT109098 (Table 2).

Table 2 Quantitative trait loci (QTLs) for (-)-epicatechin content and hilum color identified by inclusive composite interval mapping of 235 RILs derived from Jinpung × IT109098
Fig. 4

Positions of quantitative trait loci (QTLs) associated with (-)-epicatechin (EC) content and hilum color. LOD scores above the threshold value of 3.3729 indicate the positions of QTL regions on the linkage map. a Positions of QTLs on soybean chromosomes 6, 8, and 19 associated with hilum color (upper chart) and (-)-epicatechin content (lower chart). b qEC08 is located close to (~ 200 kb) the I locus, which contains CHALCONE SYNTHASE (CHS) gene clusters

The analysis also identified four QTLs associated with hilum color (Table 2; Fig. 4a). The QTLs qHilum6-1 and qHilum6-2 were located on Chr06. qHilum6-1 had a LOD score of 13.96 and a PVE value of 6.69%; the LOD score and PVE value for qHilum6-2 were 6.36 and 2.80%, respectively. qHilum19-1 was located on Chr19 and had a LOD score of 11.82 and a PVE value of 5.49%. qEC08, identified as a QTL for EC content, was also a major QTL for hilum color, with a LOD score of 65.89 and a PVE value of 58.59% (Table 2). We therefore selected qEC06 and qEC08, major QTLs associated with EC content in soybean seeds, for further analysis.

The QTL qEC06 (spanning ~ 309 kb of Chr06) contained ten protein-coding genes, whereas qEC08 (spanning ~ 443 kb of Chr08) contained 50 protein-coding genes (Table 2; Table S4). The putative functions of all these genes were determined using A. thaliana orthologs (Table S4); sequence variants in the coding regions between mapping parents were identified using the resequencing data (Table S5). qEC06 contained the T Locus that encodes FLAVONOID 3′-HYDROXYLASE (Glyma.06G202300; Table S4), which is associated with the biosynthesis of cyanidin-based anthocyanins and PAs in the seed coat, as well as with pubescence color (T: tawny; t: gray) (Knovinich et al. 2011, 2012; Senda et al. 2012; Song et al. 2016; Zabala and Vodkin 2014). Annotation of the genes within qEC08, however, did not uncover orthologs of Arabidopsis genes associated with EC or PA biosynthesis (Table S4), although 18 of the 50 genes in this QTL had variations in genic regions that would result in differences in protein sequence between the parental lines (Table S5).

To narrow down the pool of candidate genes responsible for qEC08’s effects on EC content and hilum color, we obtained expression data for all 50 genes in the QTL region from the SoyBase website (Soybase.org/soyseq). We also investigated the G. max RNA-seq Atlas to produce a heatmap of gene expression in various tissues and across stages of development (Severin et al. 2010). This revealed that seven genes located in qEC08 (Glyma.08G114900; Glyma.08G115100; Glyma.08G115600; Glyma.08G115800; Glyma.08G116300; Glyma.08G116900; Glyma.08G119100) were highly expressed in seeds at all stages, but only one gene (Glyma.08G116300) showed high expression specifically in seeds (Fig. S2). Of the seven genes expressed highly in seeds, only two (Glyma.08G114900 and Glyma.08G116900) showed variation in protein sequence between the mapping parents (Table S5). We found that 24 of the 50 genes within qEC08 showed moderate expression in seeds; of those, seven genes (Glyma.08G114800; Glyma.08G115500; Glyma.08G116000; Glyma.08G116500; Glyma.08G116600; Glyma.08G117000; Glyma.08G119300) contained changes that produced variations in protein sequence between the mapping parents (Table S5, Fig. S2). The physical position of qEC08, however, was within ~ 200 kb of the I locus (Chr08_8,380,000 to Chr08_8,520,000), which harbors CHS gene clusters that act as a major switch determining the presence or absence of flavonoids (Fig. 4b) (Knovinich et al. 2011; Senda et al. 2012; Song et al. 2016; Cho et al. 2017). This raised the question of whether the I locus is responsible for the QTL identified as qEC08. We therefore used resequencing data from the mapping parents to analyze the relationship between qEC08 and the I locus, and to investigate the effect of the I locus on EC content in the RIL population.

Allelic determination of the I locus in the mapping parents

The I locus determines the accumulation of anthocyanins and PAs in the epidermal layer of the seed coat. The dominant allele I and three recessive alleles (ii > ik > i, according to dominance) regulate the spatial distribution of these pigments, and thus seed coat color may be restricted to the hilum, appear as a saddle shape, or be fully pigmented (Knovinich et al. 2011; Senda et al. 2012; Cho et al. 2017). Deletions in the promoter sequences of dominant alleles (I or ii) result in pigmented seed coats due to an increase in the total levels of CHS transcripts; this is similar to the effects of the recessive allele, i (Tuteja et al. 2004). In total, nine CHS genes were identified within the I locus based on orthology with the A. thaliana sequence (Table S5); their directionality and physical positions are shown in Fig. 5a. The sequences (including 2 kb upstream and downstream) of the CHS genes in Jinpung, which has a yellow seed coat and tawny hilum color, did not differ from those of the Williams82 reference genome, which contains the ii dominant allele (Fig. 5) (Tuteja and Vodkin 2008). Comparison of IT109098 with the reference genome revealed no changes in the protein-coding sequences of any of the nine CHS genes; however, identical sequence variants were found in the shared promoter regions (~ 2 kb) of CHS3a (Glyma.08G109300) and CHS1 (Glyma.08G109400), resulting from the inverted directionality of these genes (Fig. 5b). This suggested that changes in the promoter sequences of CHS3a and CHS1 may distinguish the dominant ii allele from the recessive i allele. The CHS gene sequences in Jinpung were identical to those of the reference genome, and thus it harbored the ii allele at the I locus. By contrast, the genome of IT109098, which has a pigmented seed coat and hilum (Fig. 1a), contained the recessive i allele, as a result of changes in its CHS promoter sequences (Fig. 5b).

Fig. 5

Sequence variation between Jinpung and IT109098 in CHALCONE SYNTHASE (CHS) gene clusters in the I locus. a Locations and orientations of CHS genes on chromosome 8 (Chr08 8,380,000–8,520,000 bp). Red indicates CHS with sequence variations between Jinpung and IT109098. b Duplicated changes in promoter sequences (2 kb 5′ UTR) of CHS3a and CHS1 between Jinpung and IT109098. (Color figure online)

Allelic stacking and EC content

To investigate whether the I locus affected the level of EC content and hilum color, we determined which alleles were present at the I locus and in qEC08 in each RIL (Fig. 6). RILs that matched Jinpung or IT109098 at the I allele were designated iiJinpung or iIT109098, respectively; likewise, RILs were designated either qEC08Jinpung or qEC08IT109098 according to which qEC08 allele they had inherited. RILs with qEC08IT109098 alleles showed higher EC content than RILs with qEC08Jinpung alleles (P ≤ 0.001), although EC content showed a wide range (0.00–1312.06 μg/g), regardless of which allele was present at the I locus. Most RILs with qEC08Jinpung alleles did not contain any EC (0.00 μg/g) and there were no significant differences in EC levels between RILs with iiJinpung and iIT109098 alleles (Fig. 6a). An analysis of the effect of allelic combinations on hilum color showed a similar trend: RILs containing qEC08IT109098 alleles had significantly higher hilum color scores than RILs with qEC08Jinpung alleles, no matter which allele was present at the I locus (Figs. 1b, 6b). These data confirmed that qEC08 was a major QTL for both EC content and hilum color.

Fig. 6

Variations in levels of (-)-epicatechin and differences in hilum color in RILs containing different allele combinations spanning the I locus and qEC08. a (-)-Epicatechin content by allele combination. b Hilum color variation by allele combination

Since an analysis of the allele combinations at two loci (I and qEC08) revealed that qEC08IT109098 determined the presence or absence of EC across the RIL population, we performed an additional allelic stacking analysis that included qEC06 to investigate the distribution of EC content across the genotypic combinations at three loci (Fig. 7a). RILs with alleles from IT109098 at both qEC06 and qEC08 showed higher levels of EC than RILs with other combinations of alleles, and genetic differences at the I allele did not have significant effects on EC content (Fig. 7a). The I locus was therefore omitted in subsequent analyses to reduce the complexity of data. Most RILs with qEC06IT109098 and qEC08IT109098 alleles had high levels of EC (mean EC content: 429.79 μg/g; range 0.00–1255.25 μg/g) and differed significantly from RILs with other allele combinations, which mostly lacked detectable EC (0.00 μg/g; Fig. 7b). This indicated that alleles of both qEC06 and qEC08 were critical for determining the presence or absence of EC as well as the levels of EC present in soybean seeds.

Fig. 7

Effect of allelic combinations at qEC06, qEC08, and the I locus on epicatechin content in RILs. a Effect of alleles at qEC06, I, and qEC08. b Effect of alleles at qEC06 and qEC08. RILs with alleles from IT109098 at both qEC06 and qEC08 contained higher levels of (-)-epicatechin content than RILs with other combinations of alleles at the three loci

Expression of genes within qEC08

After identifying the protein-coding genes within qEC08 by reference to annotations of Arabidopsis orthologs, we reanalyzed RNA sequencing data obtained previously (Ha et al. 2018) from seeds of the soybean variety Hwangkeum and of the parental line IT109098 at the R5 and R7 growth stages to identify differentially expressed genes (DEGs) within the qEC08 region. Hwangkeum has a yellow seed coat with a tawny hilum; its seeds do not contain detectable levels of EC. The genotype of Hwangkeum at the I locus was identical to that of Williams82, the soybean reference genome, and also to that of the parental line Jinpung. Genes whose expression differed between seeds of IT109098 and Hwangkeum with Log2(FC) ≤ −1 or Log2(FC) ≥ 1 at the R5 or R7 stage were defined as DEGs. Of the 50 protein-coding genes present in qEC08, seven (Glyma.08G114900, Glyma.08G115100, Glyma.08G115600, Glyma.08G115800, Glyma.08G116300, Glyma.08G118400, and Glyma.08G119300) showed high expression, relative to other genes, in seeds over the two growth stages; five other genes (Glyma.08G115200, Glyma.08G116000, Glyma.08G116500, Glyma.08G117600, and Glyma.08G119100) showed moderate expression over the two growth stages. Glyma.08G116800, Glyma.08G116900, and Glyma.08G117500 were only expressed at the R5 stage in both genotypes. In addition, three genes (Glyma.08G114800, Glyma.08G118000, and Glyma.08G118800) were only expressed at the R5 stage in IT109098 (Fig. S3a). At the R5 stage, four genes (Glyma.08G114800, Glyma.08G116900, Glyma.08G118000, and Glyma.08G118100) were upregulated, and three genes (Glyma.08G114900, Glyma.08G116000, and Glyma.08G117000) were downregulated. At the R7 stage, five genes (Glyma.08G117200, Glyma.08G118200, Glyma.08G118400, Glyma.08G118700, and Glyma.08G118900) were upregulated, and eight genes (Glyma.08G115100, Glyma.08G115600, Glyma.08G116300, Glyma.08G116800, Glyma.08G117000, Glyma.08G117300, Glyma.08G118100, and Glyma.08G118800) were downregulated (Fig. S3b). Differential expression was observed in a greater number of genes during the R7 stage than during the R5 stage. Data from the RNA-seq Atlas from SoyBase indicated that all the DEGs identified in this study were expressed in seeds (Fig. S2). Of the 20 DEGs identified in seeds at the R5 and R7 stages, six (Glyma.08G114800, Glyma.08G114900, Glyma.08G116000, Glyma.08G116800, Glyma.08G116900, and Glyma.08G117000) contained sequence variations between the mapping parents Jinpung and IT109098 in their protein-coding regions, implying that their protein sequences would also differ (Table S5).
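The DEG criterion used above (Log2(FC) ≤ −1 or ≥ 1 for IT109098 relative to Hwangkeum) can be sketched as a small classifier. The function names are hypothetical. The pseudocount added before taking the ratio is an assumption made here to avoid division by zero for unexpressed genes and is not described in the study.

```python
import math


def log2_fold_change(expr_test: float, expr_ref: float, pseudocount: float = 1.0) -> float:
    """log2 of (test + pseudocount) / (reference + pseudocount).

    Assumption: a pseudocount guards against zero expression; it is not part of
    the published method.
    """
    return math.log2((expr_test + pseudocount) / (expr_ref + pseudocount))


def classify_deg(expr_test: float, expr_ref: float,
                 threshold: float = 1.0, pseudocount: float = 1.0) -> str:
    """Label a gene 'up', 'down', or 'not DEG' for test (IT109098) vs reference (Hwangkeum)."""
    lfc = log2_fold_change(expr_test, expr_ref, pseudocount)
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "not DEG"
```

A threshold of 1 on the log2 scale corresponds to at least a two-fold difference in normalized expression between the two genotypes at a given growth stage.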

Discussion

QTLs for EC content and hilum color

GBS of a soybean RIL population and resequencing of the parental lines, Jinpung and IT109098, were performed to identify SNP markers, which were used to construct a high-density genetic linkage map and identify QTLs associated with EC content and hilum color of soybean seeds. Data were collected across two consecutive years, 2020 and 2021. EC levels in seeds in 2021 were significantly lower than in 2020 (Table 1), which suggested a high sensitivity to environmental changes such as weather conditions (Yang et al. 2019). The analysis identified two major QTLs, qEC06 and qEC08, associated with EC content; both had high LOD scores and PVE values (Table 2; Fig. 4a). qEC08 was also identified as a major QTL determining hilum color, which showed a strong positive correlation with EC content (Table 2; Fig. 3). This indicated that, in this population of RILs, EC content affected the color of the hilum in seeds (Carpentieri-Pipolo 2015; Lu et al. 2021).

Extensive studies have been conducted to identify the precise loci and genes responsible for soybean seed pigmentation, as well as for the biosynthesis of EC/PAs. At least five structural genes (CHALCONE SYNTHASE, CHS; FLAVONOID 3′-HYDROXYLASE, F3′H; FLAVONOID 3′,5′-HYDROXYLASE, F3′5′H; FLAVANONE 3-HYDROXYLASE, F3H; and ANTHOCYANIDIN REDUCTASE, ANR; these five genes are known, respectively, as the I, T, W1, Wp, and O loci) control the color of the seed coat, three of which (CHS, F3′H, and ANR) are also associated with EC biosynthesis in seeds (Knovinich et al. 2011, 2012; Senda et al. 2012; Zabala and Vodkin 2014; Song et al. 2016). Studies of the transcription factors regulating EC/PA biosynthesis in many plants suggest that MYB family transcription factors, along with members of the bHLH and WD40 families, form a ternary complex that activates or represses sets of genes acting specifically in the anthocyanin and EC/PA pathways (Dixon et al. 2013; Lu et al. 2021). The QTL qEC06 contained 10 protein-coding genes including F3′H (Table S4), but qEC08 did not contain any known loci associated with EC/PA pathways, indicating that it is a novel QTL from a different genetic background. Further analysis of qEC08 was therefore required to identify the gene(s) responsible for its effects on both hilum color and EC content in seeds.

qEC08 candidate genes

Variation between the mapping parents in the gene sequences within the genomic region covered by qEC08, combined with differences in gene expression patterns in seeds, suggested that six genes (Glyma.08G114800, Glyma.08G114900 (emb2734), Glyma.08G116000 (SCAR 3), Glyma.08G116800, Glyma.08G116900 (RD21B), and Glyma.08G117000 (MEE62)) were likely to be involved in EC biosynthesis and hilum pigmentation (Table S5; Figs. S2, S3). AT5G19820, the Arabidopsis ortholog of Glyma.08G114900, encodes an importin, karyopherin enabling the transport of the cytoplasmic HYL1 (KETCH1; also known as emb2734), which transports the double-stranded RNA-binding protein hyponastic leaves1 (HYL1) from the cytoplasm to the nucleus to constitute a functional microprocessor for miRNA processing (Zhang et al. 2017; He et al. 2022). Although direct evidence for the involvement of KETCH1 in EC biosynthesis and its regulation in seeds of Arabidopsis or soybean is lacking, a knockdown mutation of KETCH1 in Arabidopsis significantly reduced the nuclear fraction of HYL1 protein, decreasing miRNA accumulation and increasing levels of pri-miRNAs and pre-miRNAs (Zhang et al. 2017). Recent studies suggest that a few pri-miRNAs may encode small peptides that regulate the expression of target genes (Lauressergues 2015). Overexpression of pri-miR858a in Arabidopsis thaliana resulted in the accumulation of flavonoids due to changes in the expression of genes in the phenylpropanoid pathway (Sharma et al. 2020). The phenylpropanoid pathway is also known to be associated with plant disease resistance, such as resistance to powdery mildew in cucumber. Whole-genome resequencing of two cucumber lines differing in powdery mildew susceptibility identified InDels in KETCH1 and highlighted it as one of the candidate genes for powdery mildew resistance (Xu et al. 2016). Glyma.08G116900 belongs to the granulin repeat cysteine protease family and is orthologous to responsive to dehydration 21B (RD21B) in Arabidopsis. rd21 mutants in Arabidopsis and tomato showed significantly increased susceptibility to the necrotrophic fungal pathogen Botrytis cinerea, resistance to which is positively correlated with flavonoid levels (Shindo et al. 2012; Zhang et al. 2020). Exogenous application of epicatechin enhanced resistance against Botrytis cinerea in apple (Zhao et al. 2022). At the R5 growth stage, Glyma.08G114900 was downregulated and Glyma.08G116900 was upregulated in IT109098 compared to Hwangkeum, suggesting that these two are strong candidate genes for qEC08. Identification of the particular gene responsible for the effects of qEC08, however, requires further analyses, including fine mapping and transformation.

High EC content in soybeans with qEC06 and qEC08

Soybean is one of the most economically important legumes. It is grown widely across the world and has diverse end-use purposes. In 2018, about 398 million tons of soybeans were produced, a volume that required ~ 6% of the arable land in use worldwide (Shea et al. 2020). Soybean seeds are rich in oils, but also in proteins and other nutraceutical ingredients, including isoflavones, phenolic acids, pinitol, phytic acid, saponins, dietary fiber, oligosaccharide, conjugated linoleic acids, and bio-active peptides. As a result, many soy-derived products are consumed in the forms of health-oriented and functional foods, cosmetics, and pharmaceuticals (Kim et al. 2022; Yun et al. 2016). The levels of these beneficial constituents differ, however, among soybean varieties because of differences in their genetic backgrounds (Clarke et al. 2013; Ha et al. 2018; Bi et al. 2022). Consistent with this, we found differences in the levels of EC between the mapping parents Jinpung and IT109098 (Table 1). Marker-assisted selection can effectively reduce the time and cost involved in breeding programs to develop soybean varieties enhanced in traits of interest (Sebastian et al. 2010). The QTLs, qEC06 and qEC08, identified in this study therefore provide valuable information that will enable the development of soybean varieties with enhanced EC content via targeted breeding programs. This will accelerate the development of high-quality soy products to meet consumer demand and expand market opportunities in the fields of food, cosmetics, and pharmaceuticals.