Introduction

Plants provide approximately 70% of the cooking oil and 50% of the dietary protein for humans, with soybean playing a significant role (Duan et al. 2023). Moreover, nitrogen fixation in soybean roots reduces the need for fertilizers, making soybean a valuable crop for sustainable agricultural production. Therefore, it is of great importance to increase soybean yield and improve seed quality. Soybean yield is influenced by seed weight, which is usually positively associated with seed size (Li et al. 2018; Liu et al. 2020b; Hu et al. 2023a). Cultivated soybean seeds are composed of ~20% oil and ~40% protein, and these two traits largely determine soybean seed quality (Lu et al. 2021a; Goettel et al. 2022). Seed size/weight, oil content and protein content are coordinately regulated by genetic factors and environmental signals (Duan et al. 2023). Increasing seed weight, oil content and protein content are important breeding goals. To date, a series of key factors controlling these traits have been identified in soybean, offering valuable targets for molecular breeding design.

Many crops have been domesticated from wild plant species (Doebley et al. 2006). A number of morphological and physiological changes between crops and their progenitors appeared during this process, and this phenomenon is known as the ‘domestication syndrome’ (Gaut et al. 2018). Cultivated soybean was domesticated from wild soybean (Glycine soja) in China approximately 6000–9000 years ago (Kim et al. 2010). Subsequently, cultivated soybean spread to other parts of East Asia and was later introduced to North America in the 1760s (Zhou et al. 2015). Soybean is now planted worldwide. During domestication, improvement and regional breeding, soybean genome diversity decreased significantly. Approximately 50% of the soybean genetic diversity was lost during the transition from wild soybean (π = 2.94 × 10⁻³) to landraces (π = 1.40 × 10⁻³) (Zhou et al. 2015). A small number of Asian landraces were introduced to North America and became the genetic base of the North American cultivars (Hyten et al. 2006). Understanding soybean genomic variation is therefore beneficial to future molecular breeding and soybean improvement.
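The "approximately 50%" figure follows directly from the two nucleotide diversity (π) values quoted above. The short sketch below reproduces that arithmetic; the function name and constants are illustrative only and are not taken from Zhou et al. (2015).

```python
# Nucleotide diversity (pi) values as quoted in the text (Zhou et al. 2015).
PI_WILD = 2.94e-3      # wild soybean (Glycine soja)
PI_LANDRACE = 1.40e-3  # soybean landraces


def fraction_diversity_lost(pi_before: float, pi_after: float) -> float:
    """Return the relative reduction in nucleotide diversity (hypothetical helper)."""
    if pi_before <= 0:
        raise ValueError("pi_before must be positive")
    return 1.0 - pi_after / pi_before


loss = fraction_diversity_lost(PI_WILD, PI_LANDRACE)
print(f"Diversity lost from wild soybean to landraces: {loss:.1%}")
```

Under these values the relative loss is 1 − 1.40/2.94, or about 52%, which is consistent with the rounded figure given in the text.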

Many soybean traits, such as stem growth habit, seed dormancy characteristics, flowering time and stress tolerance, changed significantly during the domestication process. Some domestication-related genes have been identified in recent years. Dt1 encodes a homolog of phosphatidylethanolamine-binding protein, and single-nucleotide substitutions in this gene, as a result of selective breeding, have influenced stem growth habits (Liu et al. 2010; Tian et al. 2010). The stay-green G gene was identified as a regulator of seed dormancy and has a conserved function in other species (Wang et al. 2018). GmPRR3bH6 represses the expression of GmCCA1a, which acts as a transcriptional activator for J, and overexpression of GmPRR3bH6 delays soybean flowering in natural, long-day (LD) and short-day (SD) conditions (Lu et al. 2017a; Li et al. 2020; Wang et al. 2020b). HSFB2b improves salt tolerance by activating the flavonoid biosynthesis pathway and shows evidence of selection over the course of soybean domestication (Bian et al. 2020). In addition, between wild and cultivated soybean, there have been other noticeable changes in seed traits, such as the shift from small seeds to large seeds, from low to high oil content and from high to low protein content. Interestingly, these traits are often highly correlated in soybean seeds. The protein concentration often shows a negative correlation with seed yield and oil concentration (Bandillo et al. 2015; Zhang et al. 2022). Thus, gaining deeper insights into the regulatory networks and multiple functions of relevant genes is expected to be beneficial to soybean breeding practices.

The release of the soybean Williams 82 cultivar reference genome in 2010 opened the door for studies of soybean functional genomics (Schmutz et al. 2010). The sequencing of the undomesticated soybean IT182932 then provided detailed information on genetic variation between wild soybean and cultivated soybean (Kim et al. 2010). With the development of next-generation sequencing technology, more soybean reference genomes have been made available, including those of the cultivars Zhonghuang13 and Jindou17 and the wild soybean W05 (Shen et al. 2018a, 2019; Xie et al. 2019; Yi et al. 2022). In addition, the pangenomes of wild and cultivated soybeans have been constructed, launching a new era of evolutionary and functional genomics studies (Li et al. 2014; Liu et al. 2020c). The collection and analysis of 1298 transcriptome samples has provided a comprehensive view of soybean gene expression (Machado et al. 2020). Systematic analysis of the epigenome elucidated the relationship between DNA methylation and soybean genetic variation (Shen et al. 2018b). Other omics research, such as proteomics and metabolomics, has also made progress in recent years (Komatsu et al. 2017; Silva et al. 2021), producing data that have facilitated the identification of soybean regulatory genes.

Here, we summarize the soybean functional genes and genetic pathways that regulate soybean seed size, oil content and protein content, with a focus on gene pleiotropism, and discuss the challenges and prospects for future soybean studies.

Control of seed weight and size in soybean

Seed weight demonstrates broad variation in the plant kingdom but exhibits narrow variation within species (Westoby et al. 1996). Seed weight and size play important roles not only in plant fitness and adaptation, but also in yield determination. On the one hand, seed weight is correlated with a series of plant adaptabilities to the environment, such as dispersal mode, plant height, leaf area and stress tolerance (Westoby et al. 1996; Moles et al. 2005). On the other hand, seed weight is correlated with seed number, and both influence seed yield (Liu et al. 2020b).

In angiosperms, seed development is an essential process in the life cycle. This process occurs through double fertilization, which is a unique characteristic of flowering plants. During double fertilization, one sperm cell fuses with the egg cell to generate the diploid zygote, and the diploid central cell is fertilized by the other sperm cell to form the triploid endosperm (Goldberg et al. 1989). The fertilized zygote develops into an embryo during embryogenesis, and the endosperm facilitates seed germination by delivering nutrients to the embryo (Chaudhury et al. 2001). The seed coat, which comes from sporophytic integuments, is also an important component of mature seeds (Li et al. 2019). It offers support and protection for the embryo (Figueiredo and Kohler 2014). Thus, seed development is determined coordinately by zygotic and maternal tissues in addition to environmental signals (Li and Li 2015). In recent years, several signaling pathways that influence seed size, including the ubiquitin–proteasome pathway, G-protein signaling, mitogen-activated protein kinase signaling, phytohormone pathway and transcriptional regulator pathway, have been identified (Li et al. 2019).

In rice, the spikelet hull encloses the embryo and endosperm and determines the final grain size, predominantly by setting the storage capacity and limiting grain growth during the rice development process (Li et al. 2018). The volume of the spikelet hull is regulated by the maternal genotype, and the grain weight is determined maternally under this condition (Li et al. 2019). The rice otubain-like protease WTG1 and RING-type protein GW2 negatively regulate grain weight by affecting spikelet hulls (Song et al. 2007; Huang et al. 2017). OsMKKK10-OsMKK4-OsMAPK6 acts in a module to enhance grain size by promoting cell proliferation (Duan et al. 2014; Liu et al. 2015; Xu et al. 2018). Soybean seed development can be divided into three stages, namely, morphogenesis, maturation and desiccation (Le et al. 2007; Nguyen et al. 2016). The process is accompanied by pod development, which provides a space for the seeds. The soybean endosperm is absorbed by the embryo and disappears during seed development (Goldberg et al. 1989). In mature soybean seeds, cotyledons occupy most parts of the embryo and influence seed size. The seed coat surrounds the embryo and provides nutrition to support seed development (Déjardin et al. 1997). Hence, soybean seed weight/size is controlled coordinately by maternal and zygotic tissues. Some regulatory genes controlling seed size have been identified by studying pod development, such as SoyWRKY15a, while more genes, such as GA20OX, GmCYP78A5 and GmJAZ3, have been found by analyzing RNA-seq data from various seed developmental stages (Lu et al. 2016; Du et al. 2017; Gu et al. 2017; Hu et al. 2023a).

In the past decade, great progress has been made in elucidating soybean seed size regulation (Table 1). Overexpression of GmCYP78A72 was shown to increase seed size, both in Arabidopsis and in soybean, whereas silencing of this P450 gene in soybean does not cause obvious phenotypic changes (Zhao et al. 2016). Further silencing of the other two GmCYP78A genes resulted in small soybean seeds, indicating functional redundancy within this gene family. Upregulation of GmCYP78A5, another cytochrome P450 family gene, also increased seed size and weight (Du et al. 2017). GmCIF1 encodes a cell wall invertase inhibitor, and RNAi of GmCIF1 increases cell wall invertase activities, leading to heavier seeds (Tang et al. 2017). GmSWEET10a is located in a selective sweep region and regulates seed size by transporting sucrose and hexose (Wang et al. 2020a). Its homolog GmSWEET10b exhibits a similar function in increasing seed weight. Downregulation of GmBS1 and GmBS2 results in significantly increased size and weight of plant organs, such as seeds, pods and leaves (Ge et al. 2016). The expression of the GRF5, GIF1, CYCD3;3 and HISTONE4 genes was also shown to be increased in these GmBS-silenced soybean plants compared with the Williams 82 plants. Soybean GmFULa acts as a transcription factor and promotes biomass accumulation and 100-seed weight (Yue et al. 2021). Through coexpression network analysis, GmJAZ3 was identified as a seed development regulator, and the GmJAZ3-GmRR18a-GmMYC2a-GmCKXs module increases seed size by orchestrating jasmonate and cytokinin signaling (Hu et al. 2023a). The GmJAZ3 Hap3 promoter was selected from wild soybeans and has been fixed in cultivars. The GmCOL2b-transgenic soybean was found to produce larger seeds than the wild type (WT) under both SD and LD conditions, whereas knockout of GmCOL2b produced smaller seeds only under SD conditions (Yu et al. 2023). Compared with mock-inoculated and vector-infected soybeans, soybeans in which three GmFAD3 genes (GmFAD3A, GmFAD3B and GmFAD3C) were silenced produced both larger and heavier seeds (Singh et al. 2011). In another study, suppressing the expression of four TAG lipases did not alter seed number but did increase seed weight, thereby promoting seed yield in soybean (Kanai et al. 2019). Compared with the WT, overexpression of GmOLEO1 decreased seed weight but increased pod number, which ultimately enhanced seed yield (Zhang et al. 2019). GmPLATZ was identified by analyzing the transcriptomes of developing seeds, and was shown to directly activate the expression of cyclin genes and GmGA20OX to enhance soybean seed size and weight (Hu et al. 2023b).

Table 1 Regulatory genes for soybean seed traits

Owing to the development of sequencing technology and the improvement of the reference genome, a growing number of functional soybean genes have been identified through forward genetic methods. The PP2C-1 allele from wild soybean increases seed size by enhancing cell size, and PP2C-1 was shown to influence brassinosteroid (BR) signaling by interacting with and dephosphorylating GmBZR1 (Lu et al. 2017b). GmPDAT has also been identified as a seed size regulator by genome-wide association studies (GWAS), and subsequent research revealed that overexpression of this gene increased seed size, whereas knockdown of the gene by RNAi decreased seed size (Liu et al. 2020a). Compared with the WT, the K83 mutant was shown to have larger and heavier seeds, and GmKIX8-1 was identified as the causative gene (Nguyen et al. 2020). Mutant plants created by CRISPR showed elevated expression of GmCYCD3;1–10 and increased cell proliferation. The Gmdtm1-1 and Gmdtm1-2 mutants have shorter trichomes, reduced plant height and smaller seeds, and GmNAP1 was identified as the causative locus for these phenotypes (Tang et al. 2020).

GmST05 was identified by GWAS from over 1800 soybean accessions and was shown to positively regulate seed thickness, length and width (Duan et al. 2022). Promoter variations of GmST05 could influence the expression of this gene in pods and seeds and thereby alter final seed weight. Through GWAS and quantitative trait locus (QTL) analyses, the GmMFT gene was also identified (Cai et al. 2023). Overexpression of GmMFT increased seed weight, whereas its mutants exhibited decreased seed weight. A 321-bp transposable element insertion in the CCT domain of POWR1 was also shown to influence POWR1 protein localization and to increase soybean seed weight (Goettel et al. 2022). By analyzing sequencing data from 184 recombinant inbred lines (RILs) derived from KF No. 1 and NN 1138-2 and from 211 soybean accessions, GmGA3ox1 was identified as a seed weight regulator (Hu et al. 2022). Knockout of GmGA3ox1 increased the net photosynthesis rate and reduced seed weight by decreasing the gibberellin (GA) content. ST1 is a domestication gene that controls seed size by regulating pectin biosynthesis (Li et al. 2022). Dt2 encodes a MADS-box transcription factor, and compared with the DN50 control, Dt2 knockout mutant plants exhibit more branches, as well as larger and heavier seeds (Liang et al. 2022).

In another study, the expression of PG031 was induced in the seed coat and radicle during germination, and PG031-overexpressing transgenic lines produced smaller and lighter seeds than the WT (Wang et al. 2022a). The sss1 mutant also had altered 100-seed weight as a result of changes in both cell area and cell number (Zhu et al. 2022). This mutation was mapped to the GmSSS1 gene, which encodes a putative O-GlcNAc transferase that regulates the expression of the GmGAI1 and GmBZR1 genes. SW16.1 was identified from a population derived from NN 1138-2 and the wild soybean chromosome segment substitution line CSSL3068 and encodes a LIM domain-containing protein (Chen et al. 2023). Interestingly, the GsSW16.1 allele from wild soybean negatively regulates seed weight, whereas the GmSW16.1 allele from cultivated soybean positively regulates seed weight. Combining the results from linkage analysis and GWAS, GmFtsH25 was identified as a regulator of photosynthesis (Wang et al. 2023). Overexpression of this filamentation temperature-sensitive protein H protease increases soybean branch number, pod number and 100-seed weight. Transgenic soybean plants overexpressing hsw were able to produce larger and heavier seeds (Wei et al. 2023a).

Control of seed oil content in soybean

Soybean fatty acids (FAs) accumulate over a relatively short period (4–6 weeks) and are generally stored in the cotyledons (Nguyen et al. 2016). Plant FAs in seeds are mainly stored in the form of triacylglycerols (TAGs), and their biosynthesis requires a carbohydrate flux, such as sucrose from photoautotrophic tissues (Song et al. 2013). In Arabidopsis, disruption of AtSUC5, a sucrose transporter gene, resulted in a decreased FA concentration in seeds (Baud et al. 2005). Lipid accumulation is initiated by the de novo synthesis of FAs from acetyl-CoA, which is generated from pyruvate, the product of glycolysis. Acetyl-CoA carboxylase is the rate-limiting enzyme of FA synthesis and catalyzes the formation of malonyl-CoA from acetyl-CoA in an ATP-dependent step (Harwood 1988; Sasaki and Nagano 2004). The FA synthase complex then catalyzes elongation reactions in plastids (Ohlrogge and Browse 1995; Rawsthorne 2002). FA products are then exported to the endoplasmic reticulum to form TAGs. Many studies have explored this process, and the ‘Kennedy pathway’ has been well studied. In this pathway, FA products are processed sequentially by glycerol-3-phosphate acyltransferase, lysophosphatidic acid acyltransferase, phosphatidic acid phosphatase and diacylglycerol acyltransferase (Kennedy 1961; Settlage et al. 1998). The synthesized TAGs are stored in oil bodies, which are surrounded by a phospholipid monolayer embedded with abundant proteins (Tzen et al. 1993; Napier et al. 1996). Most of these proteins are oleosins, although caleosins and steroleosins have also been identified (Jolivet et al. 2004).

Soybean oil is mainly composed of five FAs, namely, palmitic acid (16:0), stearic acid (18:0), oleic acid (18:1), linoleic acid (18:2) and linolenic acid (18:3). Palmitic acid and stearic acid are saturated FAs that together account for approximately 17% of the total FAs in soybean (Demorest et al. 2016). FAT encodes an acyl-acyl carrier protein thioesterase, and knockout of GmFATB1a or GmFATB1b reduced the content of these two saturated FAs (Ma et al. 2021). Combining mutations in GmFATB1a and GmFATB1b can result in a more pronounced decrease in palmitic acid and stearic acid in soybean. Compared with the Forrest control, mutations in soybean GmSACPD-A, GmSACPD-B and GmSACPD-D were shown to increase the stearic acid content (Lakhssassi et al. 2020). FAD2 catalyzes the conversion of oleic acid to linoleic acid, and two copies (FAD2-1 and FAD2-2) exist in soybean (Okuley et al. 1994; Lakhssassi et al. 2017).

The targeted mutagenesis of GmFAD2-1A and GmFAD2-1B using TALENs dramatically altered the FA profile (Haun et al. 2014). Specifically, homozygous mutants had increased oleic acid content but decreased linoleic acid content. The mutation of GmFAD2-1A and GmFAD2-2A using CRISPR–Cas9 also elevated the oleic acid content and significantly reduced the linoleic acid content (Wu et al. 2020). The double mutant of these two GmFADs produced a more marked effect than either single mutant. In another study, the WT contained 17.3% oleic acid and 59.5% linoleic acid, whereas the highest oleic acid level (65.9%) and the lowest linoleic acid level (16.1%) were detected in the GmFAD2-2 mutants (Al Amin et al. 2019). FAD3 is the key enzyme that catalyzes the formation of linolenic acid from linoleic acid (Yadav et al. 1993; Demorest et al. 2016). Silencing of three soybean FAD3 genes increased the linoleic acid content but decreased the linolenic acid content (Flores et al. 2008). Compared with the fad2-1a fad2-1b soybean, the fad2-1a fad2-1b fad3a mutant plants produced lower linolenic acid levels (Demorest et al. 2016). Similarly, the triple mutant carrying null alleles of GmFAD3-1a, GmFAD3-1b and GmFAD3-2a was shown to produce lower linolenic acid levels than the double mutant of GmFAD3-1a and GmFAD3-1b (Hoshino et al. 2014). In addition, knockout of two GmPDCTs by CRISPR–Cas9 elevated monounsaturated FAs but reduced polyunsaturated FAs in soybean seeds without influencing plant growth (Li et al. 2023).

Manipulation of related enzymes in lipid metabolism provided novel insight into the regulation of total oil content (Table 1). SDP1 catalyzes TAG degradation for postgermination growth, and a combination of mutations within SDP1 genes contributes to FA accumulation in soybean seeds (Graham 2008; Kanai et al. 2019). PDAT is involved in plant TAG assembly and plays important roles in seed development (Zhang et al. 2009; Fan et al. 2013). In Williams 82, overexpression of GmPDAT positively regulated oil content, whereas RNAi of GmPDAT could negatively regulate oil content (Liu et al. 2020a). DGAT is responsible for the final step of the ‘Kennedy pathway’ and transfers an acyl group from acyl-CoA to diacylglycerol (Bates et al. 2013; Jing et al. 2021). Both GmDGAT2A- and GmDGAT1-2-overexpressing soybean plants were shown to exhibit increased total oil content (Jing et al. 2021; Xu et al. 2021). GmGPDHp1 contributes to the formation of glycerol-3-phosphate, and more oil bodies accumulate in GmGPDHp1-transgenic soybean seeds than in WT seeds (Zhao et al. 2021).

To date, several studies have revealed regulatory genes involved in controlling total oil content in soybean, and these transcriptional regulators play essential roles in this process (Table 1). Overexpression of GmWRI1a, an AP2/EREBP family transcription factor, under the control of the 35S promoter significantly enhanced seed oil content (Wang et al. 2022b). A transcriptome analysis indicated that the carbohydrate and lipid metabolism pathways were enriched. Another study showed that overexpression of GmWRI1a, driven by a seed-specific promoter, also increased FA content (Chen et al. 2018). GmWRI1a participates in various steps of lipid accumulation by binding to the AW-box of regulatory genes. Upregulating the expression of GmWRI1b, the homolog of GmWRI1a, results in an increase in oil and seed production (Guo et al. 2020). The B3 domain transcription factor GmLEC2 was shown to enhance the TAG content and to activate the expression of lipid biosynthesis genes (Manan et al. 2017). GmLEC1 is a central regulator of seed development and is also involved in lipid storage (Jo et al. 2020).

While most of the soybean genes described above are homologs of genes first studied in Arabidopsis, several transcription factor genes were first identified in soybean. GmDof4 and GmDof11 positively regulate seed FAs by activating the accD and long-chain acyl-CoA synthetase genes, respectively (Wang et al. 2007). GmbZIP123 is more abundant during the lipid accumulation stage and enhances oil content by promoting the expression of the SUC and cwINV genes (Song et al. 2013). GL2 downregulates oil content by directly influencing PLDα1 expression, and the MYB transcription factor GmMYB73 associates with GL3 and EGL3 to suppress GL2 expression (Liu et al. 2014). Overexpression of GmDREBL promoted plant lipid accumulation and activated WRI1 expression through binding to its promoter (Zhang et al. 2016).

Through the construction of cultivar-specific gene coexpression networks, GmNFYA was identified as a hub gene, and its overexpression increased FA content (Lu et al. 2016). GmZF351 was also identified from cultivar-specific gene coexpression networks and increased oil content in transgenic Arabidopsis and soybean (Li et al. 2017). It was shown to directly promote the expression of WRI1 and WRI1-regulated genes, and has been selected for over the course of soybean domestication. GmZF392 interacts with GmZF351 to synergistically activate gene expression in the oil biosynthesis pathway, enhancing the oil content in soybean (Lu et al. 2021a). GmNFYA promotes the expression of the GmZF351 and GmZF392 genes to regulate soybean seed oil. In addition to affecting seed size, the transcriptional repressor GmJAZ3 decreased seed oil content, likely by influencing sugar transportation (Hu et al. 2023a).

Other genes were also shown to be involved in oil regulation. For example, B1 can suppress expression of the GmWRI1a, GmLEC1a, GmLEC1b and GmABI3b genes in the soybean pod endocarp, which leads to a reduction in oil content (Zhang et al. 2018a). GmOLEO1 was identified through QTL and GWAS analyses for oil content; it localizes to oil bodies and increased lipid content in transgenic soybean seeds (Zhang et al. 2019). Both GmSWEET10a and GmSWEET10b were shown to enhance FA content by influencing seed sugar allocation (Wang et al. 2020a). The expression of GmSWEET39 is positively correlated with oil content in soybean (Miao et al. 2020). GmST05/GmMFT enhances seed oil content by activating the expression of GmSWEET genes (Duan et al. 2022; Cai et al. 2023). POWR1-TE can decrease lipid content by affecting a series of oil metabolism-related genes, such as OLEO1, WRI1a and ABI5 (Goettel et al. 2022). Finally, it is worth noting that many genes pleiotropically affect seed weight and lipid content (Fig. 1).

Fig. 1

Schematic overview of seed trait control in soybean. The blue arrows indicate activation, and the black T-ended arrows indicate inhibition

Multiple functions of genes in seed trait regulation

Soybean protein accumulates mainly in the maturation stage of seed development (Le et al. 2007). Storage protein in soybean seeds is composed of 2S, 7S, 11S and 15S proteins, classified according to their sedimentation properties (Kinsella 1979). β-Conglycinin (7S) and glycinin (11S) are the major components of soybean protein, accounting for 70–80% of total protein (Singh et al. 2015). The contents of these two proteins are negatively correlated in mature seeds. Glycinin consists of six subunits, each of which is composed of an acidic chain and a basic chain linked by disulfide bonds (Catsimpoolas 1969; Staswick et al. 1981). The subunits are divided into two groups based on physical properties, such as molecular weight and methionine content (Nielsen 1985). Group I contains the G1, G2 and G3 subunits, and group II includes the G4 and G5 subunits. Sequence similarities range from 80 to 90% within the groups, but there is less than 50% similarity between the two groups (Nielsen et al. 1989). Other genes encoding glycinin subunits have also been identified in soybean (Beilinson et al. 2002). β-Conglycinin consists of three subunits, α, α’ and β, with isoelectric points of 4.90, 5.18 and 5.66–6.00, respectively (Thanh and Shibasaki 1977; Tsubokura et al. 2012). At least 15 genes have been identified as encoding β-conglycinin and have been shown to cluster into two groups based on mRNA length (Harada et al. 1989). The 11S proteins may serve only as storage nutrients, whereas the 7S proteins have shown potential roles in regulating plant development (Komatsu and Hirano 1991; Singh et al. 2015). Although the composition of the storage proteins is relatively well understood, the regulation of these components requires further investigation.

In soybean, many genes pleiotropically regulate seed traits, and recent studies have mainly focused on seed weight/size, oil content and protein content (Fig. 1). Some regulatory genes were shown to inversely influence FA and protein accumulation. GmOLEO1, GmSWEET10s and GmST05/GmMFT positively regulate oil content, but negatively regulate protein content (Zhang et al. 2019; Wang et al. 2020a; Duan et al. 2022; Cai et al. 2023). In contrast, GmSDP1s, POWR1-TE and GmJAZ3 decrease the oil content, but increase the protein content in mature seeds (Kanai et al. 2019; Goettel et al. 2022; Hu et al. 2023a). Carbon resources are fundamental substrates for both oil and protein synthesis. Oil bodies and protein bodies exist simultaneously in soybean cells (Nguyen et al. 2016). Thus, these two nutrients influence each other by competing for a relatively constant pool of precursors and for space. Considering that both oil and protein are components of soybean seeds, altering the content of a single nutrient would affect the total seed weight. Many genes controlling oil and/or protein content have been shown to influence seed weight (Fig. 1). It is interesting to note that soybean regulatory genes often affect seed oil content and seed weight positively, while influencing protein content negatively, and vice versa. This phenomenon is in line with the seed trait changes that occurred during domestication from wild to cultivated soybeans. The identification of some special genes, such as GmOLEO1 and GmJAZ3, has provided valuable targets for breeding soybean cultivars with larger seeds and higher protein content (Zhang et al. 2019; Hu et al. 2023a).

Abiotic stress tolerance and seed traits have also been reported to be correlated. Abundant plant nutrients contribute to better survival in hazardous environments. For instance, lipid composition and content influence membrane stability and further regulate stress tolerance (Mikami and Murata 2003; Shi et al. 2008). Overexpression of GmFAD3A increases drought and salinity resistance, and silencing of this gene in soybean caused greater sensitivity to drought and salinity stresses than was observed in the WT (Singh et al. 2022). The lipid regulator GmZF351 enhances stress resistance by directly activating the expression of the GmCIPK9 and GmSnRK genes (Wei et al. 2023b). GmZF351-overexpressing soybeans produced a lower total FA content under salt stress conditions than under normal conditions. GmNFYA could promote salt-responsive and oil-related gene expression in soybean (Lu et al. 2021a, b). Soybean seed oil, protein and other nutrients are also affected in response to abiotic stress (Wang and Frei 2011; Wijewardana et al. 2019; Ezzati Lotfabadi et al. 2022). An unfavorable environment usually causes accelerated leaf senescence and shortened seed development in plants (Black et al. 2000; Dupont and Altenbach 2003). Hence, the alteration of seed traits may be due to effects on leaf photosynthesis and seed filling. However, other factors cannot be discounted.

Conclusions and future perspectives

Soybean seed traits are a combination of quantitative traits that are controlled pleiotropically by multiple genes. Although many genes have been verified and their molecular details elucidated in Arabidopsis and rice, whether these genes function in a similar manner in soybean remains largely unknown. The complex soybean genome has been a main factor hindering the identification of functional genes. The size of the soybean genome is more than 1 Gb, and the majority of genes occur in multiple copies due to genome duplications (Schmutz et al. 2010; Shen et al. 2018a). As a diploidized tetraploid plant, soybean has significant functional redundancy, and knocking out a single regulatory gene usually results in no obvious alteration in phenotype, whereas overexpressing the corresponding gene may influence phenotypes significantly (Hu et al. 2023a; Wang et al. 2023). This functional redundancy may arise because different genes produce the same metabolic outcome or possess overlapping molecular functions (Zhang et al. 2018b). The development of new technologies may help to overcome this redundancy among gene paralogs. For example, the transportome-scale artificial microRNA approach, which was designed to target a specific group of redundant members, has successfully identified ABCG transporters with functional redundancy in Arabidopsis (Zhang et al. 2021).

In addition, the soybean regulatory networks are not fully understood, perhaps owing to the limited number of genes whose functions have been confirmed in soybean by transgenic approaches. Seed traits vary in plants, even though the signaling pathways may be relatively conserved. Overexpression of several soybean genes, such as GmJAZ3, GmZF351 and GmZF392, produces similar phenotypes in Arabidopsis and soybean, indicating that the corresponding genes may function similarly in different plant species (Li et al. 2017; Lu et al. 2021a; Hu et al. 2023a). Meanwhile, ectopic expression of Sesamum indicum SiDGAT1 and Arabidopsis AtSINA2 in soybean was also shown to increase oil content (Wang et al. 2014; Yang et al. 2023). Additionally, some orthologs of the same gene family play similar roles in soybean and other plants. The CYP78A gene is highly conserved in different plants, and its overexpression positively regulates seed size in Arabidopsis, rice, wheat and soybean (Fang et al. 2012; Ma et al. 2015; Zhao et al. 2016; Maeda et al. 2019). GmJAZ3 and GmPLATZ have been identified as seed weight regulators in soybean, and their orthologs in Arabidopsis and rice have similar functions in regulating seed/grain development (Hu et al. 2023a, b). Hence, studies in other plant species can also help to elucidate the genetic regulatory networks for soybean seed traits.

Although some soybean functional genes have been identified, their application in breeding remains challenging. Knocking out negative regulators with CRISPR–Cas9 has proven to be an effective strategy for soybean molecular breeding to achieve favorable seed traits. For positive regulators, novel soybean germplasms may be created by elevating regulator gene expression through editing of the promoter regions and by increasing regulator mRNA stability through editing of the UTR regions. The alteration of active sites in the regulatory proteins may also serve as a beneficial approach for functional studies.

Overall, even though a number of genes have been identified to affect seed traits, we still only have very limited knowledge about their modules and pathways (Fig. 1). Whether and how these genes form networks for efficient regulation requires further investigation (Fig. 1). Creating novel soybean germplasms with higher seed weight and yield, as well as with desirable nutrient compositions, is of great importance and should be pursued long term. Regarding the pleiotropic effects related to soybean seed traits, the identification of more genes in the regulatory pathways may enable these traits to be uncoupled and in turn may accelerate soybean breeding programs to produce plants with desirable traits. Further gene functional studies are expected to advance our understanding of the soybean signaling pathways for seed trait regulation and to provide molecular tools for targeted soybean breeding.