Background

Soybean (Glycine max L.) is a major leguminous seed crop providing an important source of oil, and ranks first in oil production among the major oil seed crops [1]. In addition to its usage for human consumption, given its high content of essential fatty acids such as linoleic acid, soybean oil is an important renewable resource for chemical and biofuel production [2]. Increasing the relative oil concentration in soybean seeds is many researchers’ goal, using breeding methods and biotechnological strategies [35]. The key governing genes of seed oil biosynthesis in higher plants are those involved in the synthesis pathway of triacylglycerol (TAG), plastid fatty acids, endomembrane lipids and the storage process, which constitute more than 274 genes in soybean [6]. Several functional essays have shown that DGAT1 and DGAT2 have roles in seed oil accumulation [79]. Mutant alleles of FAD2-1A and FAD2-1B were combined to create soybeans with the high oleic acid trait [10]. Some important transcription factors (TF) regulating lipid metabolism and seed maturation have been reported, including basic leucine zipper (bZIP) and DNA binding with one finger (DOF) TF families [11, 12]. However, most of these genes’ functions are not clear and the biosynthesis of soybean oil has yet to be elucidated.

With the availability of the soybean whole-genome sequence and large-scale application of high-throughput sequencing technology, research on soybean has made great progress. Numerous high-throughput data including genomics, transcriptomics, proteomics and metabolomics data are available for soybean. These data can provide valuable insights and improve soybeans if integrated and analyzed in a novel and comprehensive way. One of these ways is information mining from transcriptome data, given the overwhelming amount of such transcriptome data. The large amount of microarray expression data and deep sequencing transcriptomic data have allowed increasingly credible methods to be developed for generic networks, such as graphical Gaussian model networks through partial correlation [13], reverse engineering cellular networks using the ARACNE algorithm [14], networks based on improved PCC methods [15] and Bayesian networks [16]. In order to overcome the disadvantage of some functionally related genes with low PCC, a new variable ‘mutual rank’ or MR has been introduced [17, 18], and co-expression networks for Arabidopsis and rice have been generated [1921].

Some platforms and web services have been built, (e.g. SoyBase [22], SoyGD [23] and SGMD [24]), and other databases are shown in Table 1. These databases contain diverse information, such as genomic data, expressed sequence tags [24] and microarray expression data. Soybean transcription factors, transposable elements and partial ‘omics’ data are also integrated. Meanwhile, some useful tools have been developed, for instance, gene family browsing, BLAST searching and the gene pathway viewer–all providing good support for soybean research. However, the work on information mining and deep analysis for high-throughput transcriptome data including microarray and deep sequencing data is defective compared to the integration and simple analysis in previous studies in Arabidopsis[25, 26]. As a useful tool, co-expression network was previously built [27], but it is also imperfect due to limitations on the types of microarray expression experiments used in its web services. In addition, although information concerning soybean metabolic pathways can be found in the Plant Metabolic Network (PMN) [28], a complete analysis platform of gene function annotation for the genes involved in the soybean acyl-lipid pathway is still lacking. Therefore, it is necessary to develop a powerful functional mining tool for soybean omics data and use it to predict candidate genes for molecular breeding related to oil.

Table 1 Summary of published soybean databases

Driven by this need, we developed a comprehensive database, the Soybean Functional Genomics Database (SFGD), which provides an integration and analysis platform for soybean ‘omics’ data and is a one-stop-shop resource for soybean acyl-lipid metabolism researchers. We hope it will improve the accuracy and robustness of soybean functional genomics analysis, and further speed up research on soybean yield and quality. This database contains 221 enzymes and more than 1550 genes involved in 15 soybean acyl-lipid metabolic pathways by combining data from the PMN [28] and prediction using Arabidopsis lipid-related genes. It also integrates genome and transcriptome data, e.g. coding genes, full-length cDNA and miRNA sequences, microarray expression experiments, deep sequencing experiments and single nucleotide polymorphisms (SNPs). In addition, a co-expression regulatory network including 23,267 soybean genes represented by 37,593 probe sets (the numbers do not match as one gene can be represented by one or more probe sets) with available microarray expression experiments (255 samples from 14 experiments) and 1873 miRNA–gene pairs in the network is included. There are some other function modules, e.g. a cis-element significance analysis toolbox for gene promoter sequences, an expression pattern search for microarray expression data so users can compare gene expression differences in various growth periods or under diverse stresses. There are also some general functions, such as browse, search and download of data. The platform is freely available at http://bioinformatics.cau.edu.cn/SFGD/.

Construction and content

Data sources

Soybean coding gene data and their annotations were downloaded from JGI (Joint Genome Institute) [6], which mainly included whole-genome sequences, coding gene sequences, protein sequences, gene location information and annotations. Full-length cDNA sequences were downloaded from the Soybean Full-Length cDNA Database [29]. We collected soybean miRNA and their precursor sequences from recently published literature and other repositories [3845], and then mapped these miRNA precursor sequences against the soybean whole-genome sequence using the BLAST program (BLAST-2.2.19). Soybean probe set consensus sequences were downloaded from Affymetrix [46] and their annotations downloaded from B2G-FAR [47]. Microarray expression experiment data (CEL files) were downloaded from the Gene Expression Omnibus (GEO) [17] in the National Center for Biotechnology Information (NCBI) and normalized using Affymetrix GCOS software (TGT value: 500). We downloaded deep sequencing experimental data (fastq files) from NCBI Sequence Read Archive (SRA) with accession numbers SRP002082 [48, 49], SRP002176 [50], SRP002459 [51] and SRP006767; SNP data (with 30-bp upstream and downstream flanking sequences) were collected from NCBI dbSNP, which included 17 wild and 14 cultivated soybean species [52]. Acyl-lipid pathway data were collected from PMN [28] and the ARABIDOPSIS ACYL-LIPID METABOLISM database [53]. The data source information is listed in Table 2. All the plant motifs were downloaded from the Plant Cis-acting Regulatory DNA Elements (PLACE) database [54], PlantCARE database [55], AthaMap database [56] and related publications [57]. There are total of 797 motifs and 66,207 soybean gene promoter sequences deposited in our database. Additional file 1: Table S3 lists 797 motifs and their occurrence frequency in these promoter sequences.

Table 2 Data source of SFGD

Microarray expression experiment normalization

Microarray expression experiment data (CEL files) were downloaded from GEO [17] in NCBI and normalized using Affymetrix GCOS software (TGT value: 500). Then we computed the average intensities ( x ¯ ), standard deviations (stdev) and standard errors (stderr) using the following formulae:

x ¯ = i = 1 n x i stdev = i = 1 n x i - x ¯ 2 n = 1 stderr = stdev n

Where x is intensity value and n is the number of replications.

High-throughput sequencing data pre-processing

We downloaded deep sequencing experiment data from NCBI SRA, and removed the adapter sequence and low-quality bases at the 3’-end used the FASTX-Toolkit [59] if needed. We then checked read quality with FastQC - a quality control application for fastq files (http://www.bioinformatics.babraham.ac.uk/projects/index.html). Sequence reads that passed purity filtering were aligned with the whole soybean genome through bowtie [60] (version: 0.12.5) allowing for zero mismatches. The number of final reads of a DNA sequence was computed by the following formula:

n = log 2 N + 1

Where N is the number of raw reads mapped on the DNA fragment.

For each gene’s deep sequencing evidence of the 30 samples, each sample was normalized using the RPKM value [61].

RPKM = 10 9 × C NL

Where C is the number of mappable reads falling into the gene, N is the total number of mappable reads in the experiment and L is the gene length in base-pairs.

ZFE (Z-score for expression) and ZFM (Z-score for motif)

Z-score is a statistical measurement of the distance in standard deviations of a sample from the mean. Z-scores allow analysts to convert scores from different data sets into scores that can be accurately compared to each other [62], since they can act as a normalization method to eliminate the difference caused by background for a series of different experiments. For this reason, Z-score transformation statistics have been used in aspects of biology research, such as comparing gene expression between experimental and control groups in microarray analysis [6366].

ZFE: we found that ZFE reflected the relationship of transcriptome change more accurately and visually than expression values (Additional file 2: Figure S2). Here we show the method used to generate the ZFE values from the average intensity values.

For each probe set in an experiment, if one of its average intensity values was greater than the Average Signal (A) (produced by GCOS) and max (average intensity)/min (average intensity) ≥ 2, and for datasets GSE7511, GSE7881, GSE8112, GSE9374, GSE10251 and GSE15100, we generated the ZFE using formula (1); for datasets GSE8432, GSE9730, GSE12300, GSE12314, GSE139631, GSE17883, GSE18822 and GSE20972, we calculated the ZFE using formula (2); otherwise ZFE = 0 [67].

ZFE = x ¯ x ¯ ¯ stdev
(1)
ZFE = log 2 x T ¯ x c ¯
(2)

Where x ¯ ¯ is the average value of all x ¯ in an experiment, x T ¯ is x ¯ of the treatment samples and x c ¯ is x ¯ of the control samples.

ZFM: for one motif or set of motifs (in a gene or list of genes) submitted by users, the ZFM and P-value of each motif is calculated using the following method [68]. Initially, genome-wide genes are divided into several categories by promoter region length, e.g. genes with promoter region lengths < 500 bp, 500-1000 bp, 1-2 kb, 2-3 kb; secondly, ‘m’ submitted genes are classified with the same rules; thirdly, 1000 surrogate sets of ‘m’ promoters are obtained from groups with different promoter length, with sampling of the same proportions; and finally, for the motifi, the number of motifi occurring in the promoters is Nmotif_i, the average occurrence frequency of motifi in these 1000 sets is meanmotif_i, and the standard deviation is stdevmotif_i. The ZFM and P-value of motifi are calculated only if Nmotif_i > meanmotif_i using the following formula:

ZFM = N motif _ i - mean motif _ i stdev motif _ i
P value = 1 pnorm N motif _ i , mean motif _ i , stdev motif _ i

Where pnorm() is the distribution function for the normal distribution in the R package, and the P-value cutoff is 0.1, which is decided with reference to the soybean genetic motif related experimental result, as well as the annotation of motifs in PLACE [54].

MR

The MR [18] method was first used to construct a gene co-expression network in Arabidopsis[20]:

MR = ab

For two genes (or probe sets) X and Y, first all microarray expression values are used to respectively calculate all genes’ PCC values for X and Y. Then PCCs are respectively sorted from large to small for X and Y; then ‘a’ (start from 1, with step size of 1) is used to mark the position of Y in X’s list, and similarly, ‘b’ marks the position of X in Y’s list. For the hub gene, the network shows the top 10 co-expressed genes according to the arrangement of MR from small to large in the first level; for the second level, the top five genes that co-expressed with first level genes are selected; and the selection of genes in the third level follows the same rules as for the second level. The threshold of MR is in reference to the already published Arabidopsis co-expression network [19].

miRNA:target alignment criteria

In plants, miRNAs generally precisely direct their mRNA targets for endonucleolytic cleavage [6971]. To predict potential targets of the miRNAs in this study, we developed a set of computational ‘criteria’ for soybean miRNA-target interaction based on previous research [69, 72]. The ‘criteria’ as following: (a) The mismatch of miRNA:target must be ≤ 4 and the mismatch in the first half-pair must be ≤ 2.5 (G:U = 0.5); (b) The continuous mismatch must be ≤ 2 and continuous mismatch between 2 and 12 bp must be ≤ 1; (c) The bases 11 and 12 must be a perfect match; and (d) The minimum free energy (MFE) of miRNA:target must be < 75% compared with when the miRNA perfectly matched with its target. With these criteria, we successfully predicted 1873 miRNA-gene pairs.

Utility and results

SFGD uses Generic Genome Browser (Gbrowse) as an integration platform for soybean transcriptome data (Table 2) and generates some data mining tools, such as ‘gene co-expression network’, ‘pathway’ and ‘motif’. The tool ‘gene co-expression network’ aims to display a gene’s co-expressed genes and miRNA:target interaction; ‘pathway’ is a functional tool containing soybean acyl-lipid metabolic data; ‘motif’ is a tool for searching and analyzing significant cis-elements in one or more genes; and ‘pattern’ is a tool for searching tissue/time specific genes based on microarray expression experiments (Additional file 3: Table S1), and some other function modules such as BLAST and general search.

Gbrowse: a repository of the transcriptional information

Gbrowse is a genome viewer. Soybean protein coding genes, full-length cDNAs, miRNA precursors, probe set consensus sequences, microarray expression experiments and deep sequencing data have been integrated into Gbrowse with clickable links to a new webpage. This webpage shows descriptions of each keyword: e.g. ‘gene detail information web page’ shows the co-expression gene network, microarray expression profile, deep sequence evidence and coding sequence; for ‘microarray expression data’, a diagram displays the other probe sets related to it with PCC > 0.7 or PCC < -0.65.

The coding gene’s result page includes gene annotation, the list of co-expressing genes, 14 microarray expression experiments [only if there is a probe set(s) in its context] and deep sequence evidence of it in the 30 deep sequencing samples. Following are the gene’s genome location information and sequences (including promoter, cDNA and protein sequences).

For the microarray expression profile section (e.g. ‘GmaAffx.88235.1.S1_at’), we have integrated 14 experiments in total. Each has horizontal bars according to the experiment’s treatment and control expression level generated using ZFE (red bars represent up-regulated and green bars down-regulated, Additional file 4: Figure S1). For example, if users click horizontal bars below ‘GSE7511’ (‘Expression data from soybean seed compartments with embryos at the heart stage’), it will show gene Glyma15g34770’s (represented by probe set ‘GmaAffx.88235.1.S1_at’) expression profile in different tissues of soybean seed. With ‘GmaAffx.88235.1.S1_at’ is included a diagram generated using expression values and standard errors, values table display and the probe sets most positively and most negatively correlated (PCC > 0.7 and PCC < -0.65, respectively). Users can get all these probe sets’ expression values and line charts produced by expression values and ZFEs (Additional file 2: Figure S2) and send the probe set list to agriGO [73] for gene ontology (GO) enrichment analysis. Users can also get the promoter sequence of genes that are positively correlated with this gene and scan the cis-elements contained for significance analysis.

miRNAs are ~21-nt-long endogenetic non-coding small RNAs that function as post-transcriptional regulators in eukaryotes [74]. All miRNAs in SFGD came from recently published literature [3845]. Here, soybean miRNA information is integrated into PMRD (Plant MicroRNA Database) format [58], which mainly includes four regions: precursor, mature, target and reference regions.

The repository also has 30 RNA-seq samples that cover various tissue types and stress conditions. This can be used for deep sequencing evidence of one gene’s expression in particular tissues or at particular time points.

For other content in Gbrowse–such as full-length cDNA, probe set consensus sequence and SNP information–users can obtain their description and related information by clicking the corresponding icons.

Gene co-expression network

We generated a soybean gene co-expression regulatory network based on 255 samples from 14 microarray expression experiments using PCC and MR [18]. After converting the probe sets to the corresponding genes, we obtained a gene co-expression network including 23,267 genes represented by 37,593 probe sets (one gene may be represented by one or more probe sets).

There are 240,496 edges left after setting the MR threshold ≤ 30, giving 20.7 edges for each gene on average–Additional file 5: Table S2 lists the top 10 genes with the greatest number of edges as well as the genes’ annotations. Cytoscape Web software [75] is used to display the gene network, and for each gene (central node), all its co-expression genes are sorted according to MR values and several with the top values are displayed. There is a recursive search using this process for the second and third levels. For the first level [genes (subprime nodes) immediately linked with the central gene], the top 10 MR-value genes are displayed. For the second level (genes directly linked with first level genes, linked with central node through subprime nodes) the top 5 MR-value genes for each subprime node are shown. In the third level, although we collected all top 10 MR-value genes of second level genes, only those which have previously displayed in the first and second levels remain, because other genes are not likely to be closely related to the central gene. A table is also generated to show all genes and gene annotations appearing in the Cytoscape Web file–users can use these genes to scan their promoter motifs and analyze their significance using the textbox at the bottom of the web page.

The soybean miRNA: target information was also integrated into the network. Soybean miRNA:target pairs were predicted using in-house programs with all miRNAs and mRNA genes deposited in our database, then we compared our results with psRNATarget [76]. Finally we separated the results into two parts: high credibility miRNA:targets (338 pairs, predicted by both programs) and low credibility miRNA:targets (1535 pairs, predicted by one of the programs).

Acyl-lipid metabolic pathways

Due to the importance of oil crops, the need for acyl-lipid metabolism research is self-evident. Through integrating data from PMN [28] and prediction using Arabidopsis lipid-related genes, we obtained 221 enzymes and more than 1550 genes involved in 15 soybean acyl-lipid metabolism pathways (Table 3). The pathway card page provides a user-friendly view of lipid metabolic pathways and information about metabolism alias names, mass-to-charge ratios and chemical formulae. Users can also search for their lipid of interest on the lipid page.

Table 3 Soybean acyl-lipid metabolism pathway

Motif scan

If the user’s input is a fasta format sequence of a promoter, all motifs appearing in the sequence are listed, and these motifs’ occurrence frequencies counted using the in-house program.

Cis-element significance analysis

For each gene’s promoter sequence or a sequence submitted by the user, all the potential motifs are scanned. If a list of gene names is submitted, all motifs appearing in their promoter region will be scanned. The platform also offers the result of significant motifs in promoters of these genes.

Expression pattern

In order to find genes’ specific expressions under specific conditions, we selected four experiments with different tissues and stress conditions. GSE7511, which contains 23 samples to identify all genes active in 10 different compartments, was isolated using the Leica AS LMD system from heart-stage seed. GSE7881, includes 18 samples, identified all genes active in cotyledon-stage seed within nine isolated compartments. GSE8112, contains 34 samples obtained from early maturation stage of seed, which was isolated within 17 compartments. GSE8432 contains 27 samples obtained from a soybean plant PI200492 with two treatments: HW94-1 (Phakopsora pachyrhizi isolates, which produces a resistant reaction in the host) or TW72-1 (P. pachyrhizi isolates, which produces a susceptible reaction in the host) at four time points (6, 12, 24 and 48 h post inoculation). Furthermore, we developed two function tools in the corresponding web page: ‘Expression pattern search’ and ‘Tissue/time specific search’.

For the ‘Expression pattern search’ function, selections submitted by users are converted to corresponding values – high (or up-regulated), even and low (or down-regulated) represented by 1, 0 and -1, respectively–then PCC is calculated between users’ selections and ZFE of data in our database, using a PCC threshold of 0.7.

The ‘Tissue/time specific search’ toolbox only returns probe sets that are highly expressed in the specific condition according to the ZFE set up by users. The resulting web pages for both ‘Expression pattern search’ and ‘Tissue/time specific search’ also include links to ‘agriGO’ [73] for GO enrichment analysis and ‘motif scan’ function for cis-element significance analysis as described above.

Here, we selected the biosynthesis pathway of triacylglycerol (TAG) as an example to introduce the application of our database. Figure 1A shows the biosynthesis pathway of TAG and Figure 1B shows the tissue-specific expression of genes that encode the enzymes involved in this pathway–all these data can be obtained from the SFGD platform. Additional file 4: Figure S1 shows a screenshot of the web page with brief information (a coding gene, the corresponding probe set, GmaAffx.88235.1.S1_at and one microarray expression series, and ten deep sequence samples). Figure 1C shows the co-expression network of WRI1 (Glyma15g34770), which has been reported to regulate the process of TAG biosynthesis by affecting the synthesis precursors in developing embryos [77, 78]. Table 4 lists all soybean coding genes that appeared in Figure 1C and their annotations, and also gives these genes’ orthologs in Arabidopsis. Users can also do cis-element significance analysis, and a job ID is produced by the server–the results are stored in the server for several weeks so that they can be traced back. The resulting web-page only shows motifs with P < 0.05, and Table 5 is one example of the significance analysis results for all genes that appeared in Table 4. Table 5 shows that some motifs may be closely related to seed development, such as ‘ABREBNNAPA’ and ‘ABRETAEM’, with their annotations ‘napA; storage protein; ABRE; napin; seed’ and ‘ABA; ABRE; EMBP-1; seed’, respectively.

Figure 1
figure 1

The tissue-specific gene expression and regulation network of the triacylglycerol (TAG) biosynthesis pathway. Multiple genes of interest in the TAG biosynthesis pathway (A) are simultaneously shown in a heatmap (B) with the same color. The co-expression network of WRI1 (Glyma15g34770) (C), which regulates the biosynthesis of TAG.

Table 4 Soybean lipid biosynthesis related genes (WRI1 network genes)
Table 5 Motif significance analysis results of soybean triacylglycerol biosynthesis related genes

More function modules

Home page: this describes the whole database project and other soybean resources.

General search page: SFGD supports fuzzy queries by one or a list of gene names, flcDNA IDs, probe set IDs or function keywords. Users can convert gene names to corresponding probe set names by input of a list of gene names, and vice-versa.

BLAST search: a BLAST search toolbox is provided and users can do BLAST searches against soybean cDNA, flcDNA, probe set consensus and protein sequence.

Download: users can download all coding genes’ sequences and their annotations. Full-length cDNA sequences, consensus sequences and annotations appear in the database.

The SFGD was constructed using Hypertext Markup Language (HTML), perl CGI (http://www.perl.com) and the MySQL 4.0 (http://www.mysql.com) database engine. Figure 2 shows an overview of our websites. The whole SFGD database is run on a server using the LINUX operating system.

Figure 2
figure 2

Structure of the SFGD database. Rectangles with rounded corners are pages in the database, and the directed lines show linkage for pages.

Discussion

There is no doubt that a novel and comprehensive method of integration and analysis for soybean omics data will provide valuable insights and help to improve soybean. However, a powerful mining tool for omics data is limited. To fill the gap of a platform for acyl-lipid comprehensive analysis and information for high-throughput omics data, we developed SFGD–a comprehensive and integrated database for mining functional information from soybean transcriptome data and identifying acyl-lipid metabolism pathways. This database include a variety of information, for instance, microarray expression data with 255 samples from 14 experiments, 30 mRNA-seq samples belonging to four experiments, other genomic data including coding and non-coding genes (miRNA), as well as a set of SNPs from 17 wild and 14 cultivated soybean sub-species [52]. By integration and analysis of these data, we provide a one-stop-shop resource for acyl-lipid researchers, we also integrated the genome and transcriptome information into Gbrowse, which can provide the tissue and time points for specific expression information. We then built some function toolboxes: e.g. gene co-expression network, cis-element significance analysis tool, expression pattern search module and some other functions.

As an example of TAG biosynthesis in the previous description (Figure 1A), Gbrowse provides gene expression information on various tissue types and stress conditions. From these data, we can infer the main place, time or possible function for a specific enzyme (Figure 1B).

The biosynthesis of TAG occurs at the endoplasmic reticulum and probably also involves reactions at the oil body [79]. The classical pathway of TAG biosynthesis in seeds is the acyl-CoA dependent Kennedy pathway (or the glycerol phosphate pathway)–this pathway consists of sequential acylation and subsequent dephosphorylation of glycerol-3-phosphate (G3P). The first acylation of G3P is catalyzed by glycerol-3-phosphate acyltransferase (GPAT; EC 2.3.1.15). GPAT was first cloned from Saccharomyces cerevisiae; and a member of the Arabidopsis GPAT gene family displayed GPAT activity when expressed in yeast. However, mutations of GPAT gene in Arabidopsis do not appear to affect seed oil level, but play a role in production of cutin and suberin [8082]. In our search through the transcriptome profiling data, we found few GPAT family genes were highly expressed in seed at the development stage. Of course, this does not suggest that there is no relationship between GPAT family and seed oil level. Genes encoding GPAT9 are most likely to be important in soybean oil synthesis and have significantly higher expression in seeds at different stages. Similar results were also reported in castor [83].

The second acylation is catalyzed by 1-acylglycerol-3-phosphate acyltransferase (LPAAT; EC 2.3.1.51). A number of LPAAT genes are highly expressed in the dry whole seed, suggesting involvement in the storage rather than the synthesis stage of oil metabolism.

Phospatidate phosphatase (PP; EC 3.1.3.4) catalyzes the dephosphorylation of phosphatidic acid to form sn-1,2-diacylglycerol [84]. A variety of PP types and isoforms exist in plants but their exact role in TAG biosynthesis is unclear. In our analysis results, we found that some PP genes were highly expressed in dry seeds–e.g. PHA1 (Glyma19g35930), PHA2 (Glyma 17 g36130, Glyma06g04230) and LCBCP (Glyma19g39180)–these genes will be candidates for further study.

The final acylation reaction, converting diacylglycerol (DAG) to TAG is catalyzed by enzymes. In Arabidopsis, DGAT1 has been shown to play a role in seed oil accumulation [8587], and has also been reported as a key enzyme determining oil content and composition in maize [88]. In our database, most DGAT genes were highly expressed in seeds and especially dry whole seeds. Therefore, we predict that DGAT is a crucial enzyme for oil production of soybean–a suggestion also supported by previous study [7].

In acyl-CoA independent TAG synthesis pathway, lysophosphatidylcholine acyltransferase (LPCAT; EC 2.3.1.23) activity controls the regeneration of phosphocholine (PC) from lyso-PC. In our dataset, two members are highly expressed in floral buds and early-maturation seed, suggesting that the enzyme plays a role in early stages of oil synthesis but not during the storage phase.

DAG can also be acylated using PC as the acyl donor by a phospholipid:diacylglycerol acyltransferase (PDAT; EC 2.3.1.43). PDAT activity has been discovered in yeast and plants’ developing oil seeds [84]. PDAT1 and DGAT1 have overlapping functions in TAG biosynthesis in developing seeds, and the absence of DGAT1 is evidently compensated by PDAT1. However, the degree that each enzyme contributes to TAG biosynthesis in developing seeds is unknown. In our dataset, some PDAT genes are highly expressed in early maturation-stage seed, and others are highly expressed in seedlings and leaves, suggesting that in other tissues, PDAT-acylated TAG synthesis plays an important role.

DAG can also be converted to phosphatidylcholine (PtdC) via the action of phosphatidylcholine:diacylglycerol cholinephosphotransferase (PDCT) or sn-1,2-diacylglycerol:cholinephosphotransferase (CPT). It has been reported in soybean seeds that about 60% of newly synthesized acyl chains directly incorporate into the sn-2 position of PC through an acyl-editing mechanism rather than a pathway for sequential acylation of G3P [89]. PDCT has a clear seed-specific expression, but DAG-CPT does not have a clear tissue-expression–this does not immediately suggest a major role for DAG-CPT in tri-ricinolein synthesis as reported in castor [83].

In our gene co-expression regulation network, we use PCC and MR [18] values to mark the relationship between each gene (probe set). Many transcription factors are involved in a complex network to regulate TAG production [90, 91]. The mutants for WRINKLED1 produce wrinkled seeds with severe depletion of TAGs [77, 78], which can also be triggered by other transcription factors, such as LEC2, FUS3 and ABI3 [92, 93]–with our co-expression network tool, their co-expression interaction can also be found. In addition, some other transcription factors were detected in this co-expression network (Table 4), such as AREB3, NF-YB6 and MYB65, suggesting that they may participate in the regulated network for TAG synthesis. We also compared our results with the soybean gene network in PlaNet [27] (Additional file 6: Table S4). The Glyma15g34770 network in PlaNet lists 37 genes co-expressed with WRI1, and there are 33 genes in our network, and seven of them overlap with results from PlaNet, and form the ‘band 7 family’, ‘LACS9’, ‘DREB2A’, ‘AREB3’, ‘CYP76G1’ and so on. There are some differences between these two web sites, mainly due to the diversity of microarray expression experiments used in these two web services: e.g. PlaNet’s microarray expression experiments are all tissue-specific treatments (Additional file 7: Table S5, these treatments from the top three microarray expression experiments appear in Additional file 3: Table S1), and we included some other experiments, such as time series experiments, gene mutant experiments and plant responses to biotic and abiotic stresses.

Table 5 shows the significance analysis results of motifs for all genes that appeared in Table 4, and some keywords have a close relationship with seed development, such as motif ‘ABREBNNAPA’, which has been reported as conserved in many storage-protein gene promoters of seeds [9497] and motif ‘ABRETAEM’ play a role in seed gene expression and response to ABA [98, 99]. This information gives some indication of interaction between motif and function.

The advantages of microarray expression experiments are the maturation of this technology, high-throughput and many types of analysis software. There are some disadvantages of microarray experiments in that they do not cover all soybean genes, microarrays cannot identify new genes and there may be errors when expression levels are close to background signals. Thus the next deep sequencing experiments may play important roles in constructing the gene network.

Conclusions

Currently we have integrated soybean genome data, full-length cDNA, microarray expression experiments, deep sequencing data, miRNA precursor and SNP information into our database (Table 2). We have also developed a soybean gene co-expression regulatory network web service, a graphical display of acyl-lipid metabolic pathways, a cis-element significance analysis toolbox, an expression pattern for positive/negative search function modules and other general tools in the database. We hope this will improve the accuracy and robustness of soybean functional genomics analysis, and further hasten understanding of the gene regulatory networks for effective crop improvement. SFGD is freely available at http://bioinformatics.cau.edu.cn/SFGD/, and it will be updated every 3-6 months with the development of soybean research; however, the update must be performed by the administrator.

Availability and requirements

The database is available at http://bioinformatics.cau.edu.cn/SFGD/ and is compatible with most modern web browsers. The user’s browser must have JavaScript enabled to show query examples and Cookie and Flash to show the expression curves.