Introduction

Over the last few years, transcriptome coexpression analysis has become a routine method for functional genomics studies in Arabidopsis. In this analysis we predict the function of genes on the basis of a simple assumption that a set of genes involved in a particular biological process can be coexpressed under the control of a shared regulatory system. In other words, if a gene of an unknown function is coexpressed with a set of genes involved in a particular biological process, it can be assumed to be one of the components of the same biological process. The development of comprehensive methods to measure mRNA accumulation such as DNA array and bioinformatics tools for handling large-scale datasets has enabled transcriptome coexpression analysis based on hundreds of transcriptome data. Individual biologists perform transcriptome analysis using DNA arrays to find answers to specific questions, such as how gene expression patterns change in plants under specific conditions of interest. DNA array data thus obtained are deposited in public databases. On the other hand, DNA array data has been systematically acquired by the AtGenExpress (Goda et al. 2008; Kilian et al. 2007; Schmid et al. 2005) and NASCArrays (Craigon et al. 2004) by using the same analytical platform, i.e., Affymetrix GeneChip microarray. This led to the development of secondary databases equipped with web-based coexpression analysis tools that help in calculating and storing the information regarding the level of similarity of gene expression patterns, and this information is made available to users.

We have analyzed the transcriptome of nutrient-starved Arabidopsis since 2001. During the data-mining of the in-house dataset obtained in our lab, we found that coexpression analysis is a powerful technique to identify candidate genes involved in glucosinolate (GSL) biosynthesis. More recently, as mentioned above, the development of web-based analytical tools has enhanced the predictive power of transcriptome coexpression analysis. In this review I briefly describe the methodology of coexpression analysis and discuss its advantages and disadvantages in the context of its ability to predict the functions of genes involved in GSL biosynthesis. For a general review of coexpression analysis and network representation of coexpression relationships, please refer to other reviews (Aoki et al. 2007; Saito et al. 2008). In order to avoid any overlap with other reviews in this special issue, I have not discussed the details of the characterization of gene functions.

A brief overview of coexpression analyses

In coexpression analysis, the similarity of gene expression patterns across a variety of experimental conditions is evaluated for each pair of genes using a statistical measure such as Pearson’s correlation coefficient (PCC). Both in-house datasets and publicly available datasets can be used for this calculation, although the results obtained will differ. In-house transcriptome data are often obtained under specific conditions of interest (e.g. sulfur starvation in my study), and hence a high similarity of expression patterns indicates a coexpression relationship under those specific conditions (e.g. coexpression under sulfur starvation), that is, a condition-dependent coexpression relationship. In contrast, the thousands of transcriptome data available in the public databases have been obtained under a wide range of experimental conditions, and hence a high similarity coefficient calculated on the basis of a publicly available dataset indicates a constitutive or condition-independent coexpression relationship; that is, a set of genes with high similarity are coexpressed across a variety of experimental conditions. Many coexpression analysis tools have been released recently, such as ATTED-II (the Arabidopsis thaliana trans-factor and cis-element prediction database) (Obayashi et al. 2007), CSB.DB (the Comprehensive Systems-Biology Database) (Steinhauser et al. 2004), BAR (the Botany Array Resource) (Toufighi et al. 2005), ACT (the Arabidopsis Co-expression Tool) (Jen et al. 2006; Manfield et al. 2006), Genevestigator (Zimmermann et al. 2005; Zimmermann et al. 2004), PED (Plant Gene Expression Database) (Horan et al. 2008) and CressExpress (Srinivasasainagendra et al. 2008). Some of these give users the option of selecting the dataset for which the degree of similarity is calculated. When a whole dataset (e.g. all data obtained by AtGenExpress) is selected for analysis, the constitutive coexpression relationship is elucidated. However, by selecting a subset of a dataset (e.g. developmental-series, stress-series, or hormone-treatment-series data of AtGenExpress), a condition-dependent, or context-specific, coexpression relationship can be determined, as is the case with coexpression analysis using in-house datasets.
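The difference between constitutive and condition-dependent coexpression can be illustrated with a minimal Python sketch. It is not the implementation used by any of the tools listed above; the gene names, condition labels, and expression values are hypothetical toy data chosen only to show that the PCC of the same gene pair can differ between a whole dataset and its subsets.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient (PCC) of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy dataset: one expression value per array (condition).
conditions = ["dev1", "dev2", "dev3", "stress1", "stress2", "stress3"]
expression = {
    "geneA": [1.0, 2.0, 3.0, 5.0, 3.0, 1.0],
    "geneB": [2.0, 4.0, 6.0, 1.0, 3.0, 5.0],
}

def coexpression(gene1, gene2, series=None):
    """PCC over the whole dataset, or over the arrays whose label starts with `series`."""
    idx = [i for i, c in enumerate(conditions)
           if series is None or c.startswith(series)]
    x = [expression[gene1][i] for i in idx]
    y = [expression[gene2][i] for i in idx]
    return pearson(x, y)

pcc_whole = coexpression("geneA", "geneB")             # all six arrays
pcc_dev = coexpression("geneA", "geneB", "dev")        # developmental subset
pcc_stress = coexpression("geneA", "geneB", "stress")  # stress subset
```

In this toy example the two genes are perfectly correlated within the hypothetical developmental series and perfectly anti-correlated within the stress series, while the PCC over the whole dataset is weak, which is why the choice of dataset matters when interpreting a coexpression relationship.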

Coexpression analysis has several advantages in predicting gene functions: (1) Researchers do not need to conduct “wet” experiments in order to predict the function of unknown genes of interest. The coexpression relationship of these genes with genes of known function, as well as the sequence similarity between the two sets of genes, will provide clues to predict gene function. (2) Researchers can identify the components involved in a particular biological process. Evidently, however, coexpression analysis works for this purpose only when a complete biological process is coordinately regulated at the level of mRNA accumulation. (3) Even if the knockout of genes belonging to a gene family whose members have unknown biological functions fails to reveal any apparent phenotype, the function of these genes can be predicted on the basis of their coexpression relationship with other genes (Rautengarten et al. 2005). (4) Coexpression analysis can even be conducted using a non-targeted approach without any preexisting hypothesis. In other words, coexpression relationships can often be determined from a set of transcriptome data irrespective of the original purpose of the experiments by which the data were obtained.

Prediction of the genes involved in glucosinolate biosynthesis – a case study of coexpression analysis

In this section I briefly describe the transcriptome analysis of nutrient-starved Arabidopsis conducted in our lab. As mentioned above, during the course of our study, we realized that coexpression analysis is very useful for identifying candidate genes involved in GSL biosynthesis.

In order to understand the plant’s response to sulfur deficiency by omics-based approach, we conducted an integrated analysis of the transcriptome and metabolome of sulfur-starved Arabidopsis (Hirai and Saito 2008; Hirai et al. 2004, 2005). Time-series data for the transcriptome and the metabolome of leaves and roots were obtained, and analyzed by batch-learning self-organizing mapping (BL-SOM), a sophisticated form of multivariate analysis (Abe et al. 2003; Kanaya et al. 2001). BL-SOM, along with other clustering algorithms such as k-means and hierarchical clustering, can be used for co-occurrence analysis of genes and metabolites. When BL-SOM is applied to transcriptome and/or metabolome data, the genes and/or metabolites can be classified into the cells on a 2-dimensional lattice called a feature map on the basis of the similarity of expression and/or accumulation patterns. In this analysis, we defined a set of co-occurring genes and/or metabolites as a cluster. We identified many clusters, for example, a set of the genes involved in anthocyanin biosynthesis and a set of those involved in sulfate assimilation (Hirai et al. 2005). Several Met- and Trp-derived GSLs were classified into a single cluster, suggesting that GSL metabolism is coordinately regulated under sulfur deficiency. This idea was supported by the finding that the known GSL biosynthetic genes—the MAM (methylthioalkylmalate synthase), CYP79 and CYP83 families, SUR1 and AOP2—were classified into another single cluster. This indicated that GSL biosynthetic genes are coexpressed under sulfur deficiency probably via a shared regulatory mechanism. 
On the basis of the coexpression relationship with the previously-characterized genes mentioned above, we identified the following genes as candidates involved in GSL biosynthesis: three putative sulfotransferase genes (AtSOT16/At1g74100, AtSOT17/At1g18590, and AtSOT18/At1g74090), an S-glucosyltransferase gene (UGT74B1/At1g24100), a putative Tyr aminotransferase gene (At5g36160), and two putative glutathione S-transferase (GST) genes (GSTF11/At3g03190 and GSTU20/At1g78370) (Hirai et al. 2005). To date, some of these candidate genes have been characterized experimentally. The predicted functions of the AtSOTs and UGT74B1 have been confirmed by concurrent studies (Hirai et al. 2005; Piotrowski et al. 2004; Douglas Grubb et al. 2004).

In the same analysis using in-house dataset, we identified several genes encoding transcription factors, including Myb28 (At5g61420) and Myb29 (At5g07690), as the candidate positive regulators of GSL biosynthesis. We also analyzed constitutive coexpression relationship by ATTED-II (Obayashi et al. 2007) using a whole dataset of AtGenExpress (1,388 ATH1 arrays), and found that Myb28 and Myb29 were coexpressed only with the genes involved in Met-derived GSL biosynthesis. The known Met-derived GSL genes were highly coexpressed with Myb28, but to a lesser extent with Myb29. This analysis suggested that Myb28 and Myb29 may be transcription factors positively regulating Met-derived GSL biosynthesis, but not Trp-derived GSL biosynthesis. Reverse-genetic and molecular biological experiments have proved Myb28 to be a key transcription factor that positively regulates Met-derived GSL biosynthesis and Myb29 to be a transcription factor probably involved in methyl jasmonate-mediated induction of GSL biosynthesis (Hirai et al. 2007). Concurrently, several groups have independently found that Myb28, Myb29 and Myb76 (At5g07700) are the positive regulators of Met-derived GSL biosynthesis and that Myb51 (At1g18570) and Myb122 (At1g74080), as well as previously-characterized Myb34 (At5g60890), are the positive regulators of Trp-derived GSL biosynthesis (Beekwilder et al. 2008; Gigolashvili et al. 2007ac; Sonderby et al. 2007; Malitsky et al. 2008). These authors have discussed the specific functions of individual Mybs, the mutual regulation among these Mybs and the mutual regulation between Met- and Trp-derived GSL pathways (see other reviews in this issue).

In our analysis, AtBCAT-3 (At3g49680) and AtBCAT-4 (At3g19710) were also coexpressed with the Met-derived GSL biosynthetic genes of known function (Hirai et al. 2007), suggesting the involvement of these genes in Met side-chain elongation. The function of these genes has recently been confirmed, and AtBCAT-3 was shown to function in both GSL and amino acid biosynthesis (Knill et al. 2008; Schuster et al. 2006). We also identified other candidate genes involved in Met-derived GSL biosynthesis, although these predicted functions remain to be confirmed: AtGSTU20, AtGSTF11, PMSR2 (At5g07460), and the homologs of bacterial Leu biosynthetic genes named AtLeuC1 (At4g13430), AtLeuD1 (At2g43100), AtLeuD2 (At3g58990), and AtIMD1 (At5g14200). With regard to the GST genes, it has been suggested that GST-type enzymes may be components of an enzyme complex formed by CYP83s and C-S lyase (Mikkelsen et al. 2004). The PMSR2 gene encodes a cytosolic peptide methionine sulfoxide reductase. Because a null mutation in this gene resulted in reduced growth in Arabidopsis under short-day conditions, it was hypothesized that the role of PMSR2 is to repair oxidized proteins in a short-day photoperiod (Bechtold et al. 2004). We speculate that the PMSR2 protein can recognize the methylsulfinyl moiety of methylsulfinylalkyl GSL as well as that of peptide methionine sulfoxide, and that hence, this enzyme may have some function in the side-chain conversion of Met-GSLs, although FMOGS-OX has been shown to be responsible for the conversion of methylthioalkyl GSLs to methylsulfinylalkyl GSLs (Hansen et al. 2007). We assumed that the homologs of Leu biosynthetic genes are involved in Met side-chain elongation for the following reason. The reactions involved in Met side-chain elongation are similar to those involved in Leu biosynthesis; moreover, the enzymes involved in Met side-chain elongation and Leu biosynthesis are presumably encoded by homologous genes belonging to the same gene families. 
In fact, MAM genes and IPMS (isopropylmalate synthase) genes, which are responsible for Met side-chain elongation and Leu synthesis, respectively, share sequence similarity with each other and with bacterial IPMS (de Kraker et al. 2007; Field et al. 2004; Kroymann et al. 2001). All of the above-mentioned candidate genes are under the transcriptional regulation involving Myb28 (Hirai et al. 2007). In addition, UGT74C1 (At2g31790), which is assumed to be involved in Met-derived GSL biosynthesis on the basis of the coexpression analysis (Gachon et al. 2005), is positively regulated by Myb28 (Hirai et al. 2007). On the other hand, a putative Tyr aminotransferase gene mentioned above is not regulated by Myb28 (Hirai et al. 2007), suggesting that it may encode a C-S lyase involved only in Trp-/Phe-derived GSL biosynthesis. The reason for this assumption was that the C-S lyase gene SUR1 had been originally misannotated as a Tyr aminotransferase. Another possibility is that this gene may encode a Phe aminotransferase. Arabidopsis ecotype Columbia contains 2-phenylethyl GSL derived from homoPhe. If homoPhe is formed from Phe via a reaction mechanism similar to that involved in the formation of homoMet from Met, Phe must be transaminated by an aminotransferase prior to condensation with acetyl-CoA for the side chain to extend.

The advantages and limitations of coexpression analysis for glucosinolate biosynthetic genes

As described above, the coexpression analysis could predict many, although not all, of the genes involved in the biosynthesis of GSLs, especially Met-derived GSLs. This implies that the genes responsible for Met-derived GSL biosynthetic pathway (side-chain elongation, core structure formation, and side-chain modification) may be coordinately controlled by a limited number of regulatory components including Myb28, Myb29, and Myb76, at the mRNA accumulation level. Coexpression analysis could also effectively predict the candidate genes involved in the other secondary pathways, such as flavonoid and anthocyanin biosynthesis (Tohge et al. 2005; Vanderauwera et al. 2005; Yonekura-Sakakibara et al. 2007).

Quantitative trait locus (QTL) analysis is a powerful tool for identifying candidate genes involved in GSL biosynthesis as well as those involved in hydrolysis, for example, ESM1 (Epithiospecifier modifier 1, At3g14210; Zhang et al. 2006). ATTED-II analysis using a whole dataset showed weak correlation between ESM1 and ESP (Epithiospecifier protein, At1g54040) (data not shown). MAM genes that encode one of the Met side-chain elongation enzymes were also identified and characterized on the basis of QTL analysis (Field et al. 2004; Textor et al. 2004; Kroymann et al. 2001, 2003). To my knowledge, however, some other genes that are involved in the Met side-chain elongation process, namely, MAM-I (coding for methylthioalkylmalate isomerase) and MAM-D (coding for methylthioalkylmalate dehydrogenase), have not been identified by QTL analysis, presumably because natural variation in these genes does not result in natural variation in metabolites. In contrast, coexpression analysis could distinguish the candidate genes, i.e., MAM-I and MAM-D, from the putative Leu biosynthetic genes among the members of the same gene families (AtLeuCs, AtLeuDs, and AtIMDs) (Hirai et al. 2007). On the other hand, coexpression analysis requires previously-characterized genes such as MAMs as “guide genes” (Lisso et al. 2005), with which genes of unknown function are associated depending on whether a coexpression relationship occurs. A combination of QTL analysis and coexpression analysis led to the identification of a flavin-monooxygenase (FMO) gene, FMO GS-OX, which is responsible for the side-chain modification of Met-derived GSLs (Hansen et al. 2007).
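The guide-gene idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure used in the studies cited above: the function name `rank_by_guides`, the scoring rule (mean PCC with all guide genes), and the gene labels and expression values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient (PCC) of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def rank_by_guides(expression, guides):
    """Rank non-guide genes by their mean PCC with the guide genes (assumed scoring rule)."""
    ranked = []
    for gene, profile in expression.items():
        if gene in guides:
            continue
        scores = [pearson(profile, expression[g]) for g in guides]
        ranked.append((gene, sum(scores) / len(scores)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Hypothetical toy profiles over four arrays; labels are placeholders, not real genes.
expression = {
    "guide1": [1.0, 2.0, 3.0, 4.0],
    "guide2": [2.0, 4.0, 6.0, 8.0],
    "candidateX": [1.0, 3.0, 5.0, 7.0],
    "candidateY": [4.0, 3.0, 2.0, 1.0],
    "candidateZ": [1.0, 3.0, 2.0, 4.0],
}

ranking = rank_by_guides(expression, ["guide1", "guide2"])
for gene, score in ranking:
    print(f"{gene}\t{score:.2f}")
```

Genes at the top of such a ranking are only candidates: as with any guide-gene approach, the result depends on which characterized genes are chosen as guides and on the dataset used to calculate the PCC.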

Although several Myb transcription factors controlling GSL biosynthesis could be predicted by coexpression analysis, this methodology is not sufficiently versatile to identify all regulatory genes. While the functions of at least three Mybs (Myb28, Myb29, and Myb34) could be predicted by coexpression analysis (see Fig. 1), SLIM1, which codes for a transcriptional regulator involved in down-regulation of GSL biosynthetic genes under sulfur deficiency, could never be identified by coexpression analysis, because SLIM1 itself is not regulated at the mRNA accumulation level under sulfur deficiency (Maruyama-Nakashita et al. 2006). SLIM1 is presumably regulated post-transcriptionally in response to sulfur deficiency. Among Myb28, Myb29, and Myb34, at least Myb34 was shown to be down-regulated via a SLIM1-dependent mechanism in the roots of sulfur-starved Arabidopsis (Maruyama-Nakashita et al. 2006). The other regulators of GSL metabolism, IQD1 (At3g09710) (Levy et al. 2005), TFL2 (At5g17690) (Kim et al. 2004), and OBP2 (At1g07640) (Skirycz et al. 2006), did not show any obvious correlation with the known GSL biosynthetic genes in an ATTED-II analysis performed using a whole dataset (data not shown).

Fig. 1

A correlation network comprising the known and candidate GSL biosynthetic genes. Coexpression relationships were analyzed by using Correlated Gene Search in PRIMe (Platform for RIKEN Metabolomics, http://prime.psc.riken.jp/) (Akiyama et al. 2008) using the following 35 genes as queries: Myb28, Myb29, Myb76, AtBCAT-4, AtBCAT-3, MAM1, MAM3, AtLeuC1, AtLeuD1, AtLeuD2, AtIMD1, CYP79F1, CYP79F2, CYP83A1, AtGSTU20, AtGSTF11, SUR1, UGT74B1, UGT74C1, AtSOT17, AtSOT18, FMO GS-OX, AOP2, PMSR2, MYB34, MYB51, MYB122, CYP79B2, CYP79B3, CYP83B1, AtGSTU8, AtGSTU3, AtSOT16, putative Tyr aminotransferase, CYP79A2. We did not include the genes responsible for Met and Trp biosynthesis in the queries, although some of them are regulated by some of the Mybs described here. AtGSTU8 and AtGSTU3 were coexpressed with known GSL biosynthetic genes under sulfur deficiency (Hirai et al. in press). Parameter settings were as follows: Matrix, All data sets v.3 (1,388 data of AtGenExpress); Method, interconnection of sets. The correlation data used in PRIMe have been released by ATTED-II. The gene pairs with PCC greater than 0.50 were selected, and the network was visualized by BioLayoutJava (Goldovsky et al. 2005). Transcripts from CYP79F1 and CYP79F2 cross-hybridize to the same probe sets on a GeneChip microarray and hence are indistinguishable. The lengths of the lines depicted in this type of graph do not represent any values.

Figure 1 is a graph (a so-called network) that indicates the coexpression relationships between the characterized and candidate GSL biosynthetic genes, calculated using a whole AtGenExpress dataset. Among the 35 query genes (see figure legend), the pairs of coexpressed genes (threshold PCC > 0.5) have been connected by lines. The graph represents two partially overlapping modules. The larger and smaller modules consist mainly of Met- and Trp-derived GSL genes, respectively. The genes specifically involved in Met-derived GSL biosynthesis are not connected directly with those specifically involved in Trp-derived GSL biosynthesis, and vice versa. SUR1 and UGT74B1, the genes involved in both Met- and Trp-derived GSL biosynthesis (Mikkelsen et al. 2004; Douglas Grubb et al. 2004), are in the boundary region of the two modules. It has been reported that the preferred substrates of the AtSOT17 product are Met- and Phe-derived GSLs (Klein et al. 2006; Piotrowski et al. 2004). Although the graph structure depends on the dataset and the measure of similarity used, it may suggest the functional relationship of the genes. Myb51 and Myb122, transcriptional regulators of Trp-derived GSL biosynthetic genes (Gigolashvili et al. 2007a), were not connected to any genes in this analysis (Fig. 1). However, Myb51 and Myb122 may form a condition-dependent network that can be drawn on the basis of a calculation using a subset of the dataset such as stress-series data, because at least Myb51 exhibits an expression pattern different from that of Myb34 with regard to tissue specificity and response to mechanical stimuli (Gigolashvili et al. 2007a).
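How such a graph is derived from pairwise correlations can be sketched as follows. This is a minimal Python illustration, not the procedure implemented in PRIMe or BioLayoutJava; the function name `build_network`, the gene labels, and the PCC values are hypothetical and do not reproduce the data behind Fig. 1. Only the thresholding rule (connect a pair when PCC > 0.5) is taken from the text.

```python
def build_network(pcc, genes, threshold=0.5):
    """Connect two genes by an undirected edge when their PCC exceeds the threshold."""
    neighbors = {gene: set() for gene in genes}
    for (gene_a, gene_b), r in pcc.items():
        if r > threshold:
            neighbors[gene_a].add(gene_b)
            neighbors[gene_b].add(gene_a)
    return neighbors

# Hypothetical labels: two pathway-specific groups, one shared gene, one unconnected gene.
genes = ["met1", "met2", "shared", "trp1", "trp2", "lone"]

# Hypothetical pairwise PCC values; pairs not listed are treated as below threshold.
pcc = {
    ("met1", "met2"): 0.82,
    ("met1", "shared"): 0.61,
    ("met2", "shared"): 0.55,
    ("shared", "trp1"): 0.58,
    ("trp1", "trp2"): 0.74,
    ("met1", "trp1"): 0.31,
    ("met2", "trp2"): 0.12,
    ("lone", "met1"): 0.20,
}

network = build_network(pcc, genes)

# Genes with no edge above the threshold remain isolated in the graph.
isolated = sorted(gene for gene, nbrs in network.items() if not nbrs)

for gene in genes:
    print(gene, sorted(network[gene]))
```

In this toy example the gene labelled `shared` links the two groups without any direct edge between group-specific genes, and `lone` stays unconnected. Raising or lowering the threshold, or recalculating the PCC on a different dataset, changes which edges appear.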

Coexpression analysis can be applied to non-model Brassicaceae plants by analyzing their transcript profiles using comprehensive techniques such as cDNA-amplified fragment length polymorphism. In such a study, only a few previously-characterized GSL biosynthetic genes are expected, and hence, parallel analysis of their metabolic profile will help predict candidate genes involved in GSL biosynthesis. Integrated analysis of the transcriptome and the metabolome has led to the elucidation of functions of various other genes in many non-model plants (Saito et al. 2008).

Conclusions and perspectives

As described in this review, coexpression analysis has become an easy-to-use tool for functional genomics studies of Arabidopsis. There is certainly a possibility of selecting false positives as candidates, a drawback shared with other genome-wide large-scale analyses. To overcome this problem, novel algorithms for coexpression analysis have been reported in a number of bioinformatics articles, and these algorithms have been validated by statistical analysis. However, large-scale analyses only provide clues that help in forming a hypothesis. Hence, biologists who predict gene function by coexpression analysis should confirm the predicted function by performing wet lab experiments, regardless of the algorithm used.

In our studies, we identified candidate genes on the basis of coexpression relationships, and then selected some genes for further analysis from among the candidate genes on the basis of functional annotation. If a gene that is coexpressed with known GSL biosynthetic genes has a functional annotation, which is not expected on the basis of a priori knowledge of the GSL metabolic pathway, this gene may not be selected for further analysis since there may be a risk of false-positive results due to a coexpression relationship without any functional relationship. However, such a gene might be a novel, unexpected component of GSL metabolic pathway. I believe that new insights into a biological process can be provided by a non-targeted approach that is independent of a priori biological knowledge. An interesting study has recently been reported by Horan et al. (2008), in which 1,541 genes encoding proteins of unknown function were systematically associated with functional annotations of tightly coexpressed genes coding for proteins of known function. This type of genome-wide non-targeted approach will lead to the formation of a novel, data-driven hypothesis. In future, we should utilize large-scale biological methods for understanding a biological process completely, while taking into consideration the drawbacks of the methods (Aoki et al. 2007; Saito et al. 2008).