It has been established that co-regulated genes exhibit similar expression patterns as a norm and not as an exception [1]. Microarray allows sensitive, detection of small differences in transcript abundance [2], therefore it is utilized extensively to study co-regulation of genes. The gene expression imprinted in the microarray is the manifestation of the pathway activity undergone by the organism, and every gene performs its obligatory function in various pathways [3]. The genes do not co-express with the same set of genes all the time, but under various conditions will be expressed with different sets of genes termed as conditional coregulation [3, 4]. Each gene is estimated to interact with four to eight other genes and associated with 10 biological functions [5]. The DNA microarray therefore assists in measuring the difference in transcriptional activity by comparing their mRNA levels under different experimental conditions like developmental stages, stress, or osmotic shock [6]. Various approaches exist for interpretation of relative gene expression. One of the basic strategies is to set the expression level to three states, i.e. underexpression, baseline and overexpression using a fold change cutoff like two times fold change against the control [7]. Other strategies include setting thresholds representing significant changes between subsequent timepoints and storing in bins [8], an adaptive procedure that takes gene-specific variation into consideration to derive the gene expression in different states [7]. Further, microarray data consists of variations generally termed as interesting variations, which are biologically important, and are superimposed by "obscuring variation" or systematic variation [9].

We have developed an alternate algorithm based on graph theory that takes a discretized expression matrix as input and emits output string. Not all genes are expressed all the time and are required at different developmental and maturation phases of plant [10]. We therefore categorized each differentially expressed gene as "ON/OFF" from every experiment against its control in three states: overexpression (+1), underexpression (-1) and no expression (0). We utilized the regularized t-test to derive differentially expressed genes to overcome low replicates and extract meaningful biological variances [11]. We derived a discretized expression matrix for all the genes for various time series experimental conditions from differentially expressed genes. The similarity between two (or more) discretized vectors can be calculated through various distance measures such as number of positions the vector has similar values excluding 0 [2]. Our implementation results in the output string that is the representation of the pattern the genes have undergone during transition from one state to another. Any gene can be queried against all the other genes by matching stored output strings in the database, and a 'score' is generated representing the similarity index between any given set of genes.

Affymetrix™ provides a calculation of absolute signal values for each gene for a given set of experiments, which can be viewed as points in n-dimensional space (where n is the number of experiments) [12]. Similarity between point representations of genes can be calculated using various metrics like Pearson correlation or Euclidean distance using various clustering algorithms [13]. Most databases (e.g. Comprehensive Systems-Biology Database (CSB.DB) [14] and the Arabidopsis Microarray Database and Analysis Toolbox (GENEVESTIGATOR) [15] help to seek shared biological roles based on correlation. These databases utilize existing methods like Pearson correlation and Spearman rank correlation to measure coregulation. These clustering methods provide reliable information for performing internal comparison of experimental conditions. But the usage of these for cross comparisons of various groups/clusters obtained through clustering various experimental conditions tends to obscure information important for identification of coregulated genes [8]. This implies that the clustering methods, which have shortcomings in identifying all the relationships existing in microarray expressions and different algorithms, will identify unique relationships thereby limiting them to the constraints of conditional coregulation [8, 5]. Fuzzy k-means clustering implementation recognizes the concept of conditional coregulation and assigns 'membership' to each gene belonging to various clusters/groups [3].

We tried to incorporate the conditional coregulation for calculating relationships among genes by mining transcriptional data consisting of experiments in various conditions from AtGenExpress. For each condition in a temporal microarray experiment, the state of the gene at a particular time point is defined by alphabet according to the algorithm. The individual output string comprising of concatenated alphabets is stored on a per gene, per condition basis, meaning the length of the string equal to the number of time points for a particular experimental condition. Therefore, the number of alphabets and the complexity of a generated string for the given temporal experiment increases with an increase in number of timepoints. This implies that the algorithm performs better and produces more reliable results as the number of timepoints increase in the experiment. We tried to further reduce random matching of similar alphabets (at the same timepoint) by introducing an option to award extra increments if the preceding alphabet is matched.


Microarray datasets

The dataset consists of a total of 18 groups of experiments which were already preprocessed in MAS5.0 taken from aboveground samples of the abiotic stress series of microarray experiments conducted by AtGenExpress

Test data

The data consists of two sets of enzymes, i.e. nucleotide sugar interconversion enzymes [16] and glycosyltransferases, hypothesized to be involved in cell wall biosynthesis which consists of 493 genes for analyzing results.

The overview of the entire analysis and database construction is shown in Figure 1.

Figure 1
figure 1

Overview of the database construction. E1, E2, E3, En is the representation of the time points in temporal experiments. G1, G2, G3, Gn is the representation of the gene. Confidence values and fold change for relative gene expression generated from Cyber-T.

Setting up database

Storage of preprocessed data

We utilized Cyber-T to measure the confidence value associated with fold change for each gene. The Cyber-T analysis (Control versus Experiment) uses Bayesian probabilistic framework to calculate a background variance for each of the genes under analysis. By combining the empirical variance with the local background variance associated with neighboring genes, it calculates the confidence associated with the differential expression [17, 11]. This is supposed to compensate for the limited number of replicates by giving proper estimates of variances (which might be biologically relevant).

Generation of string

The input matrices f1f2f3fn for fold change and p1p2p3pn for confidence value are generated from the analysis of Cyber-T, for each Control + Experiment. The fold change is the ratio of expression value between experiment and control, and negative fold change indicates lower expression in the experiment and vice versa. Discretization is utilized to reduce the variables in sample space resembling "lossy compression method for the data" [8]. Relative gene expression of every gene in each column (control plus experiment) is discretized into three distinct levels of differential expression – overexpressed, underexpressed and not expressed. These levels of differential expression are validated by the following rules:

  1. i.

    An upregulated gene with a fold change (ExpressionExperiment/ExpressionControl) greater than +1 and p value less than the threshold is deemed as overexpressed and represented as positive state.

  2. ii.

    A downregulated gene with a fold change (ExpressionControl/ExpressionExperiment) less than -1 and p value less than the threshold is categorized as underexpressed and represented as negative state.

  3. iii.

    A gene with p value above the threshold is categorized as not expressed and represented as neutral state (0).

The algorithm utilizes an assumption that genes in a time series do not occupy discrete, independent expression devoid of any relationship with their previous experiments in a temporal series. In fact, the present state of each gene is dependent on its immediate preceding state or vector for magnitude and therefore generates an output string utilizing the graph theory [18] to depict a pattern. The algorithm uses these three levels of discretization to set the present state of each gene based on present and previous states.

The implementation of the algorithm is discussed as pseudo code below. The initial vector (state) of the gene is set by the above rules of discretization, which states the direction is positive or negative based on the level of expression and magnitude, is equal to the number of timepoint experiments as in line 6. Lines 10 to 21 of the algorithm describe the transition of genes in a timepoint experiment with resulting variable 'out'. The out variable in the algorithm is based on both the direction and magnitude of its previous state and present state as in lines 12, 14, 17 and 19. Each 'out' variable can be considered as an alphabet representing the state of the gene and stored in the matrix on a per gene, per experiment basis.

Implementation of the algorithm

a) i is the number of the gene in the row

b) j is the number of the column (experiment)

c) |len| is the initial output assigned to the output string equal to number of time points for time series experiments

d) state is level of differential expression, i.e. overexpression, underexpression or no expression for a gene at a timepoint

e) out is the output alphabet for the transition of each gene i in jth column

f) PV(i, j) is the matrix of confidence value(Bayes P) of i gene in jth column

g) FC(i, j) is the matrix of Fold Change of i gene in jth column

h) DM(i, j) is the matrix storing the output alphabet (out) as string

i) threshold is the user defined Bayes P value cutoff which has been taken as 5% arbitrarily

  1. 1.

    iterate for each gene: 0 to i {

  2. 2.

    iterate for each time series experiment: 0 to j {

  3. 3.

    if (PV(i, j) <= threshold) {

  4. 4.

    if (previous state !exists) { # true for first column

  5. 5.

    if (present state = (overexpressed | underexpressed)) {

  6. 6.

    out = ± len; DM(i, j) = out; } #+ overexpression, - underexpression

  7. 7.

    else if (present state = (not expressed)) {

  8. 8.

    out = 0; DM(i, j) = out; } }

  9. 9.

    else if (previous state exists) {

  10. 10.

    if (present state = previous state) {

  11. 11.

    if (FC(i, j) > FC(i, j-1) {

  12. 12.

    increment out; DM(i, j) = ± out; } #direction (±) based on present state

  13. 13.

    else {

  14. 14.

    decrement out; DM(i, j) = ± out; } }

  15. 15.

    else if (present state != previous state) {

  16. 16.

    if (FC(i, j) > FC(i, j-1) {

  17. 17.

    increment out; DM(i, j) = +out; }

  18. 18.

    else {

  19. 19.

    decrement out; DM(i, j) = -out; } }

  20. 20.

    } # 9 ends

  21. 21.

    } # 3 ends

  22. 22.

    else {

  23. 23.

    out = 0; DM(i, j) = out; }

  24. 24.

    } } # 1 ends

Note: # is the comment

The total number of alphabets generated for a time series n is 3+4(n-1), and the number of strings (complexity) resultant from these alphabets for a time series n greater than 1 is 3(n-1)+2(2n-1). For each individual experiment, a seperate string is generated with length equal to the timepoint of each experiment. These input values produce sequences of transition of a gene resulting in an output string o1o2o3...on representing the expressional changes a gene has undergone that are stored in the database.

Storage of generated string

The strings generated from the algorithm, for the time series experiment under different experimental conditions are stored in a RDBMS database. Each individual string is generated for a particular gene in a particular time series experiment and is stored separately.

Generating a score through querying

The score is generated by matching strings of query gene with all the other genes in the database. For each experiment, the string of query gene is compared with strings of other genes, per experiment. The comparison is performed for all the experiments and aggregate score is computed. The string matching comprises of two determinants as follows:


An alphabet of query string matching with alphabet of other genes at same temporal point of a time series experiment is awarded a unit.


Discretization leads to generalization of entire data. To overcome this aspect, any random similarity of the output alphabet is checked by further awarding an extra unit to any matching output alphabet having a preceding match. Weight provides an additional thrust to seek for genes undergoing similar patterns by separating them from genes with random similarity.

Score = 1 E 1 T G n MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaee4yamMaee4Ba8MaeeOCaiNaeeyzauMaeyypa0ZaaabCaeaadaaeWbqaaiabdEeahjabd6gaUbWcbaGaeGymaedabaGaemivaqfaniabggHiLdaaleaacqaIXaqmaeaacqWGfbqra0GaeyyeIuoaaaa@3EE8@

T: is the number of time series experiments

E: is the number of experiments conducted in various conditions

Gn: is the value calculated in relation to a gene against a query gene using match and weight factors.

The web interface ASIDB at present defaults match and weight as a unit. The standalone java interface with a local database has a more dynamic interface allowing the user to enter these variables.

Results and discussion

Genes represented in the ATH1-1250 chip can be queried based on Affymetrix™ Probeid or AGICode [1]. We utilised carbohydrate biosynthesis genes as a test case and at present, the database (hosted at refers to 493 genes annotated as glycosyltransferases and nucleotide sugar interconversion enzymes for Arabidopsis thaliana. The 'scores' generated by querying the database are retrieved in descending order of 'score' which we refer as 'rank' and used synonymously with 'score' for discussion. We have classified the results into two sections:

A. Comparison of results with correlation coefficients derived from CSB.DB

Single gene query

We queried QUA1 (At3g25140) from our database and listed the top 15 genes (Table 1). We merged our results with CSB.DB by utilizing parametric Pearson's linear product moment correlation coefficient and the output using positive co-responding genes with probability < 0.05 by performing single gene query for QUA1 for atge0200 dataset. We found that many genes which generate high scores in ASIDB also have high confidence and high pearson in CSB.DB. Also, our results do not exclude genes unlike CSB.DB for the same dataset giving a more comprehensive picture for better analysis.

Table 1 Comparison of ASIDB score with CSB.DB for single gene. Descending order of score from ASIDB compared with the pearson correlation and the p value from CSB.DB for QUA1 for atge0200

Subgroup (five UGE isoforms) comparison

Performing co-response analysis for UGE isoforms by comparing fluctuation of transcript abundance between CSB.DB (Spearman correlation) and ASIDB revealed UGE1 and -3 behaved differently than UGE2, -4, -5 [19].

Random comparison of genes from CSB.DB

We randomly selected around 10% of the genes (~52 genes) from our database and compared them with Pearson correlation generated from CSB.DB. The correlation coefficient measures the relationship between two variables ranging between +1 and -1 where +0.7 to +1.0 is considered strong positive association, +0.3 to +0.7 as weak positive association and +0.3 to 0 as no association [20]. Comparing 1326 correlated pairs generated from CSB.DB and ASIDB, we found that at high pearson correlation, i.e. .8, out of 14 correlated pairs generated by CSB.DB, 10 were identified by ASIDB at rank cutoff of 10 genes (top10 genes) and 13 correlated pairs at rank cutoff of 20 (Figure 2). Similarly for pearson correlation cutoff at .7, we were able to identify half (21) of the correlated pairs at rank cutoff of 10 and 30 at rank cutoff of 20. We observe the trend that ASIDB identifies most of the correlated pairs at high pearson value and identification reduces with declining pearson for the same genes. ASIDB also generates a higher number of relationships that are not identified by pearson correlation and might hold biological importance. The comparison file can be downloaded from

Figure 2
figure 2

Comparison of pearson correlation with ASIDB score. Graph depicts the number of correlated pairs at different pearson cutoff and the number of those same correlated pairs identified by ASIDB at different rank cutoff.

B. Biological validity of the results

We utilized genes hypothesized to be involved in cell wall biosynthesis (glcosyltranferases and nucleotide interconversion enzyme) as test data. Out of these, around 420 genes that are present in carbohydrate active enzymes are glycosyltransferases while the remaining genes are linked with nucleotide sugar interconversion enzymes. The glycosyltransferases are specific for both the donor sugar nucleotide and the acceptor molecule, which might be another sugar or aglycones. One glycosyltransferase usually catalzses the formation of only one glycosidic linkage. As a result, though many glycosyltransferases catalyze chemically similar reactions, they display remarkable diversity in their donor, acceptor and product specificity and thereby generate a potentially infinite number of glycoconjugates, oligosaccharides and polysaccharides [21]. The high specificity of glycosyltransferases results in difficulty in assessing the biochemical function of the enzymes encoded by these genes [22]. Furthermore, the application of clustering or any other technique has not yielded a precise donor, acceptor or product specificity of glycosyltransferases [23].

The activated sugars, known as nucleotide sugars, form the substrates for nucleotide sugar interconversion enzymes dedicated to the generation of new sugar species [24]. The biochemical and molecular aspects of sugar nucleotide interconversion enzymes are fairly well understood [25]. Also, nucleotide sugars are the substrates for glycosyltransferases that catalyze the polymerization of monosaccharides into glycosides, oligosaccharides, glycolipids, glycogen, starch, cellulose and a large variety of extracellular complex carbohydrates [24, 26]. An improved description of the link between nucleotide sugar interconversion genes and glycosyltransferases might help in understanding the control of cell wall biosynthesis [16].

Relationship of UGE with glycosyltransferase

On querying five UGE isoforms (Table 2), we subgrouped them into two groups based on their relationship with each other and glcosyltransferases MUR3 (At2g20370), GOLS2 (At1g56600) and ATGT18 (At5g62220).

Table 2 Comparison of ASIDB ranking with pearson correlation for UGE homologs. The ranking of genes depicting relationship between UGE homologs and glycosyltransferases with pearson correlation is in parenthesis.

i) UGE1, -3

UGE1 and UGE3 co-regulated with trehalose 6-phosphate synthases [27] indicating their catabolic role [19, 28].

ii) UGE2, -4, -5

We observed that UGE2,-4 and -5 co-regulated with known galactosyltransferases like MUR3, GolS2, ATGT18 indicating biosynthesis role [19, 28].

Relationship among cellulose synthases

The Arabidopsis genome encodes 10 isoforms of the cellulose synthase catalytic subunit – CESA [29] and we have broken down these groups of genes into two sets based on the rank derived from Table 3.

Table 3 Comparison of ASIDB ranking with pearson correlation for CESA homologs. The ranking of genes depicting relationship between CESA subgroup derived from the score with the pearson correlation in parenthesis.
  1. i)

    AtCesa1, AtCesa3, AtCesa6, AtCesa5, AtCesa2

Cesa1,-3,-6 are responsible for cellulose production during primary cell wall development in various tissues [30]. On querying ASIDB, we found that not only CESA2 which have been earlier found to be highly co-regulated with CESA1,-3,-6 came at higher rank [30], but CESA5 also shows co-regulation with this set of genes indicating its role in deposition of cellulose in the primary cell wall.

ii) AtCesa4, AtCesa 7, AtCesa 8

The irx1, irx3 and irx5 mutants are the members of CesA gene family and AtCesA4 (IRX5), AtCesA7 (IRX3) and AtCesA8 (IRX1) take part in the synthesis of the complex that is required to synthesize cellulose in the secondary cell wall [30, 31]. Other genes, which have shown significant relationship with these genes and have also been mentioned in previous work, are At5g54690, At2g37090 [30].


We tried to incorporate the conditional co-regulation for calculating the relationship among genes by addressing the limitation of clustering methods in which genes that are expressed in most of the measurements are highlighted while genes that are co-expressed in the subset of conditions are omitted. Our implementation leads to the inclusion of every gene regardless of its expression values thereby highlighting an important relationship absent in other contemporary databases. We found that our approach not only recognizes most of the randomly selected correlated pairs from pearson correlation in CSB.DB, but also generates new relationships. This has resulted in highlighting subtle relationships for example UGE isoforms [19, 28].

Taking carbohydrate metabolism as a test case, we observed that those genes known to be involved in similar functions and pathways generate a high 'score' with the queried gene. The 'score' is therefore computed for all the genes present in the database and reflects the magnitude of co-regulation existing with the queried gene. The higher the intersection of expressional patterns under varying conditions, the higher the score generated for the gene calculated by the algorithm (Figure 3) and ranked in descending order of 'score'. Interpretation of the results is done by considering genes as nodes linked with each other through the edges [32]. Edges represent interactions between the connected genes, with higher rank/score depicting higher functional similarity (Figure 4). We can utilize the interaction of edges and nodes for the construction of the networks.

Figure 3
figure 3

Generation of scores from different experimental conditions. The generation of a score resulting from the overlapping of similar expressional patterns under varying conditions with the query gene.

Figure 4
figure 4

Depiction of network relationship between genes. The bi-directional relationship generated between gene A and gene B connected by edges. The edges represent score/rank by performing querying for each gene.

After implementing the carbohydrate biosynthesis cycle, we intend to incorporate other cycles like amino acid, nucleotide, and lipid. We have recently added ARACYC annotation to our database.