1 Introduction

Gene expression data analysis enables biologists to narrow the space of probable solutions when performing expensive genomic experiments. This is possible due to the revolutionary microarray technology that monitors expression levels of thousands of genes in a single experiment. A microarray chip is a solid surface of glass, silicon or plastic substrate with probes of DNA or RNA molecules arranged at inter-sectional points in a grid structure. A typical microarray experiment involves preparation of microarray chip, isolation of DNA or RNA molecules from a cell sample, and promotion of hybridization by allowing the extracted DNA or RNA molecules to come in contact with probes. Before hybridization the extracted DNA or RNA solution is dyed with color so that color intensities at the probe spots represent amount of hybridization. Finally, image processing techniques are applied to determine color intensities at different probe spots that represent expression of genes associated with the probes.Analysis of expression data has many applications. Some notable applications are pharmagenomic research and drug discovery, infectious and genetic disease and cancer diagnostics, forensic genetic identification, proteomics and cellular analysis etc (Heller 2002). Data-mining techniques such as classification, clustering, biclustering and triclustering (Jiang et al. 2004; Ahmed et al. 2011a, b; Das et al. 2010; Mahanta et al. 2011) are widely used in gene expression data analysis tasks such as prediction of unknown gene functions, inference of regulatory relationships among genes and disease diagnosis. A number of data-mining techniques have also been proposed specifically for gene expression data analysis. A number of preprocessing tasks such as handling missing values, normalization and feature extraction (Donders et al. 2006; Schadt et al. 2001; Van Hulse et al. 2012) are performed on gene expression data to make it more effective for different applications.

Gene expression data is usually analyzed to find groups of genes associated with the same biological functions. However, the data produced by microarray experiments hold ample resource for evidence of various biological regulatory relationship among genes as well. Therefore, reverse engineering this expression data to discover these regulatory activities can be very useful for the biologist before conducting in vivo experiments. This reverse engineering task requires representation of the regulatory framework in a computational model. A number of such attempts have been recorded in the literature of gene expression data analysis (De Jong et al. 2002; Zhou et al. 2012). Early researchers tried to model gene regulatory framework considering only pairwise individual interaction among genes by determining pairwise correlations between pairs of genes. Such correlations can be graphically represented in a co-expression network. Gene co-expression networks illustrate associations among genes in terms of their expression similarity and a network-level view of the similarity among a set of genes. Thus, gene co-expression networks provide a good approximation to the complicated web of gene functional associations (Ruan et al. 2010). In co-expression networks, two genes are connected by an undirected edge if their activities have significant association (Lee et al. 2004) computed using gene expression measurements such as Pearson correlation, Spearman correlation and mutual information. A primary objective in designing a co-expression network is to extract highly connected regions from the constructed co-expression network. These regions are often termed network modules. In biological terms, a network module may represent a functional category or a set of co-regulated genes.

In this work, we propose a method to construct co-expression networks reflecting correlations among genes over subspaces of samples and to extract network modules. We also introduce a measure to determine connectivity among genes over subspaces of samples.

1.1 Organization of the paper

The rest of the paper is organized as follows: Section 2 presents related work. Motivation and contributions of the paper are given in Sects. 3 and 4, respectively. The proposed method is described in Sect. 5. Experimental results are provided in Sect. 6. Section 7 presents concluding remarks.

2 Related work

A number of techniques have been proposed for construction of co-expression network in gene expression data analysis. A typical co-expression construction technique accepts a gene expression dataset, computes pairwise correlation among gene expressions and constructs either a weighted co-expression network (where weights are normally the correlation score between the pair of genes) or an unweighted network (where an edge is placed between a pair of nodes if the corresponding gene expressions are correlated with a score more than a threshold). Most of these techniques also provide methods to extract dense network modules which may represent a biologically significant groups of genes. The measures which are frequently used for evaluating correlation in co-expression network analysis are Pearson correlation coefficient, Spearman correlation coefficient and Mutual information. Butte et al. (2000) use Pearson correlation coefficient to place edges among genes in a co-expression network. Spearman correlation coefficient is used as a gene expression similarity measure to construct co-expression networks in D’Haeseleer et al. (1998). Butte and Kohane (2000) and Steuer et al. (2002) reports the use of mutual information to find similarly expressed gene pairs in such networks. While some techniques operate on the adjacency matrices of networks in order to partition network nodes into groups (Lee et al. 2004; Zhu et al. 2005), other techniques rely on special purpose algorithms for identifying subnetworks with certain properties (Stuart et al. 2003). The method called Qcut (Ruan et al. 2010) constructs co-expression networks using rank-based Pearson correlation between a node and rest of the nodes. It also optimizes modularity, which is an objective function defined as the difference between percentage of intra-community edges and random expectation. Table 1 lists different co-expression network construction and module extraction techniques.

Table 1 A comparison of some existing co-expression network construction techniques

3 Motivation

Our literature review finds that the existing techniques of gene co-expression network construction consider all samples when determining correlations among genes. But, it is well known in molecular biology that only a small subset of the genes participates in any cellular process and that any cellular process takes place only in a subset of the samples (Jiang et al. 2004). The fact can be easily understood if we look at the origin of the gene expression data. Gene expression data is the outcome of microarray experiments that measure expression levels of genes in a number of samples. In applications such as function prediction of genes and disease diagnosis, we are not only interested in evaluating correlation between a pair of genes over whole set of samples because a biological process may occur only in a subset of samples. To elucidate this statement let us consider the two expression patterns corresponding to two genes g1 and g2 as shown in Fig. 1. The figure plots expression patterns of g1 and g2 (solid lines) over set of samples c1, c2, c3, c4, c5, c6, c7 and c8. Now let us consider expression patterns of the genes over subset of samples c1, c3, c5, c7 and c8 (dotted line). Correlations over subspace of samples are normally ignored by most co-expression network construction techniques, and evaluate correlation over whole sets of samples. But as per requirements of certain applications, these patterns are of prime importance. Therefore, there is a need to incorporate subspace correlations rather than full-space correlations when constructing co-expression network for more biologically relevant results.

Fig. 1
figure 1

Two gene expression patterns with correlation over subset of samples {c1, c3, c5, c7, c8}

Most correlation measures are sensitive to the order of samples in the gene expression matrix. But if we analyze the gene sample microarray data, we see that there is no explicit ordering of samples. So a correlation measure should be robust to the problem of order sensitivity. This problem can be avoided by operating on each possible pair of samples instead of operating only on consecutive pairs of samples in the gene expression data. Another problem in the process of determining the correlation between two expression patterns is the presence of noisy data. A noisy expression value of a gene against a sample should not affect the correlation across the rest of the samples.

4 Contributions

Our work is aimed to design a framework for construction of co-expression networks that takes into account correlations among genes over subsets of samples. The work also proposes a connectivity measure that can be used with subspaces among genes when constructing gene co-expression networks. This paper makes the following contributions to gene expression data analysis.

  1. 1.

    We propose a representation for co-expression networks that can handle correlations among genes over subsets of samples. We use the term subspace co-expression network to refer to such a network. A subspace co-expression network can be graphically viewed as a multigraph. Multiple edges between two nodes representing a pair of genes in the multigraph correspond to different subsets of samples across which the genes are correlated.

  2. 2.

    We propose a TOM (Ravasz et al. 2002)-based connectivity measure named TSOM (Topological Subspace Overlap Metric) which can be used to determine connectivity between a pair of genes in the proposed subspace co-expression network.

  3. 3.

    We also introduce an unsupervised module extraction technique from the subspace co-expression network based on (1) TSOM measure and (2) k-means clustering algorithm.

5 Method

When we consider a set of samples, a pair of genes can be correlated in more than one subset of samples. Considering this fact, to represent a gene co-expression network, i.e. a graph G(VE), where V is the set of vertices or genes and E is the set of edges, and between a pair of vertices {V i V j }, there can be multiple edges corresponding to multiple subspaces of samples. we use multigraph to represent gene co-expression network where multiple edges between a pair of genes or nodes correspond to different subspaces of samples. We consider two genes to be correlated for a pair of samples if arctan of the angles formed by the expression values for the sample pair do not differ by more than a threshold. This threshold is computed as a function of standard deviation of arctan of expression values for all the genes over the pair of samples. Per our notion of correlation, a pair of genes can be correlated over different subsets of samples. For example the gene expression pattern in Fig. 2 can be correlated with another gene separately in two subsets of samples I, II, IV, VI, VII and I, III, V, VI, VII. Figure 2 plots patterns formed by expression values of the gene over these two subsets of samples from the set of possible subsets of samples. This point onwards, we will use the term ordered sample subset to refer to such a subset of samples.

Fig. 2
figure 2

Two ordered sample subsets of a gene expression pattern

The symbols provided in Table 2 and the definitions given next are used to describe the proposed method.

Table 2 Symbolic representations

Definition 1

A sample subspace for a gene expression dataset is defined as a subset of samples or conditions associated with the gene expression dataset. We use the term subspace to refer the possible sample subspaces in a gene expression dataset.

Definition 2

A subspace co-expression network is a co-expression network with the ability to convey information about the subset of samples over which genes or nodes are correlated. A subspace co-expression network can be represented by a multigraph G = (V, E) where V is a set of vertices that represent genes and E is a set consisting of unordered pairs of vertices each of which is associated with a subset of samples.

Definition 3

An adaptively discretized matrix D is a discretized form of expression matrix G of order m × (n × (n − 1)), where m is the number of genes in G and n is the number of samples in G such that D p,q i and D p,q j are assigned the same discrete value if |arctan(G(ip) − G(iq)) − arctan(G(jp) − G(jq))| ≤ λ × β(pq), where λ is a user defined threshold named deviation ratio.

Definition 4

Two genes g i and g j are said to be correlated over an ordered sample subset \(\{s_1,s_2,\ldots,s_n\}\) if \(D_i^{s_k,s_{k-1}}=D_j^{s_k,s_{k-1}},\,k=2,3,\ldots,n\)

A block diagram of our method is presented in Fig. 3. The expression matrix is discretized considering angles formed by expression values of genes for each pair of samples. These discretized expression data are used to form a multigraph that represents the subspace co-expression network. Then pairwise TSOM values among genes are computed from the multigraph. These connectivity scores are stored in the form of a matrix. Finally, K-means (Hartigan and Wong 1979) algorithm is employed to extract network modules using the connectivity matrix.

Fig. 3
figure 3

Proposed methods of co-expression network construction and module extraction

5.1 Adaptive discretization

In this work, we adaptively discretize the expression matrix of order m × n to a discrete matrix of order \(m\times\frac{n\times(n-1)}{2}\). In the discretization process, for each pair of conditions, the angle formed by the expression values corresponding to a pair of conditions is computed as arctan of the difference of the expression values, as presented in Fig. 4. For two genes g i and g j , arctan value for pth and qth conditions is computed as,

$$ \theta = arctan(G(i,p)-G(i,q)) $$
Fig. 4
figure 4

Computing angle formed by a pair of expression values of a gene corresponding to a pair of conditions

These arctan values are then discretized based on a threshold by assigning all the values close by an amount less than this threshold the same discrete value. This threshold is computed as a function of standard deviation of the values. The discretization process is presented next in details.

Algorithm Adap-discretize()

INPUT: Expression matrix, G, Deviation Ratio, λ

OUTPUT: Discretized matrix, D

Steps:

For each pair of conditions c i and c j in G

  • Convert each pair of expression values corresponding to two conditions of genes to its arctan form.

  • Sort these arctan values.

  • Group the sorted values by placing the values that are close to each other with a distance <λ × β in the same group, where β is the standard deviation.

  • Assign all the members of each group a unique alphabet.

5.2 Subspace co-expression network construction

The subspace co-expression network is an enhancement of the traditional co-expression network. This network can trace the subsets of samples along which a pair of genes are correlated. Nodes in this network represent genes in the expression data. We use the terms nodes and genes interchangeably in this paper. Each edge between a pair of nodes (genes) is labeled with a subset of samples. Per angular similarity that is used to define correlation between two genes, there can be multiple subsets under which a pair of genes are correlated. Unlike traditional co-expression networks, where there can be only one edge between a pair of genes or nodes, subspace co-expression network may contain more than one edge between a pair of genes or nodes as shown in Fig. 5a, b. An edge represents a subset of samples over which the vertices are correlated. This multigraph is constructed as follows:

  • For each pair of genes g i and g j ,

  • Compare corresponding entries in D to find the pairs of samples under which g i and g j have the same discrete value.

  • Process the pairs to derive the ordered sample subsets. For example, if (1, 5), (3, 5) and (5, 6) are the similar sample pairs, the derived ordered sample subsets are 1, 5, 6 and 3, 5 ,6.

  • For each ordered sample subset, draw an edge between the nodes corresponding to g i and g j with the sample subset as its label.

Fig. 5
figure 5

Traditional versus subspace co-expression network

5.3 Topological subspace overlap metric

The topological overlap metric (Ravasz et al. 2002) is a similarity measure, which is useful in biological networks. This measure is generally defined for weighted and unweighted networks where there is a single edge between a pair of nodes. For unweighted networks (i.e., a ij  = 1 or = 0), the topological overlap matrix is defined by (Ravasz et al. 2002),

$$ w_{ij}=\frac{l_{ij}+a_{ij}}{min(k_i,k_j)+1-a_{ij}} $$
(1)

where l ij  = ∑ u a iu a ij is the number of common neighbors and k i  = ∑ u a iu is the node connectivity.

TOM can be used to determine connectivity in a network where there can be at most one edge between a pair of nodes. So this measure cannot be applied on a subspace co-expression network which may have multiple edges between a pair of nodes corresponding to different subsets of samples over which genes or nodes are correlated. We propose a TOM-based connectivity measure named TSOM (Topological Subspace Overlap Metric) which can be applied on a subspace co-expression network .

The Topological Subspace Overlap Metric between ith and jth nodes in a network with n nodes is defined as,

$$ TSOM_{ij} =0.5\times \eta + 0.5\times \omega $$
(2)

where

Neighbourhood similarity, \( \eta = \frac{{\sum\nolimits_{{u = 1}}^{n} {| {C_{{iu}}^{p} \cap C_{{uj}}^{q} } |} }}{{\sum\nolimits_{{u = 1}}^{n} {min( {| {C_{{iu}}^{p} } |,| {C_{{uj}}^{q} } |} )} }},\,1 \le p \le n_{{iu}} ,\,1 \le q \le n_{{uj}} \) such that |C p iu  ∩ C q uj | is maximum and |C p iu C q uj | > 2,

Direct connectivity, \( \omega = \frac{{| {C_{{ij}}^{r} } |}}{s},\,1 \le r \le n_{{ij}} \) such that |C r ij | is maximum,

n ij is the total number of edges between ith and jth nodes, and

s is the number of samples in G.

When computing connectivity between a pair of nodes, we consider two components viz., neighbourhood similarity η and direct connectivity ω. Direct connectivity takes into account the edge between nodes corresponding to the genes, g i and g j while neighbourhood similarity takes into account the the nodes which are connected to both the nodes corresponding to genes g i and g j . To avoid biasing, we have assigned equal weights (i.e., 0.5) to both the components η and ω.

5.4 Extraction of network modules

To extract modules from the subspace co-expression network, pairwise TSOM scores of the genes or nodes are computed. Using these connectivity scores as similarity values, the k-means (Hartigan and Wong 1979) clustering algorithm is applied to produce clusters. The k-means clustering algorithm iteratively partitions set of objects into groups of similar objects. The algorithm starts with n initial seeds and assigns the rest of the objects to one of these seeds. After the assignment, centroids of the partial clusters are computed as seeds for the next iteration. The process is repeated until the centroids do not change. The clusters extracted by k-means algorithm are considered extracted network modules. Discovery of these modules involve the following steps.

  • Compute pairwise TSOM scores of the genes to generate a connectivity matrix Con of order m × n, where m is the number of genes and n is the number of samples in G.

  • Subtract each similarity value in Con from 1 to construct distance matrix Dist.

  • Feed Dist is then fed to k-means clustering algorithm to obtain the clusters. These clusters actually represent relatively dense regions in the network and are extracted network modules. To determine the value of k, we tried different possible values of k and choose the one with highest biological significance(i.e. lowest p or Q value) as shown in Fig. 6.

Fig. 6
figure 6

Determination of k value for extraction of network modules

6 Experimental results

We implement the method in MATLAB and test it on three benchmark microarray datasets given in Table 3. The test platform is a Sun workstation with Intel(R) Xenon(R) 3.33 GHz processor and 6 GB memory running Windows XP operating system.

Table 3 Datasets used for evaluation

6.1 Validation

The performance of the algorithm on the publicly available benchmark microarray dataset is measured in terms of p value (Tavazoie et al. 1999) and Q value (Benjamini and Hochberg 1995). Some of the network modules extracted by our method are visually presented in Fig. 7a, b for Dataset 1 and Fig. 8a, b for Dataset 2 and Fig. 9a, b for Dataset 3.

Fig. 7
figure 7

Visualization of different network modules of Dataset 1

Fig. 8
figure 8

Visualization of different network modules of Dataset 2

Fig. 9
figure 9

Visualization of different network modules of Dataset 3

6.1.1 p value

We evaluate biological significance of the extracted network modules using p value (Tavazoie et al. 1999). p value signifies how well the genes in a network module match various GO categories. A low p value for the set of genes indicates that the genes belong to enriched functional categories that are biologically significant. A cumulative hypergeometric distribution is used to compute the p value. For a given GO category, the probability p of getting k or more genes within a cluster of size n, is defined as (Berriz et al. 2003),

$$ p = 1 - \sum\limits_{{i = 0}}^{{k - 1}} {\frac{{\left( {\begin{array}{ll} f \\ i \\ \end{array} } \right)\left( {\begin{array}{ll} {g - f} \\ {n - i} \\ \end{array} } \right)}}{{\left( {\begin{array}{ll} g \\ n \\ \end{array} } \right)}}} $$
(3)

where f and g denote the total number of genes within a category and within the genome, respectively.

To compute p value, we use the Web-based tool called FuncAssociate (Berriz et al. 2003). FuncAssociate uses Molecular Function and Biological Process annotations in Gene Ontology to compute the hyper-geometric functional enrichment score. The enriched functional categories for some of the network modules obtained by our algorithm on the datasets are presented in Tables 4 and 5. The network modules produced by the method for these datasets contain the highly enriched functional categories of growth factor activity, regulation of cell cycle, cell cycle process, cell division, cellular bud neck, cell division, cell cycle with p values of \(4.82 \times 10^{-07},\,2.17\times10^{-06},\, 4.68\times10^{-09},\,1.01\times10^{-06},\,1.85\times10^{-07},6.12\times 10^{-09},8.10 \times 10^{-11}\) respectively, being the highly enriched GO categories. Table 9 presents some of the functional categories which are better or equally detected by our technique as compared to Qcut (Ruan et al. 2010) technique for Dataset 3 in terms of p value.

Table 4 p value of one of the network modules of Dataset 1
Table 5 p value of some of the network modules of Dataset 3

6.1.2 Q value

The Q value (Benjamini and Hochberg 1995) for a set of genes is the proportion of false positives among all genes that are as or more extremely differentially expressed. GeneMANIA (Warde-Farley et al. 2010) reports GO categories and Q values from an FDR (False Discovery Rate) corrected hypergeometric test. Q values are estimated using the Benjamini Hochberg procedure (1995). Different GO categories of the co-expression networks produced by the method are displayed up to a Q value cutoff of 0.1 in Tables 6, 7 and 8. The co-expression network modules produced by the method contain the highly enriched functional catagories of actor binding, negative regulation of neuron apoptosis, regulation of platelet activation, O-acyltransferase activity, carboxy-lyase activity, secondary metabolic process, response to wounding, toxin catabolic process with Q values of \(4.6 \times 10^{-10},\,4.05\times 10^{-6},\, 4.36\times 10^{-6},\,3.95 \times 10^{-6},\, 3.6\times 10^{-7},\,9.03 \times10^{-16},\,2.61\times 10^{-11},\,2.89\times 10^{-10}\), respectively, being the highly enriched GO categories. From the results of p and Q values, we can conclude that our method shows a good enrichment of functional categories and therefore is able to discover modules with a good biological significance. Table 10 presents some of the functional categories which are better or equally detected by our technique as compared to Qcut (Ruan et al. 2010) technique for Dataset 3 in terms of Q value.

Table 6 Q value of some of the network modules of Dataset 1
Table 7 Q value of some of the network modules of Dataset 2
Table 8 Q value of some of the network modules of Dataset 3
Table 9 Comparison of results of Qcut and proposed technique for Dataset 3 in terms of p value
Table 10 Comparison of results of Qcut and proposed technique for Dataset 3 in terms of Q value

7 Conclusions

In this paper, we have proposed a method to construct co-expression network over subspaces of samples. A connectivity measure has also been proposed to evaluate connectivity between a pair of nodes in the proposed network structure. We use k-means clustering to extract network modules from the generated subspace co-expression network. We validate extracted modules from multiple real life datasets using p value and Q value. The results are highly satisfactory.