Background

It is of great importance to capture the reliable molecular biomarkers, which is able to accurately diagnose or even predict the relevant clinical characteristics for a new patient with complex diseases [1]. The traditional approaches can be broadly divided into three categories: expression-based, network-based and function-based methods. Expression-based methods, such as SAM [2], obtained some gene sets as the molecular biomarkers according to differential expression pattern between normal and disease samples (Fig. 1). However, cellular heterogeneity within tissues and genetic heterogeneity across patients could weaken the discriminative power of individual genes [3, 4] and then affect the final performance of expression-based methods on independent datasets. Meanwhile, network-based methods, such as frSVM [5] and stSVM algorithm [6], were proposed to extract the active sub-networks as network biomarkers, by considering biological network information (Fig. 1). Clearly, such analysis could make us further interpret the mechanisms of complex diseases at a system level [7].Although network biomarkers pay attention on a growing consensus that complex diseases are mostly contributed by multiple genes through their sophisticated interactions rather than by the individual genes [8, 9], network-based analysis could not directly elucidate the biological or functional roles of the excavated genes/interactions on specific conditions or samples due to the accumulated interaction information on all circumstances. In addition, based on such network-centered idea, function-based methods, such as PEA [10], Pathway activity classification [11], GSVA [12] and Pathifier [13], are developed to obtain functional or interpretable signatures by integrating the biological knowledge (e.g., pathways) of genes into molecular network and expression information [14] (Fig. 1). However, the biological annotation deposited in databases is assembled from different resources or projects on various conditions, which makes it hard to precisely determine the actual states of particular biological functions under a specific condition, e.g. when a disease occurs to a person with certain genetic or epigenetic background.

Fig. 1
figure 1

An overview of computational methods for identifying biomarkers. a The framework of three kinds of conventional methods; b The brief comparison of our and other methods on input and output data

It is necessary to make biomarkers as a standard tool in the clinical application for precision medicine, which requires biomarkers to have not only discriminative power on samples but also clear biological interpretations [15]. In this work, we develop a novel computational framework, namely Comparative Network Stratification (CNS), to identify functional interpretable network biomarkers using gene expression, gene network and biological function together. Particularly, our biomarkers of the active genes and network structures underlying certain biological functions can better characterize diseases in terms of both discriminative power on phenotypes and readable interpretation on biological functions. To validate the effectiveness of our approach, we have compared CNS with five state-of-the-art methods (such as GSVA, Pathifier, stSVM, frSVM and AEP) on four datasets. The results suggested that CNS can simultaneously identify more discriminative network biomarkers as well as exhibit their biological interpretation in the form of network structure and function annotation. Moreover, we have also applied CNS on a case study of T1D, and provided more biological information on dysfunctional description than other methods. Therefore, CNS is actually a powerful bioinformatics tool, which can investigate functional interpretable network biomarkers in a whole transcriptome and function-centered manner.

Methods

In this section, we describe the computational framework of CNS (Fig. 2). We first introduce data pre-procession, and then present the mathematical model of comparative network stratification to extract networks based on prior-known biological functions. Finally, we apply a classification-based model to select network biomarkers.

Fig. 2
figure 2

The overview of comparative network stratification (CNS) for identifying functional interpretable network biomarkers

Data pre-procession

Given multiple states (e.g. normal and disease, or disease subtypes), the certain context-specific gene co-expression networks were first constructed before applying our CNS. Patients' expression profiles were mapped onto a biological network obtained from the STRING database (http://string-db.org/), by removing missing genes and keeping those interactions with high Pearson correlation coefficients (FDR < 0.01) between gene pairs. These state-specific networks are then integrated as a union background network G used in the following steps.

Extraction of functional interpretable network

As known, the annotations of genes or proteins from Gene Ontology (GO) are described by structured vocabulary, i.e. GO terms [16, 17]. We use the biological processes (BP) terms as prior-known functional genes' collaboration in this step. Given the above obtained background network G, a mixed integer-programming model (i.e. CNS) determines the sub-network corresponding to a certain GO term (e.g. term t), where genes show differential activities in different states (e.g. normal and disease). In other words, an optimal sub-network F = {N, E} should be a functional interpretable gene community derived from G, subject to

  1. i)

    F should be a sub-network of G;

  2. ii)

    F should be a connected graph;

  3. iii)

    F should have enrichment on the genes annotated with GO term t;

  4. iv)

    F should indicate the most active alterations between the weighted context-specific network corresponding to different states.

Such an optimization problem can be solved by flux balance process as the formula below:

$$ \begin{array}{cc}\hfill {\displaystyle \underset{x,y;o}{ \max }}\hfill & \hfill \begin{array}{l}{\displaystyle \sum_i{\displaystyle \sum_{j\ne i}\left({w_{ij}}^1\left({s}_i+{s}_j\right)-{\overline{w}}^1\right)}}{x}_{ij}+{\displaystyle \sum_i{\displaystyle \sum_{j\ne i}\left({w_{ij}}^2\left({s}_i+{s}_j\right)-{\overline{w}}^2\right)}}{x}_{ij}\\ {}+{\displaystyle \sum_i{\displaystyle \sum_{j\ne i}\left(\left|\left({e}_i^1+{e}_j^1\right){w}_{ij}^1-\left({e}_i^2+{e}_j^2\right){w}_{ij}^2\right|-\overline{d}\right)\left({s}_i+{s}_j\right){x}_{ij}}}\end{array}\hfill \\ {}\hfill s.t.\hfill & \hfill \left\{\begin{array}{l}{\displaystyle \sum_i{x}_{io}=0},o\in \left\{ seed\right\}\\ {}{\displaystyle \sum_j{y}_{oj}}={\displaystyle \sum_i{\displaystyle \sum_{j\ne i}{x}_{ij}}}-{\displaystyle \sum_j{x}_{oj}}\\ {}{\displaystyle \sum_j{y}_{ji}}-{\displaystyle \sum_k{y}_{ik}}={\displaystyle \sum_k{x}_{ik}},\forall i\\ {}{y}_{ij}\le V{x}_{ij},\forall i,j\\ {}{x}_{ij}+{x}_{ji}\le 1,\forall i,j\\ {}{x}_{ij}\in \left\{0,1\right\}\end{array}\right.\hfill \end{array} $$
(1)

Here, \( {\displaystyle \sum_i}{\displaystyle {\sum}_{j\ne i}}\left({w}_{ij}^1\left({s}_i+{s}_j\right)-{\overline{w}}^1\right){x}_{ij} \) and \( {\displaystyle \sum_i}{\displaystyle {\sum}_{j\ne i}}\left({w}_{ij}^2\left({s}_i+{s}_j\right)-{\overline{w}}^2\right){x}_{ij} \) can measure how annotative the selected sub-network is in GO term t under two conditions/states respectively. \( {\displaystyle \sum_i}{\displaystyle {\sum}_{j\ne i}}\left(\left|\left({e}_i^1+{e}_j^1\right){w}_{ij}^1-\left({e}_i^2+{e}_j^2\right){w}_{ij}^2\right|-\overline{d}\right)\left({s}_i+{s}_j\right){x}_{ij} \)describes how great the edge weights of a functional sub-network change. Besides, all the constraint conditions can make sure flux balance so as to meet the selection criteria. In details, e 1 i and e 2 i are the average expression values of gene i in two states, while w 1 ij and w 2 ij are the directional relationship strength of a gene pair or edge (i → j) in the context-specific networks. For simplicity, Pearson correlation coefficient of gene-pairs (i → j and j → i) is used as the edge weight. And \( {\overline{w}}^1 \) and \( {\overline{w}}^2 \) represent average edge strength of sub-networks. Similarly, \( \overline{d} \) is the average value of all the edge-alterations in network G. At the same time, several indicators are also necessarily defined: s i and s j are binary (i.e., 0 or 1), representing whether corresponding genes (i.e., gene i and j) are annotated by term t or not, and x ij is another indicator that x ij  = 1 if the interaction or edge (i → j) is selected in a given term, otherwise x ij  = 0.

Noted that the network flux is assumed to originate from a "seed" gene o and flow downstream into a bounded sub-network, where any node can be reachable from the seed. In such a connected sub-network, the flux balance could be defined as \( {\displaystyle {\sum}_j}{y}_{ij}-{\displaystyle {\sum}_k}{y}_{ik}={\displaystyle {\sum}_k}{x}_{ik} \), where y ji (different from w ji ) represents the value of virtual flow from j to i and \( {\displaystyle \sum_k}{x}_{ik} \) is the out-degree of node i. V is a maximum value, which can guarantee that if x ij is zero, its flow y ij also equals zero.

Identification of the functional interpretable network biomarkers

After the above optimization process, we obtained the set of active functional sub-networks corresponding to all GO terms. Thus, a network-based classification model is further proposed to identify the biomarkers from the primary disease-relevant sub-networks, according to the following defined network score.

Network score

A quantitative score is required to measure the discriminative ability of an active functional network. Specifically, the network score (NS) of a given sub-network F in one sample can be calculated via Eq.(2).

$$ N{S}_m=\frac{{\displaystyle \sum_{\left(i,j\right)\in E}\frac{{e_i}_m+{e_j}_m}{2}}}{\sqrt{\left|E\right|}} $$
(2)

where e im and e jm are the expression values of the nodes/genes i and j in a sample m when the edge/interaction (i, j) belongs to F; |E| is the total number of edges in F.

Noted, our NS is actually quantified by the expression profiles as well as related to the topology of sub-networks, consistent with the “network activity” definition in previous studies [1822].

Classification-based model

Next, using the NS to assess network activities, a classification-based model can pick out an optimal network biomarker combination [2325]. One in-house classifier was previously designed to select the minimal number of network features with great classification capacity [20]. Here, we extended this mathematical model to achieve ‘elastic’ classification by adding a correct regulization. Such a modified model is formulated as below:

$$ \begin{array}{l} \min \kern1.5em {\displaystyle \sum_m{\displaystyle \sum_i{C}_{mi}{x}_i}}+\alpha {\displaystyle \sum_i{x}_i}+\beta \varepsilon \\ {}\kern3.6em C{\left({x}_1,{x}_2,\cdots, {x}_n\right)}^T\le \varepsilon \\ {}s.t\kern2.5em {\displaystyle \sum_i{x}_i}\ge 1\\ {}\kern3.5em \varepsilon >=0\\ {}\kern3.3em {x}_i\in \left\{0,1\right\}\end{array} $$
(3)

where x i is binary (i.e., 0 or 1), indicating whether the sub-network i is selected or not; And C is a function matrix, where each element C ij representing j th network's contribution to i th sample if the sample is assigned into the correct group [20]. ε is the 'elastic' correct regulator with its value as small as possible.

In the objective function, the first term is used to characterize the classification capacity of selected biomarkers; the second term is to minimize the marker number in selection process; and the third term is to minimize the classification error; α and β are positive penalty parameters to control the trade-off within signature number, classification err and classification capacity. Certainly, α and β are chosen to obtain the best classification ratio by tuning in a reasonable scale.

In the constraint array, the first constraint is used to ensure an acceptable sample classification; the second constraint is used to guarantee at least one functional sub-network should be selected as final biomarker; and the third constraint is used to generate a reasonable correction in practice.

Results

Stronger effectiveness of CNS compared to state-of-the-art methods

To validate the effectiveness of CNS, a complete comparison scheme had been built to evaluate the performances of the conventional biomarker discovery methods and CNS on gene expression datasets. There are four datasets used in the comparison, i.e. GSE38642 (54 normal vs. 9 disease) [26], GSE18732 (47 normal vs. 45 disease) [27], GSE27342 (80 normal vs. 80 disease) [28] and GSE35713 (79 normal vs. 57 disease) [29]. The compared methods include GSVA [12], Pathifier [13], stSVM [6], frSVM [5] and AEP [10]. As stSVM, frSVM and AEP have been integrated into the netClass package [1], we used the netClass package directly. Due to GSVA and Pathifier implemented without feature selection, we select the same number biomarkers as identified by CNS through SVM-RFE [30].

As proposed, good biomarkers or signatures should have more discrimination ability as well as more interpretable biological sense. Thus, we used two criterions to evaluate the identified biomarkers (see Additional files 1, 2, 3 and 4) respectively: classification accuracy and functional interpretability.

Improved discrimination of CNS evaluated by classification accuracy analysis

We used SVM to calculate the classification accuracy of the biomarkers identified by all methods, employing five-fold cross-validation. The performance of different methods on four datasets were shown as ROC curves (Fig. 3). And the AUC corresponding to these ROC curves were reported in Table 1. These results illustrate CNS biomarkers have the most stable and best classification accuracy than those identified by other methods. Note that, we can see, in the low false positive regions, the true positive rate of CNS is lower than some of other methods, that might be caused by the trade-off between specificity and sensitivity of classification approaches. It would be valuable to further improve the accuracy by careful feature selection or classifier building in future work.

Fig. 3
figure 3

The performance of all compared methods across multiple datasets in classification accuracy analysis. The ROC curves of the compared methods on a GSE38642, b GSE18732, c GSE27342, d GSE35713 datasets

Table 1 The AUC of six methods on four datasets

Improved biological sense of CNS evaluated by functional interpretability analysis

In order to further evaluate the interpretability of these methods on biological functions, we made an association analysis to measure the relationship between the biomarkers and the studied disease, as the functional interpretability power of those biomarkers. And the association degree is evaluated by the proportions of disease-associated genes (DAGs) in all selected functional biomarkers. Here, the DAGs are the intersection of the differential expressed genes (DEGs) identified by t-test and genes associated with the studied disease taken from GeneCard [31]. Meanwhile, the biomarkers identified by frSVM and stSVM are general gene sets rather than functional gene groups by GSVA, Pathifier and AEP; that’s why we first functionally label them with the top enriched terms by g:Profiler [32].

The enrichment ratio boxplots are shown in Fig. 4. It seems CNS has the best performance on every dataset; frSVM has varied performance than other conventional approaches showing its dependency on the context of analyzed datasets; while other methods have nearly the same performances across all datasets. This fact suggests many previous methods would consider less on the functional interpretability, so that they tend to have the similar lower performances; and CNS actually promote the interpretability of identified biomarkers on biological functions as proposed.

Fig. 4
figure 4

The performance of all compared methods across multiple datasets for functional interpretability analysis. The “*” represents the mean value of the functional interpretability power for each method and “-” indicates the median value. The functional interpretability analysis of the compared methods on a GSE38642, b GSE18732, c GSE27342, d GSE35713 datasets respectively

Furthermore, we also computed the P-values to measure CNS improved performance compared to other methods, and shown in Table 2. Obviously, CNS has much better performance on the functional interpretability overall.

Table 2 The significance of better performance on functional interpretability comparison

A proof-of-concept study by CNS

The biomarkers identified by CNS not only determine specific biological functions related to diseases, but also reveal the active network-structure underlying such dysfunctions. These characteristics of biomarkers just make that CNS have improved ability to understand complex diseases. To further illustrate this advantage of CNS, we have made a proof-of-concept study on GSE3571 dataset of T1D (i.e. 79 normal samples and 57 T1D samples).

Dys-regulation analysis of biological functions enriched in biomarkers

Biomarkers would play important roles on the development and progression of complex diseases, and they would be dysfunctional among different disease states. The DEGs proportion is regarded as an effective standard to estimate the dysfunction degree of the biomarkers. Thus, the enrichment ratios of DEGs in each biological functions were calculated to represent the dysfunction degree of our biomarkers. The results were shown in Fig. 5, and CNS obtains the biomarkers with significant dysfunctional signal again.

Fig. 5
figure 5

The performance of the compared methods for dysfunction analysis. The “*” represents the mean value of the functional interpretability power for each method and “-” indicates the median value

On one hand, the traditional function-based methods (GSVA, Pathifier and AEP) and CNS all selected function biomarkers. The traditional methods assumed to consider all genes of each biological functions and transform gene expressions to some meta-values with good distinguishable capacity. Different to them, CNS recognized the representative genes of each biological function, which would promote the enrichment of DEGs, and reflect more reasonable dysfunction interpretation.

On the other hand, the traditional expression-based and network-based methods are designed to identify gene sets or sub-networks as biomarkers. Although these biomarkers could contain many DEGs, these enriched genes could be isolated and cannot provide readable functional interpretation. According to these results of DEGs enrichment for biomarkers, the performance of the traditional expression-based and network-based methods may be slightly better than traditional function-based methods, but still be less than CNS.

Significance evaluation of dysfunction in biomarkers

Besides dysfunction degree measurement, it is also necessary to evaluate the level of significance of those dysfunctions in the biomarkers. A hyper-geometric test is used to compute the P-value of the enrichments of DEGs for a biomarker/a biological function.

$$ P=1-{\displaystyle \sum_{k=0}^{X-1}\frac{\left(\begin{array}{c}\hfill G\hfill \\ {}\hfill k\hfill \end{array}\right)\left(\begin{array}{c}\hfill R-G\hfill \\ {}\hfill T-k\hfill \end{array}\right)}{\left(\begin{array}{c}\hfill R\hfill \\ {}\hfill T\hfill \end{array}\right)}} $$
(4)

where R is the number of all genes, T is the number of genes in the biomarker/the biological function; G is the number of DEGs; X is the number of DEGs enriched in the biomarker/the biological function.

Given a threshold of statistic significance, we define those biomarkers/biological functions with less P-values as significant ones. And then the ratios of these significant biomarkers/functions in identified biomarkers under the varied thresholds are shown in Fig. 6. Obviously, CNS is more effective to obtain dysfunction-explainable biomarkers than other methods.

Fig. 6
figure 6

Percentages of dysfunctions obtained under different thresholds of significance

Biological significances of dysfunctions of biomarkers

The most significant dysfunction of ten biomarkers/functions identified by CNS are listed in Table 3, five of which are related to T1D as reported in literatures. These biomarkers include: two functions direct correlated with T1D (e.g. regulation of insulin secretion and regulation of gluconeogenesis [31]), and three functions relevant with T1D complications (e.g. positive regulation of cytokine-mediated signaling pathway [33], positive regulation of response to cytokine stimulus [34, 35], establishment of T cell polarity [3537]).

Table 3 The biological sense of dysfunction of biomarkers

P-values_Dis evaluates the significance of the disease genes enriched in biomarkers; P-values_Diff evaluates the significance of DEGs enriched in the biomarkers; R represents whether one biomarker is known related with T1D or its complications; Y denotes the biomarker is known associated with T1D or its complications; N denotes the biomarker is unclearly associated with T1D

A case study on the network biomarkers

Insulin secretion dysfunction is known to be a key characteristic of T1D. And such regulation function identified in our biomarkers is further analyzed to see how its topological structure changes and gene expression perturbations.

The overall and state-specific network structures of “the regulation of insulin secretion” were shown in Fig. 7. The network topology under normal condition is very similar to background network structure, while the network under T1D loses much functional completeness (Fisher’s Exact Test P-value is 7.31E-06). In particular, as Nils Billestrup et al. ever reported, SOCS inhibited IRS in diabetic patients [38], which agreed with their disappearance in our T1D active network structure. Meanwhile, our network biomarker has 46 genes including 39 DEGs, which means that the network biomarker can represent an active functional module/unit under different states. But when we check the same function in the biomarkers identified by other five methods, the real active genes are more likely drown out in a huge gene set. The function of “the regulation of insulin secretion” is probably missed by network-based methods, because the frSVM biomarkers (94 genes) and stSVM biomarkers (445 genes) have only 4 and 14 genes enriched in this biological function. Besides, for those function-based methods (GSVA, Pathifier and AEP), the real active network structure of this regulation function is also incomplete in their biomarkers, where only 23 DEGs are covered in 55 genes. All the comparison results further show CNS, different from traditional methods, can recognize the representative genes of biological functions, which would promote the enrichment of differential expressed genes, and reflect more readable interpretation on biological dysfunctions.

Fig. 7
figure 7

The topology alterations of the network biomarker. a The background network is the biomarker of “The regulation of insulin secretion”, which is a union network of (b) normal-specific active sub-network and (c) disease-specific active sub-network identified by CNS. Nodes and edges in black present genes and interactions are selected under a certain state; Nodes and edges in gray mean genes and interactions are background features and not selected

Discussion and Conclusion

The complex diseases are usually thought to be caused by the dysfunction on the molecular interaction network (e.g. protein-protein interactions). Conventional expression-based and network-based methods widely use genes and their networks to capture the biomarkers without clear complete biological interpretation due to subsequent function enrichment in the whole gene pool. While, the function-based analysis just solved this problem by integrating biological functions (e.g. pathways or GO terms) into gene sets at the beginning. However, rather than a list of disease associated functions, detailed alterations (e.g. topological structure) in such functions clearly are more valuable to precision medicine study to some extent.

With respect to the problems, we proposed a novel computational framework, named as comparative network stratification (CNS), to identify active sub-networks as functional interpretable network biomarkers, which have both discriminative power on disease states and readable interpretation on biological functions. Actually, we previously also developed a method as Network Stratification Analysis (NetSA) [39] to decompose an context-specific biological network into many function-specific network modules. CNS analyzed the characteristics of two or more states of complex diseases based on the same background biological network. Using CNS, the extended method of NetSA, we tessellated two-states' context-specific networks and identified the active network structures of biological functions under normal and disease conditions, respectively.

To validate the effectiveness of such a new approach, we have compared CNS with five state-of-the-art methods (such as GSVA, Pathifier, stSVM, frSVM and AEP) on four datasets of complex diseases. These results demonstrate CNS considers the representative genes and their networks in biological functions, and thus it can have better discrimination, better enrichment on disease-associated genes and better enrichment on differential expressed genes simultaneously. Therefore, CNS will be actually a powerful bioinformatics tool to investigate functional interpretable network biomarkers in a whole transcriptome and function-centered manner.

Besides, we mainly focus on the discrimination between normal and disease, which can be expanded to distinguish multiple diseases in future work. In fact, in our experiments, we have analysis on type 1 diabetes (GSE35713) and type 2 diabetes (GSE38642) in this paper. As known to us, type 1 diabetes (T1D) and type 2 diabetes (T2D) are two subtypes of Diabetes and have similar disease genes from the GeneCard database including 5066 and 4970 disease genes, respectively. However, the number of the identified biomarkers of T1D and T2D are 91 and 164 respectively, and the overlap of them only contains 7 biomarkers. Meanwhile, Figs. 4d and a both show the functional interpretability of the identified biomarkers of T1D and T2D respectively. Based on these results, we could find that the functional interpretability of the identified biomarkers in T1D and T2D is indeed different, and this fact suggests that the different genes or functions form the different biomarkers for T1D and T2D. Therefore, the identified functional biomarkers are sensitive and specific on the diseases.

In addition, considering the functional containment relationships or the ancestors-descendants relationships of the biological functions in GO database, it is necessary to remove redundant functional interpretations in the network biomarkers in the future work around CNS.