StressGenePred is an integrated analysis method for multiple stress time-series data. StressGenePred (Fig. 2) includes two submodels: a biomarker gene discovery model (Fig. 3) and a stress type prediction model (Fig. 4). To deal with the high-dimension, low-sample-size data problem, both models share a logical correlation layer with the same structure and the same model parameters. From a set of transcriptome data measured under various stress conditions, StressGenePred trains the biomarker gene discovery model and the stress type prediction model sequentially.
Submodel 1: biomarker gene discovery model
This model takes a set of stress labels, Y, and gene expression data, D, as input, and predicts which genes are biomarkers for each stress. It consists of three parts: generation of an observed biomarker gene vector, generation of a predicted biomarker gene vector, and comparison of the predicted vector with the label vector. The architecture of the biomarker gene discovery model is illustrated in Fig. 3, and the process is described in detail as follows.
Generation of an observed biomarker gene vector
This part generates an observed biomarker vector, Xk, from the gene expression data of each sample k, Dk. Since each time-series dataset is measured at different time points under different experimental conditions, time-series gene expression data must be converted into feature vectors of the same structure and the same scale. This process is called feature embedding. For the feature embedding, we symbolize the change of expression before and after stress treatment as up-, down-, or non-regulation. In detail, the time-series data of sample k is converted into an observed biomarker gene vector of length 2N, Xk={xk1,…,xk,2N}, where xk,2n−1∈{0,1} is 1 if gene n is down-regulated and 0 otherwise, and xk,2n∈{0,1} is 1 if gene n is up-regulated and 0 otherwise. To determine up-, down-, or non-regulation, we use fold change information. First, if there are multiple expression values measured in replicate experiments at a time point, the mean of the expression values is calculated for that time point. Then, the fold change value is computed by dividing the maximum or minimum expression value of a time-series by the expression value at the first time point. After that, a gene whose fold change value is <0.8 or >1/0.8 is considered a down- or up-regulated gene, respectively. The threshold value of 0.8 was selected empirically: with this threshold, the fold change analysis generates at least 20 up- or down-regulated genes for every time-series dataset.
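To make the embedding concrete, a minimal NumPy sketch is given below. The function and variable names are illustrative rather than part of our implementation, replicates are assumed to be already averaged per time point, and the (down, up) column ordering follows the definition above.

```python
import numpy as np

def embed_time_series(expr, threshold=0.8):
    """Convert one time-series sample (genes x time points) into a
    binary vector of length 2N. For gene n (0-based), position 2n
    flags down-regulation and position 2n+1 flags up-regulation,
    mirroring the paper's (down, up) ordering. Illustrative sketch."""
    n_genes = expr.shape[0]
    x = np.zeros(2 * n_genes, dtype=np.float32)
    first = expr[:, 0]
    fc_min = expr.min(axis=1) / first   # strongest decrease vs. first time point
    fc_max = expr.max(axis=1) / first   # strongest increase vs. first time point
    x[0::2] = (fc_min < threshold).astype(np.float32)        # down-regulated
    x[1::2] = (fc_max > 1.0 / threshold).astype(np.float32)  # up-regulated
    return x

# Toy usage: 3 genes measured at 4 time points.
expr = np.array([[1.0, 0.9, 0.5, 0.7],    # drops below 0.8x -> down
                 [1.0, 1.1, 1.4, 1.3],    # rises above 1.25x -> up
                 [1.0, 1.0, 0.9, 1.1]])   # stays within band -> non-regulated
print(embed_time_series(expr))  # [1. 0. 0. 1. 0. 0.]
```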
Generation of a predicted biomarker gene vector
This part generates a predicted biomarker gene vector, \(X^{\prime }_{k}\), from the stress type label Yk. \(X^{\prime }_{k}=\{x^{\prime }_{k1}, \ldots, x^{\prime }_{k,2N}\}\) is a vector of the same size as the observed biomarker gene vector Xk, and its values indicate up- or down-regulation in the same way as Xk. For example, \(x^{\prime }_{k,2n-1}=1\) means gene n is predicted to be a down-regulated biomarker, and \(x^{\prime }_{k,2n}=1\) means gene n is predicted to be an up-regulated biomarker, for a specific stress Yk.
A logical stress-gene correlation layer, W, measures the weights of association between genes and stress types. The predicted biomarker gene vector, \(X_{k}^{\prime }\), is generated by multiplying the stress type of sample k with the logical stress-gene correlation layer, i.e., Yk×W. In addition, we use the sigmoid function to squash the output values to between 0 and 1. The stress vector, Yk, is encoded as a one-hot vector over l stresses, where each element indicates whether sample k belongs to that specific stress type. Finally, the predicted biomarker gene vector, \(X_{k}^{\prime }\), is generated as follows:
$$X^{\prime}_{k} = sigmoid(Y_{k} \times W) = \frac{1}{1+\exp(-Y_{k} \times W)}, \quad \text{where} ~~ W = \left(\begin{array}{cccc} w_{11} & w_{12} & \ldots & w_{1,2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{l1} & w_{l2} & \ldots & w_{l,2N} \end{array}\right) $$
The logical stress-gene correlation layer has the structure of a single-layer neural network. Its weights are learned by minimizing the difference between the observed biomarker gene vector, Xk, and the predicted biomarker gene vector, \(X^{\prime }_{k}\).
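For illustration, the forward pass of this layer reduces to a single matrix product followed by a sigmoid. The sketch below uses NumPy with hypothetical names (predict_biomarker_vector, y_onehot) and small toy dimensions:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_biomarker_vector(y_onehot, W):
    """Predicted biomarker vector X'_k = sigmoid(Y_k x W).

    y_onehot : (l,) one-hot stress label of sample k.
    W        : (l, 2N) logical stress-gene correlation layer.
    """
    return sigmoid(y_onehot @ W)

# Toy usage: l = 2 stresses, N = 2 genes (2N = 4 features).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
y_k = np.array([1.0, 0.0])          # sample k has stress type 1
x_pred = predict_biomarker_vector(y_k, W)
print(x_pred.shape)                 # (4,)
```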
Comparison of the predicted vector with the label vector
Cross-entropy is a widely used objective function in logistic regression problems because of its robustness to data containing outliers [12]. Thus, we use cross-entropy as the objective function to measure the difference between the observed biomarker gene vector, Xk, and the predicted biomarker gene vector, \(X^{\prime }_{k}\), as below:
$$loss_{W} = - \sum\limits^{K}_{k=1} \left(X_{k} \log\left(sigmoid(Y_{k}W)\right) + (1 - X_{k}) \log\left(1-sigmoid(Y_{k}W)\right) \right) $$
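A direct NumPy transcription of this objective might look as follows; this is a sketch that reuses the sigmoid helper above, and the eps guard against log(0) is our addition, not part of the stated loss:

```python
def cross_entropy_loss(X, Y, W, eps=1e-12):
    """loss_W summed over K samples.

    X : (K, 2N) observed biomarker gene vectors.
    Y : (K, l)  one-hot stress labels.
    W : (l, 2N) stress-gene correlation layer.
    """
    P = sigmoid(Y @ W)  # predicted biomarker gene vectors X'_k
    return -np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps))
```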
By minimizing the cross-entropy loss, the logistic functions of the output prediction layer are learned to predict the true labels. The outputs of the logistic functions can predict that a given gene responds to only one stress or to multiple stresses. Although it is natural for a gene to be involved in multiple stresses, we propose an additional loss term because we aim to find biomarker genes that are specific to a single stress. To control the relationships between genes and stresses, we define a group penalty loss. For each gene, the penalty is calculated based on how many stresses the gene is involved in. Given a gene n, a stress vector gn is defined as gn=[gn1,gn2,...,gnl] over l stresses, with \(g_{nl}=\max (w_{l,2n-1}, w_{l,2n})\). The group penalty for gene n is then defined as \(\left (\sum _{l} g_{nl}\right)^{2}\). Since we generate the output with a logistic function, gnl has a value between 0 and 1. Hence, if gene n is specific to a single stress, its group penalty is at most 1; if gene n reacts to multiple stresses, the penalty value increases quadratically. Using this characteristic, the group penalty loss is defined as below:
$$loss_{group} = \alpha \sum\limits^{N}_{n=1} \left(\sum\limits^{L}_{l=1} g_{nl}\right)^{2}$$
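Under the assumption that the layer weights have already been squashed to (0,1) by the logistic function, the group penalty can be sketched as below, pairing each gene's down/up columns before taking the maximum (identifiers are illustrative):

```python
def group_penalty(W, alpha=0.06):
    """loss_group = alpha * sum_n (sum_l g_nl)^2, with
    g_nl = max over gene n's (down, up) column pair in row l.
    W : (l, 2N), assumed already squashed to (0, 1)."""
    l, two_n = W.shape
    G = W.reshape(l, two_n // 2, 2).max(axis=2)   # (l, N): g_nl per stress and gene
    return alpha * np.sum(G.sum(axis=0) ** 2)     # sum over stresses, square, sum over genes
```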
In the group penalty loss, the hyper-parameter α regulates the effect of the group penalty term. A too-large α imposes excessive group penalties, so even genes that respond to multiple stresses are linked to only a single stress. On the other hand, if the α value is too small, most genes respond to multiple stresses. To balance this trade-off, we tuned α so that our model ranks well-known stress-related genes within the top 500 biomarker genes for each stress. Therefore, in our experiment, α was set to 0.06; these genes are introduced in the “Ranks of biomarker genes and the group effect for gene selection” section.
Submodel 2: stress type prediction model
From the biomarker gene discovery model, the relationships between stresses and genes are captured in the stress-gene correlation layer W. To build the stress type prediction model from feature vectors, we utilize the transposed logical layer WT and define a probability model as below:
$$A_{k} = sigmoid\left(X_{k} W^{T}\right)$$
$$A_{kl} = sigmoid\left(\sum\limits^{2N}_{i=1} x_{ki} w_{li} \right) $$
The matrix W is obtained from the training process of the biomarker gene discovery model. Ak is the vector of activation values for the stress types, and it shows very large deviations depending on the sample. Therefore, normalization is required and is performed as below:
$$A^{norm}_{k} = \frac{A_{k}}{\sum\limits^{2N}_{n=1}{x_{kn}}} $$
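Both steps, activation through the transposed layer and per-sample normalization, can be sketched as follows (NumPy, reusing the sigmoid helper above; since X is binary, its row sums count the regulated features of each sample):

```python
def predict_stress_activations(X, W):
    """Stress-type activations via the transposed layer,
    A = sigmoid(X @ W.T), normalized per sample by the
    number of regulated features.

    X : (K, 2N) binary observed biomarker gene vectors.
    W : (l, 2N) trained stress-gene correlation layer.
    """
    A = sigmoid(X @ W.T)                        # (K, l) raw activations
    n_regulated = X.sum(axis=1, keepdims=True)  # (K, 1) regulated-feature counts
    return A / n_regulated                      # (K, l) normalized activations
```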
These normalized feature vectors encapsulate the average stress-feature weights, which reduces the variance among vectors from different samples. As another effect of the normalization, absolute average weights are considered rather than a relative indicator such as softmax, which can reduce the false-positive rate of the predicted stress labels. Using the normalized weights \(A^{norm}_{k}\), a logistic filter is defined to generate a probability as below:
$$g_{k}\left(A^{norm}_{k}\right) = \frac{1}{1+b_{l} \times \exp\left(A^{norm}_{k}-a_{l}\right)} $$
where a and b are parameter vectors of length L of the logistic model g(x), applied element-wise to each stress type.
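A minimal element-wise sketch of this filter, with a and b as length-L parameter vectors, is given below (names are our own):

```python
def logistic_filter(A_norm, a, b):
    """Element-wise logistic filter g(A) = 1 / (1 + b * exp(A - a)).

    A_norm : (K, L) normalized activations.
    a, b   : (L,) location and scale parameters, one pair per stress type.
    """
    return 1.0 / (1.0 + b * np.exp(A_norm - a))
```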
Learning of this logistic filter layer starts with normalization of the logistic filter outputs, which facilitates learning by regularizing the mean of the vectors. Then, to minimize the loss on positive labels and the entropy on negative labels, we adopted the Confident Multiple Choice Learning (CMCL) loss function [13] for our model as below:
$$loss_{CMCL}\left(Y_{k}, g\left(A^{norm}_{k}\right)\right) = \sum\limits^{K}_{k=1} \left(\left(1-A^{norm}_{k,Y_{k}}\right)^{2} - \beta \sum\limits^{L}_{l \neq Y_{k}} \log\left(A^{norm}_{kl}\right) \right) $$
To avoid overfitting, the pseudo-parameter β is set according to the recommended setting from the original CMCL paper [13]. In our experiments, β=0.01≈1/108 is used.
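A sketch of this CMCL-style objective in NumPy is shown below; the eps guard and the index-based masking are our additions, and y_idx is a hypothetical array holding the true stress-type index of each sample:

```python
def cmcl_loss(A_norm, y_idx, beta=0.01, eps=1e-12):
    """Squared error on the true label's activation plus a
    beta-weighted entropy term on the non-target labels.

    A_norm : (K, L) normalized stress activations.
    y_idx  : (K,)   integer index of each sample's true stress type.
    """
    K, L = A_norm.shape
    target = A_norm[np.arange(K), y_idx]          # activation of the true label
    mask = np.ones((K, L), dtype=bool)
    mask[np.arange(K), y_idx] = False             # select the non-target labels
    return np.sum((1.0 - target) ** 2) - beta * np.sum(np.log(A_norm[mask] + eps))
```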