Background

Transcriptomic profiling is important for understanding how genes regulate biological functions and control the mechanisms underlying disease. Differential expression (DE) analysis is used to identify genes whose expression changes between healthy and disease conditions. This analysis helps researchers identify differentially expressed genes (DEGs) that behave differently in disease and prioritize these condition-specific genes as potential biomarkers for a particular disease. Numerous parametric and non-parametric statistical methods have been developed for DEG analysis based on the log fold change (logFC) of gene expression between control and disease states [1]. For instance, DESeq [2], DESeq2 [3], edgeR [4], and voom [5] use the variance (dispersion) of gene expression values to identify DEGs. However, biases introduced during the computation of dispersion lead to high false positive and false negative rates when predicting DEGs from RNA-seq data. Recently, many machine learning (ML) methods have been developed to classify genes on the basis of gene expression. For example, a Support Vector Machine (SVM) with mutual information was used to classify genes that distinguish colon cancer patients from healthy patients [6]. Similarly, Logistic Regression (LR) has been used to classify gene expression from microarray experiments between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) in the Golub leukemia data, and between cervical cancer and normal tissues [7]. A Random Forest based method was proposed to classify genes in microarray data [8]. Additionally, an empirical study assessed various state-of-the-art supervised ML methods, namely Decision Tree (DTC), Linear Regression (LR), Naïve Bayes (NB), Random Forest (RFC), and Support Vector Machine (SVC), for classifying gene expression in RNA-seq datasets [9]. However, these ML methods require the selection of gene features as prior knowledge to train the classifier.

Deep learning (DL) coupled with transfer learning, on the other hand, can classify novel data by directly learning complex non-linear relationships among the features of the training data in one end-to-end classification system [10]. A Convolutional Neural Network (CNN) is a type of deep learning model that applies convolution operations in one or more of its internal layers. CNN architectures enable massive computation and the learning of non-linear relations between input and output data [11]. It has recently been demonstrated that CNNs are powerful classifiers for both image and non-image data because of characteristics such as feature extraction, efficient hierarchical filtering through internal layers that deeply train a model, weight sharing that mitigates memory requirements, and the use of neighborhood information [12,13,14,15,16,17]. For example, DeepInsight [17] is a CNN-based model which has shown promising results in feature extraction from non-image data, such as gene expression, text, or synthetic data. Applying DL to RNA-seq data is challenging due to the absence of appropriate labels and sample sizes (n) that are small compared to the number of genes (g) [18]. Kakati et al. [19] proposed the first DL-based method to predict upregulating (UR) and downregulating (DR) genes from RNA-seq breast cancer and Parkinson’s disease datasets. Additionally, recent papers [20,21,22] have reported the use of transfer learning to predict cancer types or the survival of cancer patients. Moreover, many modified versions of ML methods, such as SVC, RFC, and DTC, have been reported which use learned knowledge to implement transfer learning between different image domains [23,24,25,26,27]. However, with the increasing availability of gene-expression data, there is an opportunity to create a generalized model that uses trained features to identify potential biomarkers among UR and DR genes from small or large untrained datasets. To the best of our knowledge, there is currently no CNN-based model coupled with transfer learning to predict gene expression directionality from both trained and untrained RNA-seq datasets.

In this paper, we propose a robust CNN-based model, DEGnext, in conjunction with transfer learning to classify the UR and DR genes from RNA-seq cancer datasets.

Results

In this section, we report the experimental results of DEGnext for both general and transfer learning. In the Methods section, we describe each step of the DEGnext workflow and the CNN architecture in detail, so here we give only a brief overview to aid interpretation of the results. Fig. 1 shows the workflow of DEGnext, and Fig. 2 illustrates the CNN architecture used in DEGnext to train the model and classify test data as UR and DR genes.

Fig. 1

Workflow of DEGnext methodology. In this workflow, there are three main phases. The first phase involves data collection, preprocessing, labelling, and splitting of the data. Here, we split the data into two parts: non-biologically validated data (“non-bio data” or P) and biologically validated data (“bio data” or Q). T1 is the non-biologically validated train data of P (“non-bio train data”, 80% of P). T2 is the non-biologically validated test data of P (“non-bio test data”, 20% of P). F1 is the fine-tune data of the biologically validated data (“fine-tune data”, 80% of Q). T3 is the biologically validated test data of Q (“bio-test data”, 20% of Q). The second phase includes training (first-level training), fine-tuning (second-level training), and testing of the CNN model to predict UR and DR genes. The third phase includes downstream enrichment analyses of the predicted UR and DR genes to identify potential biomarkers related to a cancer dataset. The CNN architecture is illustrated in Fig. 2

Fig. 2

CNN architecture of DEGnext. The input to the model is a 1D input vector (\(x_{1}\), \(x_{2}\), \(\dots\), \(x_{n}\)), which represents each gene row of a cancer dataset. This 1D vector is converted to a 2D matrix with one channel using np.reshape(). We used a sequence of eight 2D convolutional neural network (CNN) layers (\(C_{1}\), \(C_{2}\), \(\dots\), \(C_{8}\)) with ReLU() as the activation function. Each CNN layer uses a kernel size of (3, 3), a stride of 1, and padding equal to 1. We used a 2D Maxpool layer with a kernel size of 2. In order to make the model inclusive of any input size, we used a 2D AdaptiveMaxPool layer with a target output size of 1 \(\times\) 1. The output of the CNN layers is fed to a sequence of 5 linear layers (\(L_{1}\), \(L_{2}\), \(\dots\), \(L_{5}\)) with ReLU() as the activation function. We applied Softmax() to the output of the linear layers to obtain the probability of each class in the range [0, 1]

Table 1 Dataset abbreviations for cancer datasets used in DEGnext

We used the TCGAbiolinks R package [28] to download 17 datasets (listed in Table 1).

For labeling the genes in the datasets, we used logFC values in addition to disease-related knowledge from Ingenuity Pathway Analysis (IPA) to divide the data into biologically and non-biologically validated data. First, we use general learning with DEGnext to predict the directionality of DEGs from all 17 datasets. Second, we divide the 17 datasets into 9 training datasets and 8 testing (untrained) datasets and use transfer learning to leverage the knowledge (features, weights) acquired by the previously trained DEGnext model to predict UR and DR genes from the remaining 8 testing datasets. Third, we evaluate the performance of the DEGnext model for general and transfer learning against five ML methods, namely Decision Tree (DTC), K-Neighbors Classifier (KNC), Random Forest (RFC), Support Vector Machine (SVC), and XGBoost, in terms of accuracy, recall, precision, F-measure, Matthews correlation coefficient (MCC), and Receiver Operating Characteristic (ROC) scores. Fourth, we test the robustness of DEGnext by augmenting the datasets with seven levels of Gaussian noise (1, 10, 50, 100, 500, 1000, and 1500\(\%\)) and compare it with the other ML methods. Fifth, we obtain the Gene Ontology (GO) term enrichment and pathway enrichment of the predicted up/down regulated genes from cancer datasets. Finally, we identify the potential biomarkers mapped to the significant pathways related to the BRCA and UCEC datasets. Throughout the text and figures, we use the following abbreviations for the testing and training data. The non-biologically validated data is labeled as P (“non-bio data”), and the biologically validated data is labeled as Q (“bio data”). The P data were split into non-biologically validated train data (“non-bio train data” or T1) and non-biologically validated test data (“non-bio test data” or T2) in the ratio of 80:20. Similarly, the Q data were split in the ratio of 80:20 into fine-tune data (F1) and biologically validated test data (“bio-test data” or T3), respectively.
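The 80:20 splits of P and Q can be sketched in a few lines; this is a minimal illustration assuming `non_bio` (P) and `bio` (Q) are pandas DataFrames with one labeled gene per row, and the `random_state` value is an arbitrary assumption.

```python
# Hedged sketch of the data splits described above (80:20 for both P and Q).
from sklearn.model_selection import train_test_split

T1, T2 = train_test_split(non_bio, test_size=0.20, random_state=42)  # non-bio train / non-bio test
F1, T3 = train_test_split(bio, test_size=0.20, random_state=42)      # fine-tune / bio-test
```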

Performance of DEGnext in general learning experiment

In the general learning experiment, we first trained the model with non-bio train data (T1) for all 17 datasets using fivefold cross-validation. Then, we took the best fold model for each dataset and further trained it with fine-tune data (F1). To evaluate the overall DEGnext output quality in general learning, we calculated the mean of five metrics, namely accuracy, recall, precision, F-measure, and MCC, across the five folds of bio-test data (T3). In Table 2, we report the performance of DEGnext on bio-test data (T3) for all 17 datasets. The mean accuracy, recall, and precision scores for all datasets were within the range of 95-100\(\%\). The F-measure and MCC scores for all 17 datasets were above 0.85. This demonstrates that the proposed model is effective in classifying the up/down regulated genes from the bio-test data of the respective TCGA cancer datasets. For instance, the MCC score for the CHOL, KICH, KIRC, KIRP, LUAD, LUSC, and THCA datasets is 1, which signifies perfect prediction of up/down regulated genes by DEGnext.

Table 2 Performance of DEGnext on bio-test data (T3) of all 17 datasets using general learning with fivefold cross-validation

Performance of DEGnext in transfer learning

General learning results for DEGnext are nearly perfect. However, most RNA-seq datasets do not have appropriate labels and have sample sizes (n) that are small compared to the number of genes (g). In those situations, a general model cannot be obtained using supervised learning and we must rely on models trained on other datasets. This motivated us to make DEGnext generalizable to RNA-seq datasets, irrespective of dataset size or the availability of appropriate labels. Instead of training a CNN from scratch for each new dataset, we wanted to apply the pretrained DEGnext model to new, unlabeled datasets to predict significant UR and DR genes. Transfer learning can carry the knowledge of trained feature maps from a trained model to untrained cancer datasets. Moreover, general learning on 17 datasets is time-consuming, since we need to train and fine-tune the model for each dataset separately to predict UR and DR genes from the bio-test data of the respective dataset. To analyze the effectiveness of DEGnext in transfer learning, we divided the 17 datasets into two groups based on sample size: 9 larger training datasets (BRCA, LIHC, LUAD, LUSC, KIRC, KIRP, PRAD, THCA, and UCEC) and 8 smaller testing, or untrained, datasets (BLCA, CHOL, COAD, ESCA, HNSC, KICH, READ, and STAD). We used this splitting strategy so that features learned during training would generalize to any unknown or new dataset regardless of its size. The testing datasets consisted of 100\(\%\) of the biologically validated data (Q). Since the testing datasets are smaller than the training datasets, we first used the non-bio train data (T1) and fine-tune data (F1) to train the DEGnext model on all 9 training datasets sequentially. The trained model was then tested on the biologically validated data (Q) of the untrained datasets to predict their UR and DR genes.

In Table 3, we show that except for COAD and READ, all other untrained TCGA datasets attained an overall performance above 84\(\%\) when using our DEGnext model on their biologically validated data (Q).

Table 3 Performance of DEGnext on biologically validated data (Q) of 8 testing or untrained datasets using transfer learning

The MCC scores for datasets such as BLCA, CHOL, HNSC, KICH, and STAD were above 0.9, which signifies good prediction of UR and DR genes from the untrained datasets. This approach reduces the amount of data and time required to train new models for new datasets. Thus, we conclude that DEGnext was able to transfer the knowledge of learned feature maps from the trained datasets to the untrained datasets effectively.

Comparison of DEGnext performance with other ML methods

We assessed the performance of DEGnext for both general and transfer learning in predicting UR and DR genes against five ML methods. In particular, we compared DEGnext to DTC, KNC, RFC, SVC, and XGBoost in terms of the mean accuracy, recall, precision, F-measure, MCC, and ROC scores. For general learning, we trained and fine-tuned DEGnext and the ML methods with non-bio train data (T1) and fine-tune data (F1) for each dataset. Then, we tested the models on bio-test data (T3) of all 17 datasets with fivefold cross-validation. Additional file 1: Table S1 shows that DEGnext’s results for general learning are competitive with the other ML methods for all 17 datasets. In Fig. 3, the ROC scores for all 17 datasets were 0.96 or above. We also found that XGBoost outperformed the other traditional ML methods, such as DTC, KNC, and SVC, on most of the datasets.
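The reported evaluation metrics can be computed with scikit-learn; the snippet below is an illustrative sketch assuming `y_true` holds the UR/DR labels of a bio-test set (T3) and `y_prob` the predicted class-1 probabilities of a model.

```python
# Illustrative computation of the reported metrics (accuracy, recall, precision,
# F-measure, MCC, and ROC score); y_true and y_prob are assumed 1D arrays of
# true labels and predicted probabilities, respectively.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_pred = (y_prob > 0.5).astype(int)
scores = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "F-measure": f1_score(y_true, y_pred),
    "MCC":       matthews_corrcoef(y_true, y_pred),
    "ROC":       roc_auc_score(y_true, y_prob),
}
```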

Fig. 3

ROC curves of 17 datasets for general learning. Comparison of ROC scores for general learning of bio-test data (T3) for all 17 datasets

For transfer learning, we trained DEGnext and the ML methods on 80\(\%\) of the non-bio train data (T1) and 80\(\%\) of the fine-tune data (F1) of the 9 training datasets sequentially. We then tested the pretrained models on 100\(\%\) of the biologically validated data (Q) of the untrained datasets. In Additional file 1: Table S2, we show that except for the COAD and READ datasets, DEGnext’s results are consistent with those of the other ML methods in terms of accuracy, recall, precision, F-measure, MCC, and ROC scores for all the untrained datasets. In Fig. 4, the ROC scores were above 0.85 for DEGnext and XGBoost for all testing datasets, in line with the other ML methods. Therefore, from our results, we conclude that DEGnext is competitive or better in both general learning and transfer learning.

Fig. 4

ROC curves of 8 untrained datasets for transfer learning. Comparison of ROC scores for transfer learning on 100\(\%\) of the bio data (Q) for all test datasets, namely BLCA, CHOL, COAD, ESCA, HNSC, KICH, READ, and STAD

Robustness of DEGnext

To demonstrate that DEGnext is robust to noise, we tested its performance with Gaussian noise added at 7 increasing levels: 1\(\%\), 10\(\%\), 50\(\%\), 100\(\%\), 500\(\%\), 1000\(\%\), and 1500\(\%\). In Fig. 5A, we show that for all the datasets, DEGnext was quite robust to increasing levels of noise up to a standard deviation of 500\(\%\). However, for the READ, COAD, and UCEC datasets, the accuracy decreased with increasing levels of noise. In Fig. 5B, we show that DEGnext performs competitively with, or better than, the other ML methods in terms of the mean accuracy for the bio-test data (T3) of all 17 datasets.
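The exact scaling of the injected noise is not fully specified here; the sketch below illustrates one way such an augmentation could be generated, assuming the noise standard deviation is taken as a percentage of the overall standard deviation of an expression matrix `X` (both the scaling rule and the variable name are assumptions).

```python
# Hedged sketch of the Gaussian-noise augmentation described above; the scaling
# of the noise standard deviation is an assumption.
import numpy as np

def add_gaussian_noise(X, level_percent, seed=0):
    rng = np.random.default_rng(seed)
    sigma = (level_percent / 100.0) * X.std()            # noise std as % of the data std (assumed)
    return X + rng.normal(0.0, 1.0, size=X.shape) * sigma

noise_levels = [1, 10, 50, 100, 500, 1000, 1500]         # percent levels, as in Fig. 5
noisy_versions = {p: add_gaussian_noise(X, p) for p in noise_levels}
```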

Fig. 5

Robustness comparison for DEGnext. A Robustness of DEGnext to noisy data for the bio-test data (T3) of all 17 datasets. B Comparison with other ML methods in terms of the mean accuracy for the bio-test data (T3) of all 17 datasets in the presence of different levels of Gaussian noise

GO enrichment analysis of predicted UR and DR genes

After classifying the DEGs into UR and DR genes using our DEGnext model, we assessed the GO enrichment of the predicted UR and DR genes using the ToppGene Suite. As shown in Table 4, the predicted UR and DR genes were enriched with common GO terms associated with carcinogenesis.

Table 4 Analysis of GO enrichment of predicted UR and DR genes for BRCA and UCEC datasets

For instance, we observed that the predicted UR and DR genes from the BRCA dataset were related to GO terms such as cellular adhesion and cell morphogenesis, which are associated with cancer cell invasion and metastasis [29]. For the UCEC dataset, the GO terms mapped from the predicted UR and DR genes were mainly related to reproductive processes, reproduction, and positive regulation of plasminogen activation, and had highly significant p-values and q-values. There is evidence [30] that activation of plasminogen by cancer cells leads to breakdown of cellular components, which in turn allows cancer cells to invade other areas of the body. These results suggest that the predicted UR and DR genes for the breast and uterine cancer datasets were functionally enriched with cancer-associated GO terms with significant p-values and q-values.

Pathway enrichment analysis of predicted UR and DR genes

Table 5 Ten significant pathways mapped from predicted UR and DR genes of BRCA and UCEC

We performed a pathway enrichment analysis of the predicted UR and DR genes obtained from the bio-test data of the BRCA and UCEC datasets using IPA. In Table 5, we report 10 significant pathways, mapped from the predicted UR and DR genes of BRCA and UCEC, associated with the progression of breast and uterine cancers. We discuss these below.

1. BRCA dataset

    (a) RhoGDI signaling: The main functions of the Rho family of GTPases involve promoting cellular adhesion, proliferation, and metastasis of breast cancer cells. RhoB exerts positive effects by increasing the expression of estrogen receptor alpha (ER\(\alpha\)) and progesterone receptor (PR), which correlate with the progression of breast cancer [31]. Pathway analysis shows that predicted upregulating genes such as RHOQ, CDH5, and FNBP1 are associated with signaling by Rho family GTPases. Prior research provides evidence that Cadherin-5 (CDH5) is a potential biomarker for metastasis of breast cancer [32]. The modulation of RHOB and RHOG regulates the proliferation and differentiation of cancer cells, which influences the prognosis of breast cancer [33].

    (b) ILK pathway: Activation of oncogenes leads to overgrowth of cancer cells, which is a hallmark of the progression of malignant tumors such as breast cancer. Over-expression of integrin-linked kinase (ILK) promotes proliferation and growth of breast cancer cells [34]. Under normal conditions, ILK is involved in cell adhesion, tissue homeostasis, and other critical cardiac functions. It has been found that upregulation of ILK leads to significant acceleration of tumor development in breast cancer. From the bio-test gene list, we find that ITGB7, which DEGnext predicts to be upregulated, is responsible for an altered ILK pathway, leading to abnormal cell proliferation in breast cancer [35]. Similarly, DEGnext predicts upregulation of RHOB, which has been reported to exert positive effects during carcinogenesis of breast tumors [33].

2. UCEC dataset

    (a) PTEN signaling: Within the significant PTEN signaling pathway, we found that some mapped genes, such as ITGA4, MCRS1, and SOS1, were predicted to be upregulated by the DEGnext model. It has been reported [36] that ITGA4 is a potential target for carcinogenesis because overexpression of ITGA4 promotes invasion of tumor cells and metastasis. Similarly, in [37], it was found that genes such as MCRS1 are overexpressed in the advanced stage of cervical cancer. Additionally, in [38], the authors confirmed that overexpression of the SOS1 gene correlates with the progression of cancer.

    (b) Ephrin receptor signaling: In the significant Ephrin receptor signaling pathway, we found several mapped genes, namely ATF4, EPHA6, ITGA4, and SOS1, which were predicted to be upregulated by DEGnext and are related to carcinogenesis. For example, Erythropoietin-producing human hepatocellular (EPH) receptors such as EPHA6 have pro-tumorigenic effects and induce a number of cellular processes, such as adhesion, proliferation, and differentiation, during carcinogenesis of cervical cancer [39].

Discussion

We developed a CNN model called DEGnext to classify UR and DR genes from the DEGs of TCGA RNA-seq cancer datasets. We established that transfer learning combined with the DEGnext model makes the model effective in classifying UR and DR genes from untrained datasets. We compared the performance of DEGnext with five other ML methods, and DEGnext is competitive in terms of accuracy, recall, precision, F-measure, MCC, and ROC scores. In particular, for general learning, Fig. 3 and Additional file 1: Table S1 show that DEGnext performs competitively with, or better than, the other existing ML methods for all 17 datasets. For transfer learning, Fig. 4 and Additional file 1: Table S2 show that, except for the COAD and READ datasets, DEGnext prediction results are better than those of the existing ML methods. For the COAD and READ datasets, a similar discrepancy in precision and recall has been reported in the results for models based on CNNs [12, 13]. The DEGnext model was robust in terms of accuracy and was able to withstand the addition of Gaussian noise.

We validated the biological enrichment of the predicted UR and DR genes from the BRCA and UCEC datasets in terms of GO and pathway enrichment. We found that the predicted UR and DR genes were enriched with cancer-related GO terms with significant p-values and q-values. Similarly, for biological pathways, we found that the predicted UR and DR genes were enriched in pathways associated with breast cancer, such as the ILK pathway and the Rho GTPase signaling pathway. Pathways mapped from the predicted UR and DR genes of the UCEC dataset, such as the PTEN signaling and Ephrin receptor signaling pathways, also play significant roles in carcinogenesis of cervical cancer.

Conclusions

The proposed CNN model, DEGnext, provides a novel approach for predicting UR and DR genes from both trained and untrained datasets using both logFC values and disease-related biological knowledge. The downstream analysis of the predicted UR and DR genes has provided insights into the underlying mechanisms and aided in the identification of prime regulators of carcinogenesis in breast and uterine cancer. Therefore, through the prediction and classification of DEGs, DEGnext may aid in the exploration of potential biomarkers of a disease from other RNA-seq datasets.

Methods

Dataset collection and preprocessing

We used the TCGAbiolinks R package [28] to download 17 datasets (listed in Table 1) from the TCGA portal. Figure 1 is a schematic depiction of the workflow of the methodology, which is described below.

1. Although each of the 17 downloaded datasets had a different sample size, each dataset contained 60,483 mRNA transcripts. We used the following query parameters to download the cancer datasets from the TCGA portal: data category: Transcriptome Profiling; data type: Gene Expression Quantification; workflow type: HTSeq-Counts; sample type: Primary Tumor and Solid Tissue Normal; legacy: FALSE. First, we prepared each dataset as an expression matrix with genes as rows and samples as columns. Out of 60,483 mRNA transcripts, we obtained 56,493 mRNA transcripts that were mapped to the human genome (Genome Reference Consortium Human Build 38, GRCh38).

2. We preprocessed the gene expression data using TCGAanalyze_Preprocessing() with a gene expression cutoff threshold of 0.6 and found that 56,493 mRNA transcripts above this threshold were within the inter-quartile range.

3. We mapped the ENSEMBL identifiers of the mRNA transcripts and kept only those mRNA transcripts which had valid HGNC gene symbols. Of the 56,493 mRNA transcripts, 37,614 genes had valid HGNC symbols.

4. In RNA-seq data, different sample conditions have different sequencing depths and RNA compositions, which may complicate downstream analysis [40]. We normalized the data using TCGAanalyze_Normalization() to adjust for several gene-level effects, such as GC-content and sequencing depth. Internally, TCGAanalyze_Normalization() utilizes the EDASeq package [41] to perform within-lane and between-lane normalization [42]. After normalization for these gene-level effects, 8686 genes remained.

5. For each of the 17 datasets, we filtered the genes using TCGAanalyze_Filtering() with a quantile cutoff of 0.25, yielding 6514 filtered genes (FG) above this threshold. We used strict parameter cutoffs for preprocessing and filtering the genes before DE analysis because the main objective of our approach is to find significant predicted UR and DR genes related to cancer progression.

6. We used TCGAanalyze_DEA() to perform DE analysis on the filtered genes (FG) between normal and tumor conditions with a false discovery rate (FDR) cutoff of 0.01, yielding significant DEGs (SDEGs) for each cancer dataset. Of the 6514 FGs, those not significantly differentially expressed at this FDR cutoff were labeled “2” as neutral genes.

7. The number of SDEGs differed for each dataset, as shown in Table 6. Next, we labeled the SDEGs on the basis of a logFC threshold of 0. That is, if the logFC value of an SDEG was below 0, the DEG was labeled “0” for a down-regulated (DR) gene; if it was above 0, the DEG was labeled “1” for an up-regulated (UR) gene (a minimal sketch of this labeling rule is given after this list).

8. We input the SDEGs for each dataset into the IPA tool [43] to check whether they were related to the specific cancer. We found that between 4 and 47\(\%\) of the significant DEGs were related to the respective cancer, and we categorized them as biologically validated data (“bio data” or Q). The neutral genes and non-biologically validated genes together formed the remaining data (“non-bio data” or P).

9. We split the P and Q data as shown in Fig. 1. The non-bio data (P) were split into non-bio train data (T1) and non-bio test data (T2) in the ratio of 80:20. Similarly, the bio data (Q) were split in the ratio of 80:20 into fine-tune data (F1) and bio-test data (T3), respectively. To avoid bias, we used fivefold cross-validation to test the model.

10. We performed two experiments: one for general training and testing, and the other for transfer learning. For transfer learning, we leveraged the knowledge (features, weights) learned by the previously trained DEGnext model to predict UR and DR genes from 8 untrained datasets.

11. For general learning, we tested the effectiveness of the DEGnext model in classifying UR and DR genes from the respective bio-test data (T3) of all 17 datasets.

12. We checked the generalizability of DEGnext by transferring the learned feature maps to the bio data (Q) of the 8 untrained datasets.

13. We compared the performance of DEGnext against five ML methods, namely DTC, KNC, RFC, SVC, and XGBoost, in terms of accuracy, recall, precision, F-measure, MCC, and ROC scores.

14. We tested the robustness of DEGnext by augmenting the datasets with seven levels of Gaussian noise (1, 10, 50, 100, 500, 1000, and 1500\(\%\)) and compared the results with those of the other ML methods.

15. We used two tools, ToppGene Suite and IPA, for Gene Ontology (GO) and pathway enrichment analysis of the predicted UR and DR genes for BRCA and UCEC datasets, respectively.
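As referenced in step 7, the gene-labeling rule can be sketched in a couple of lines; this is a minimal illustration assuming the significant DEGs are held in a pandas DataFrame `sdegs` with a `logFC` column (the variable and column names are assumptions).

```python
# Hedged sketch of the SDEG labeling rule from step 7:
# logFC < 0 -> label 0 (down-regulated, DR); logFC > 0 -> label 1 (up-regulated, UR).
import numpy as np

sdegs["label"] = np.where(sdegs["logFC"] > 0, 1, 0)
```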

In Table 6, for each dataset, we show the size of the preprocessed and filtered dataset, the number of significant labeled DEGs (SDEGs), and the number of genes in the non-bio train data (T1), non-bio test data (T2), fine-tune data (F1), and bio-test data (T3).

Table 6 Number of genes and samples from preprocessed and filtered gene expression data used in labeling, training, fine-tuning, and testing

DEGnext model construction and implementation

DEGnext is a CNN model proposed to predict UR and DR genes from RNA-seq cancer datasets. We implemented the CNN model using PyTorch, a Python deep learning platform [44], as shown in Fig. 2.

The input to the model is a 1D input vector (\(x_{1}\), \(x_{2}\), \(\dots\), \(x_{n}\)), which represents each gene row of the cancer dataset. This 1D vector is converted to a 2D matrix with one channel using the np.reshape() function. We used a sequence of eight 2D convolutional neural network (CNN) layers (\(C_{1}\), \(C_{2}\), \(\dots\), \(C_{8}\)) with ReLU() as the activation function. Each CNN layer uses a kernel size of (3, 3), a stride of 1, and padding equal to 1. We used a 2D Maxpool layer with a kernel size of 2. In order to make the model inclusive of any input size, we used a 2D AdaptiveMaxPool layer with a target output size of 1 \(\times\) 1. The output of the CNN layers is fed to a sequence of 5 linear layers (\(L_{1}\), \(L_{2}\), \(\dots\), \(L_{5}\)) with ReLU() as the activation function. In DEGnext, we use the activation output of the last linear layer as the feature representation and apply Softmax() to obtain the probability of each class in the range [0, 1]. The values of the key hyperparameters are listed in Table 7.
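A minimal PyTorch sketch of an architecture of this shape is given below. The per-layer channel widths and the linear-layer sizes are not specified in the text, so the values used here are illustrative assumptions; only the layer types and counts, kernel sizes, pooling, and final Softmax follow the description above.

```python
# Hedged sketch of a DEGnext-style architecture; channel widths and linear-layer
# sizes are assumed, not taken from the paper.
import torch
import torch.nn as nn

class DEGnextSketch(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        chans = [1, 8, 8, 16, 16, 32, 32, 64, 64]          # assumed channel widths
        conv_blocks = []
        for i in range(8):                                  # eight Conv2d + ReLU blocks (C1..C8)
            conv_blocks += [nn.Conv2d(chans[i], chans[i + 1],
                                      kernel_size=3, stride=1, padding=1),
                            nn.ReLU()]
        self.features = nn.Sequential(
            *conv_blocks,
            nn.MaxPool2d(kernel_size=2),                    # 2D max pooling, kernel size 2
            nn.AdaptiveMaxPool2d((1, 1)),                   # makes the model input-size agnostic
        )
        self.classifier = nn.Sequential(                    # five linear layers (L1..L5)
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, num_classes),                     # L5 produces one logit per class
        )

    def forward(self, x):
        # x: (batch, 1, H, W), each gene's 1D expression row reshaped to a 2D matrix
        z = self.features(x).flatten(1)
        return self.classifier(z)                           # logits; Softmax() gives probabilities

# Example: gene rows of 16 expression values reshaped to 4x4 single-channel matrices.
probs = torch.softmax(DEGnextSketch()(torch.randn(5, 1, 4, 4)), dim=1)
```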

Table 7 Values of hyperparameters used in DEGnext model

We performed two experiments to test the effectiveness of our model.

Experiment 1 (General Learning): In the first experiment, we used all 17 cancer datasets to train, fine-tune, and test on the corresponding bio-test data (T3) of each dataset. Since the non-bio train data (T1) has three labels, ‘0’, ‘1’, and ‘2’, this training is a three-class problem. For the first level of training, DEGnext runs for 50 epochs with a batch size of 256; it uses CrossEntropyLoss() as the loss function to compute the cross-entropy loss between the output (\(y^{pred}\)) and the true label for a given input x, and optim.Adam() as the optimizer to update the parameters based on the gradients. A predicted class of 0, 1, or 2 classifies the input gene as a DR, UR, or neutral gene, respectively.

For the second level of training, we use the fine-tune data (F1) on the best model from the first level of training for each dataset. Since the fine-tune data (F1) have only ‘0’ and ‘1’ labels, the second level of training is a two-class problem. Here, we used the BCEWithLogitsLoss() loss function to fine-tune the model with a batch size of 64 for each dataset. After training for 31 epochs, the respective models are tested using the bio-test data (T3) of each dataset. The second level of training incorporates both prior disease-related biological knowledge and log2FC estimates (sample variance) of the data into the CNN model, which enables the capture of non-linear gene expression patterns and enhances the prediction performance of the model in determining UR and DR genes. The major advantage of our CNN model is that it allows very efficient transfer learning by reusing the feature-map signatures learned by the trained model.
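A hedged sketch of this two-level training scheme, reusing the DEGnextSketch module from the earlier sketch, is shown below. The learning rate, the dataset wrapping, and the single-logit head used for the two-class fine-tuning step are assumptions not spelled out in the text; the epochs, batch sizes, loss functions, and optimizer follow the description above.

```python
# Hedged sketch of first-level training (three-class, T1) and second-level
# fine-tuning (two-class, F1); hyperparameters not given in the text are assumed.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def first_level_training(model, X_t1, y_t1, epochs=50, batch_size=256):
    """Train on non-bio train data (T1); labels 0 (DR), 1 (UR), 2 (neutral)."""
    loader = DataLoader(TensorDataset(X_t1, y_t1), batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed learning rate
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)                     # model returns 3-class logits
            loss.backward()
            optimizer.step()
    return model

def second_level_finetuning(model, X_f1, y_f1, epochs=31, batch_size=64):
    """Fine-tune on bio fine-tune data (F1); labels 0 (DR) and 1 (UR)."""
    # Swap the last linear layer for a single-logit head so BCEWithLogitsLoss can
    # be applied -- an assumed way of realising the two-class fine-tuning step.
    model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 1)
    loader = DataLoader(TensorDataset(X_f1, y_f1.float()), batch_size=batch_size, shuffle=True)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb).squeeze(1), yb)
            loss.backward()
            optimizer.step()
    return model
```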

Experiment 2 (Transfer learning): For the second experiment, we divided the 17 datasets into two groups: training datasets (BRCA, LIHC, LUAD, LUSC, KIRC, KIRP, PRAD, THCA, and UCEC) and testing datasets (BLCA, CHOL, COAD, ESCA, HNSC, KICH, READ, and STAD). The training datasets are larger than the testing datasets. The testing datasets consisted of 100\(\%\) of biologically validated data. We chose the best fold data for each dataset and trained on 80\(\%\) of the non-bio train data (T1) of all 9 training datasets one after another. For this training, since the non-bio train data (T1) has three labels, ‘0’, ‘1’, and ‘2’, we used CrossEntropyLoss() as the loss function and optim.Adam() as the optimizer, with a batch size of 256. For fine-tuning, we only needed to customize the output layer L5 and remove the final softmax layer to classify the DEGs as ‘0’ or ‘1’. We used the BCEWithLogitsLoss() loss function, with a batch size of 64, to fine-tune the model again with the fine-tune data (F1) of all 9 training datasets. For testing, we did not retrain the model; instead, we used the pretrained model to predict UR and DR genes from all 8 testing datasets.
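The transfer-learning experiment can be sketched as follows, reusing the functions from the previous sketch. The containers `train_sets`, `finetune_sets`, and `untrained_Q` are assumed to hold the (features, labels) tensors of the 9 training datasets and the bio data (Q) of the 8 untrained datasets; re-creating the single-logit head on each fine-tuning call is a simplification of customizing layer L5.

```python
# Hedged sketch of the transfer-learning workflow: sequential training and
# fine-tuning on the 9 training datasets, then prediction on the untrained
# datasets without any retraining. Data containers are assumed.
import torch

model = DEGnextSketch(num_classes=3)

# First-level training on T1 of the nine training datasets, one after another.
for X_t1, y_t1 in train_sets:
    model = first_level_training(model, X_t1, y_t1)

# Second-level training (fine-tuning) on F1 of the nine training datasets.
for X_f1, y_f1 in finetune_sets:
    model = second_level_finetuning(model, X_f1, y_f1)

# Testing: predict UR/DR genes on 100% of the bio data (Q) of the untrained datasets.
model.eval()
with torch.no_grad():
    for X_q in untrained_Q:
        probs = torch.sigmoid(model(X_q)).squeeze(1)
        preds = (probs > 0.5).long()     # 1 = upregulated (UR), 0 = downregulated (DR)
```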