Identification of Potential Biomarkers Using Integrative Approach: A Case Study of ESCC

This paper presents a consensus-based approach that incorporates three microarray and three RNA-Seq methods for unbiased and integrative identification of differentially expressed genes (DEGs) as potential biomarkers for critical disease(s). The proposed method performs satisfactorily on two microarray datasets (GSE20347 and GSE23400) and one RNA-Seq dataset (GSE130078) for esophageal squamous cell carcinoma (ESCC). Based on the input dataset, our framework employs specific DE methods to detect DEGs independently. A consensus based function that first considers DEGs common to all three methods for further downstream analysis has been introduced. The consensus function employs other parameters to overcome information loss. Differential co-expression (DCE) and preservation analysis of DEGs facilitates the study of behavioral changes in interactions among DEGs under normal and diseased circumstances. Considering hub genes in biologically relevant modules and most GO and pathway enriched DEGs as candidates for potential biomarkers of ESCC, we perform further validation through biological analysis as well as literature evidence. We have identified 25 DEGs that have strong biological relevance to their respective datasets and have previous literature establishing them as potential biomarkers for ESCC. We have further identified 8 additional DEGs as probable potential biomarkers for ESCC, but recommend further in-depth analysis.


Introduction
For a critical disease of interest, the knowledge of differentially expressed genes (DEG) are a crucial step toward biomarker identification. This is achieved through Differential Expression Analysis (DEA) which monitors the behavior of each gene in isolation over normal and disease conditions and streamlines the search for biomarkers by providing a candidate list of these discriminative candidate genes. DNA microarray and RNA Sequencing (RNA-Seq) are indispensable methods for DEA. Previously, microarray technology was the most widely used approach. However, there are inherent limitations such as the pre-requisite prior knowledge of the sequence for the array design or the fact that cross-hybridization makes it difficult to analyze highly correlated sequences. Furthermore, the lack of sensitivity to highly or lowly expressed genes as well as the lack of reproducibility across laboratories and platforms pose major challenges. These limitations are overcome by RNA-Seq technology. Dhruba K. Bhattacharyya has contributed to this work.
This article is part of the topical collection "Pattern Recognition and Machine Learning" guest edited by Ashish Ghosh, Monidipa Das and Anwesha Law. Numerous DEA methods have been developed to serve both microarray and RNA-Seq technologies. Furthermore, a large number of datasets related to critical diseases that are compatible with both technologies are widely available. Keeping in mind the fact that most methods developed for these technologies are not effective for all cases, we propose a consensus-based integrative approach that ensembles a selected few of these methods with the aim to achieve improved performance. In other words, in this paper, we present a consensus-based approach that employs a few chosen microarray DEA methods (Limma [1], SAM [2] and EBAM [3]) and RNA-Seq (Limma+Voom [4], edgeR [5], DESeq2 [6]) to present an unbiased list of DEGs as candidates for potential biomarkers. As the primary focus of our work is on the critical disease ESCC, we validate our results on two ESCC microarray datasets (GSE20347 and GSE23400) datasets and one ESCC RNA-Seq (GSE130078) dataset.
The rest of the paper is organized as follows. Section "Related Work" provides a brief overview of related work. Section "Proposed DE Framework" describes our proposed framework for biomarker identification of critical disease(s) employing three microarray and three RNA-Seq methods. We also introduce our consensus function for unbiased integration of DEGs individually detected by previously mentioned methods. Section "Analysis" reports a detailed experimental analysis and validation of our method on two benchmark microarray gene expression datasets and one RNA Sequencing (RNA-Seq) dataset. Section "Discussion" presents a detailed analysis and discussion on candidate genes that have been identified as potential biomarkers for ESCC. In this section, we also present a comparison of our algorithm with two recent works with similar approaches. Finally, the concluding remarks are given in Section "Conclusion".

Related Work
A number of statistical approaches are used by Limma [1] to achieve effectiveness in large-scale expression studies. Limma takes advantage of the flexibility of linear models and fits one to each row (gene) of data in a gene expression matrix where columns correspond to samples. Limma has the inherent ability to model correlations between samples through analysis of the entire dataset as one entity. Significance Analysis of Microarrays (SAM) [2] assimilates a set of gene-specific t-tests to identify genes that exhibit statistically significant changes in expression. For each gene, based on the changes in gene expression in terms of standard deviation for repeated measurements, SAM assigns a score. Potentially significant genes are identified using a threshold for these scores. Empirical Bayes Analysis of Microarrays (EBAM) [3] employs the removal Bayes rule to obtain the posterior probability that a gene was affected or unaffected under the various conditions. EBAM makes multiple testing a possibility by establishing a connection between prior probabilities and local false discovery rate (local fdr) in turn handling the issues that arise from simultaneous tests.
Voom [4] works on the idea that commonly used microarray-based statistical methods can be applied to read counts of corresponding RNA-Seq through robust and non-parametric estimation of mean-variance. In other words, Voom incorporates the mean-variance trend into the empirical Bayes procedure of Limma. edgeR [5] and DESeq2 [6] are estimations of gene-wise dispersion by conditional maximum likelihood, conditioning on the total count for the gene. edgeR effectively borrows information within genes to shrink the dispersion towards a consensus value through the use of empirical Bayes. It adapts for overdispersed data and incorporates an exact test to assess each gene. DESeq2 accurately estimates the expected dispersion value for genes of a given expression strength, which is then used to conform the gene-wise dispersion towards the predicted values . It also accounts for gene-specific variations and makes it possible to estimate fitted curves and testing even in settings with less information.

Proposed DE Framework
The proposed framework aims to work with microarray and RNA Sequencing data. We have chosen three methods that work on micro-array (Limma, SAM and EBAM) and three on RNA-Seq (Limma+voom, DESeq and EdgeR). First, both types of data require pre-processing. For microarray, preprocessing consists of the removal of unwanted and redundant information, normalization of the dataset, missing value estimation while for RNA-Seq data, we perform removal of low read counts, normalization, and transformation.
Pre-processing is followed by DE analysis that results in differentially expressed genes (DEG). For each data type, we employ a consensus function that filters out all common DEGs for each dataset. In other words, depending on the type of the input dataset, the DE analysis unit detects DEGs using three corresponding methods, followed by a consensus function that filters the DEGs common to all three methods as well as identify other relevant DEGs. Our consensus function is given by Here, and are q-value and local.fdr significance values that are chosen according to their relevance to the experiment. Through multiple iterations of implementation, we observed that consideration of only genes common to all three methods leads to information loss. Thus, to overcome this we introduced q-value into the consensus function. The main idea behind this is that while a p-value of 0.05 gives the implication that 5% of the tests will be false positive (FP), q-value, which is an FDR adjusted p-value, implies that 5% of the test found to be significant will be FP. q-value requires a very important adjustment for multiple tests on the same data sample. Our consensus function considers all genes common to all three methods with p = 0.05 as detected DEGs. Furthermore, all genes that are not among the common genes but have a q = 0.05 , i.e. (Eq. 1), are added to the list of DEGs. However, in the microarray datasets, we implement the proposed consensus function given by Eq. 1 to start off by taking the DEGs common to all three methods. Unlike RNA-Seq, instead of q-value, we incorporate its useful counterpart local fdr ( in Eq. 1). Local fdr is a measure of the posterior probability that the null hypothesis is true. We use local fdr since Limma and SAM calculate p-value. EBAM, on the other hand, estimates the posterior probability and local fdr. It is worth mentioning DEGs others = DEGs ∉ DEGs common q -value ≤ , for RNA -Seq, DEGs ∉ DEGs common local.fdr ≤ , for Microarray that posterior probability and p-value are not interchangeable. However, local fdr can be estimated from p-values. Our consensus function (Eq. 1) considers all genes common to all three methods with p = 0.05 as detected DEGs. Furthermore, all genes that are not among the common genes but have a local.fdr = 0.05 ( ) are added to the list of DEGs. The relevant DEGs are then taken as input to the DCE analysis unit. The idea behind performing DCE analysis is that it leads to the creation of biologically relevant modules which are easier for further analysis and validation. The DCE unit identifies differentially co-expressed modules and performs preservation analysis on these modules to identify biologically relevant modules. This is followed by the identification of hub genes in these modules using WGCNA [7] intramodular connectivity.
We validate our results using several approaches. First, we consider all relevant DEGs detected by the DE analysis unit in isolation and perform Gene Ontology (GO) and KEGG pathway enrichment analysis to validate biological relevance. We consider all the hub genes in the biologically relevant modules identified by the DCE unit as biomarker candidates. Furthermore, all DEGs that are annotated with the most enriched GO term in all three databases as well as the most enriched KEGG pathway are also considered as candidates for biomarkers. We term such genes as Top Enriched DEGs (TEDs). Secondly, we further analyze the biomarker candidates through observation of their interactions and behavioral changes. Finally, we trace literature evidence for the relevant genes in other scientific

Analysis
Our focus is on ESCC, a cancer very common in developing countries, especially in North-East India, and is highly attributed to tobacco and betelnut chewing, alcohol consumption as well as poor diet. Two microarray datasets GSE20347 and GSE23400 and an RNA-Sequencing dataset GSE130078 were chosen to validate our proposed DE framework (see Fig. 1). Details of each dataset is provided in Table 1. All three datasets examined gene expression in tumor and matched normal adjacent tissue. The test platform is a DELL workstation with Intel(R) Xeon(R) W-2145 with 3.70GHz processor, 64 GB RAM running Windows 10 Pro for workstations.

Preprocessing
RNA-Seq dataset GSE130078 has 57,783 genes and 46 samples. Large datasets tend to add complications to the analysis and as such, we filter out genes with low read counts. We achieve this by calculating the counts per million (CPM) for each sample for each gene and keep only those genes that have CPM > 1 for at least two samples. This reduces the dataset size from 57,783 to 22,270. We then follow up by normalization of the dataset. We also consider two microarray datasets GSE20347 and GSE23400 for analysis. The inputs to these datasets are expression values of genes across samples. First, we pre-process the data through the removal of unwanted and redundant genes, missing value estimation, and normalization. However, for both GSE20347 and GSE23400, there are no missing values and as such we proceed further down the pipeline.

DE Analysis
For the microarray datasets, Limma takes the pre-processed dataset as input and outputs the equivalent DEGs with a significance of 5% ( p-value ≤ 0.05 ) and False Discovery Rate (FDR) of 0.05. On the other hand, for the other two methods SAM and EBAM, we employ findDelta with FDR = 0.05 fiving us an estimate of the delta values at which FDR is closest to 0.05 and chose accordingly. In SAM, delta is the distance between the observed and the expected test scores, whereas in EBAM, delta is the probability that a gene with a specific test score is differentially expressed. Table 2 summarizes the DEGs detected by all three methods on all three datasets.
In the case of the RNA-Seq dataset, the pre-processed data are the input to all three methods, i.e., Limma+Voom, edgeR and DESeq2. However, it is to be noted that while DESeq2 directly takes the count data as input, the other two methods require the count data to be transformed into a DGEList (Digital Gene Expression Data) object. All the methods perform multiple tests on all the 22,270 genes in the dataset across 46 samples. We consider a significance of 5%, i.e., p-value ≤ 0.05 and the corresponding DEGs detected by the 3 these methods are summarized in Table 2.
We implement the proposed consensus Eq. 1 to identify the common genes detected by these three methods. First, we consider the DEGs detected by all three methods, i.e. common genes. In GS20347, there are such 7706 DEGs. So as not to bypass crucial information, we use in Eq. 1, i.e., the consensus function. With local.fdr = 0.05 ( ) another 662 genes are considered DEGs resulting in a list of 8,368 DEGs. Similarly, in GSE23400, Limma, SAM, and EBAM find 3,431 common DEGs. With local.fdr = 0.05 ( ), another 4,066 genes are considered as DEGs, resulting in a list of 7,497 DEGs. In the case of GSE130078, the three methods Limma+Voom, edgeR, and DESeq2 discover 2,765 common DEGs and a q-value ( ) adds another 9,945 genes resulting in a list of 12,710 DEGs.

DCE Analysis
To analyze the interactions among the DEGs as well as the variations in behavior under normal and disease circumstances, we construct co-expression networks (CEN) using WGCNA [7]. The pipeline for DCE analysis is to detect DEGs by our method is as follows.
1. Divide the dataset into separate datasets: All genes with only normal samples and all genes with only disease samples 2. Choose the soft thresholding power to which co-expression similarity is raised to calculate an adjacency matrix. Soft thresholding power is based on the criterion of approximate scale-free topology. 3. Construct two separate CENs in the form of an adjacency matrix: normal and disease. 4. Transform adjacency matrices into topological overlap matrices (TOM [8]) to minimize the effects of noise and spurious associations 5. Extract all connections that correspond to the subset of DEGs from both CENs. 6. Extract normal and disease modules using hierarchical clustering. 7. Merge modules through eigenvector module selection and MEDissThres threshold merging. 8. Identify modules extracted in the normal dataset that are non-preserved in disease dataset and vice versa through preservation analysis [9]. We consider such modules as modules of interest for further downstream analysis. 9. Identify the top 20 hub genes using intramodular connectivity [7] in all modules of interest.
We start DCE analysis by clustering the samples using the hierarchical approach to detect outlier samples. We remove the outlier samples with the aim of creating a more robust CEN. For GSE23047, as seen in Fig. 2a, b, we find a single outlier in the case of normal samples with a tree cut at height h = 70 (Blue). However, in disease samples, there are two outliers with a cut at h = 130 (Red). Similarly, in the case of GSE23400, as seen in Fig. 2c, d, a tree cut at height h = 105 (Blue) and at h = 95 (Red) removes one and two outliers from normal and disease samples, respectively. In the case of GSE130078, a cut at h = 1,500,000 (Blue) and h = 2,000,000 (Red) removes one normal (Fig. 2e) and one disease Fig. 2f sample.
In GSE20347, hierarchical Clustering and tree cut results in 50 and 75 normal and disease modules, respectively. Figure 3a shows the dendrogram while the first strip of colors below represents the corresponding module colors for the normal dataset. Similarly, Fig. 3b shows the dendrogram for the disease dataset. To merge modules, we choose a height cut of 0.25, corresponding to a correlation of 0.75. Merging of the modules with tree cut at h = 0.25 further reduces the number of modules to 38 and 61 for normal and disease datasets, respectively. The second color strip in Fig. 3a, b shows the colors for the merged normal and disease modules respectively. In GSE23400, hierarchical clustering results in 9 normal (the first color strip in Fig. 3c) and 13 (the first color strip in Fig. 3d) disease modules, which are then reduced to 8 normal (the second color strip in Fig. 3c) and 11 disease (second color strip in Fig. 3d) modules after merging. Finally in GSE130078, hierarchical clustering results in 65 normal (the first color strip in Fig. 3e) and 40 disease (the first color strip in Fig. 3f) modules, which are then reduced to 21 normal (the second color strip in Fig. 3e) and 24 disease (the second color strip in Fig. 3f) modules after merging.
We follow module extraction by module preservation analysis with the aim of analyzing the distinction between preserved and non-preserved modules. According to Langfelder et al. [9], while the preserved modules retain a majority of their co-expressed connections (or edges between two genes), the same cannot be perceived from non-preserved modules. According to Langfelder et al. [9], a module with Z summary < 2 is considered non-preserved [9]. It is noteworthy that, GSE23400 due to its inherent nature, extracts a smaller number of modules with significantly larger sizes and higher densities (Fig. 4). There are no non-preserved modules with Z summary < 2 and most modules are either moderately preserved or highly preserved. We take into consideration moderately preserved modules with Z summary < 10 [9]. Table 3 summarizes the preservation analysis for non-preserved modules in all three datasets. The second column highlights the module preservation reference and test networks. For example, the table reading for module pink in Normal/Disease subset of dataset GSE20347 can be interpreted module pink of size 278 detected in the normal network that is nonpreserved in disease network with a Z summary value of -0.118842161.
We only consider non-preserved modules of substantial size ( size > 100 ) as modules of interest for further downstream analysis and validation. To find the hub genes for each module of interest extracted previously we employ WGCNA intramodular-connectivity proposed by Langfelder et al. [7]. Intramodular-connectivity calculates the connectivity of a node to other nodes in the same module.

Enrichment Analysis of Modules
For a module of interest to be regarded as Gene Ontology (GO) or pathway enriched, at least one gene in the module must be assigned to an enriched GO term or pathway, respectively with a significance of 5% (i.e., p ≤ 0.05 ). To perform functional enrichment analysis, we use the easily available online tool DAVID [10,11]. Table 5 summarizes the percentages of genes in the modules of interest annotated to enriched GO terms as well as enriched KEGG pathways. We observe that all modules of interest are GO and pathway enriched.

Candidate Genes
As mentioned earlier, we select DEGs as candidates for potential biomarkers based on the following two criteria: 1. All hub genes detected by the DCE analysis unit of our framework in all modules of interest are candidate genes. 2. DEGs that have been annotated to the most enriched GO terms in all three GO databases (BP, CC and MF) and are also annotated to the most enriched pathway after GO and Pathway enrichment analysis on the entire dataset are also considered as potential biomarkers. We rename these DEGs as TEDs (Top Enriched DEGs) Thus, alongside all DEGs that are among Top 20 hub genes in modules of interest (as summarized in Table 4), our second criterion adds 22, 18 and 11 TEDs to the list of candidate genes in GSE20347, GSE23400 and GSE130078, respectively. We summarize these DEGs GSE130078 (e, f) (TEDs) in Table 6. The numbers of candidate genes for GSE20347,GSE23400 and GSE130078 increase from 140, 60 and 160 to 162, 78 and 171, respectively. We perform the enrichment analysis on the entire dataset or in more specific terms the list of all genes in the dataset. This leads to the observation that as the lists of genes in GSE20347 (22,278 genes) and GSE23400 (22,283 genes) are almost the same, the list of top enriched genes (35 genes) extracted are the same. However, the differences in TEDs are seen (Table 6) due to the fact that there are DEGs identified in one dataset that might not be detected in the other. all modules between the red and blue lines ( 2 < Z summary < 10 ) are weak to moderately preserved and all modules above the blue line ( Z summary > 10 ) have strong evidence of being preserved

Biological Analysis
To establish the biological relevance of the candidate genes detected by our method, we use functional enrichment analysis and the construction of a gene regulatory network (GRN). Transcription Factors (TF) have remarkable diversity as well potency as drivers of cell transformation. Bhagwat et al. [12] justify the continued pursuit of TFs as potential biomarkers across many forms cancer by the prevalent deregulation of the same. We observe that 26 (hub genes:21, TEDs:5), 11 (hub genes:6, TEDs:5) and 23 (hub genes:23, TEDs:0) candidate genes detected by our method in GSE20347, GSE23400 and GSE130078, respectively are TFs. These TFs exhibit regulatory behavior in their respective modules, establishing their biological relevance. For easy visualization, we extract a manageable subset of hub genes from the nonpreserved modules detected by our method (Figs. 5a-f and 6a-f). We construct a Gene Regulatory Network (GRN) with these hub genes and associated Transcription Factors (TFs) so as to observe the regulatory behavior of the corresponding genes. The resulting GRN is in the form of an adjacency list with weighted directed edges from TFs to other target genes (TGs).
As in the the case of validation of modules, we employ DAVID [10,11] to perform functional enrichment analysis of all candidate genes detected by our method. A candidate gene can be regarded as GO enriched considering a GO database (GO_BP, GO_CC, GO_MF) if it is annotated to at least one GO term in that database with significance of 5% ( p ≤ 0.05 ). Tables 7, 8 and 9 summarize the candidate genes annotated to the top 3 GO terms in each GO database in GSE20347, GSE23400 and GSE130078, respectively. Similarly, a candidate gene is KEGG pathway enriched if it is annotated to at least one KEGG pathway term with significance of 5%. Table 10 summarizes the candidate genes annotated to top 3 enriched KEGG pathways in GSE20347, GSE23400 and GSE130078.

Literature Trace
Zhu et al. [13] highlight that prothymosin alpha (PTMA) expression was up-regulated in ESCC tissues, thus presenting PTMA as a potential candidate for ESCC. Tang et al. [14] indicated that the expression of PTPRF interacting protein alpha 1 is significantly increased and is related to some malignant clinical features and poor outcomes in ESCC patients, thus establishing it as a valuable biomarker for early detection, treatment formulation and prognosis evaluation for ESCC. Jiang et al. [15] suggested that downregulation of VGLL4 was very important in the progression of ESCC, and restoring the function of VGLL4 might be a promising therapeutic strategy for ESCC. In Shen et al. [16], homer scaffolding protein 3 (HOMER3) is one of the three genes presented as candidate cancer-associated genes and may play a tumorigenic role in ESCC. Ma et al. [17] summarized that upregulation of Proteasome 26S subunit non-ATPase 4 (PSMD4) promotes the progression of ESCC through the reduction of ERS-induced cell apoptosis. Chen et al. [18] found that  [24] concluded that a lower expression of Processing Of Precursor 7 (POP7) predicts a worse prognosis in esophageal cancer. Qiu et al. [25] suggested that through activation of AKT1/mTOR signaling pathway, maintenance complex component 7 (MCM7) promotes tumor cell proliferation, colony formation and migration of ESCC cells. Choy et al. [26] and [27] further suggested MCM7 as a more sensitive proliferation markers for evaluation and for predicting various clinical outcomes of ESCC respectively. Miyazaki et al. [28] concluded that ephrin receptor A2 (EphA2) overexpression appears to be related to poor degree of tumor  [33] demonstrated that the expression of cyclin-dependent kinase subunit 2 (CKS2) in ESCC was elevated relative to levels in normal tissue, and that CKS2 overexpression is associated with the depth of tumor invasion, lymphatic invasion, clinical stage, distant metastasis and poor prognosis. Zheng et al. [34] found that the expression of cytokine induced apoptosis inhibitor 1 (CIAPIN1) was statistically correlated with the degree of differentiation, depth of invasion, and lymph node metastasis of ESCC and thus has been considered as a valuable prognostic indicator in ESCC. Zhao et al. [35] highlighted that Protein arginine methyltransferase 1 (PRMT1) activates and maintains esophageal TICs by mediating transcription alteration through histone H4 arginine methylation. Zhou et al. [36] highlighted that PRMT1 activates Hedgehog signaling and up-regulated the expression of target genes downstream of Hedgehog signaling thus taking an oncogenic role of PRMT1 in the progression of ESCC. Zhang et al. [37] provided evidence that Human Leukocyte Antigen-F (HLA-F) antigen expression was associated with survival in patients with ESCC. Yie et al. [38] established that Human Leukocyte Antigen-G (HLA-G) expression has a strong and independent prognostic value in human ESCC. According to Sato et al. [39], high chemokine (CXC motif) ligand 10 (CXCL10) expression is an independent prognostic factor and has the potential to serve as a clinically useful marker of the need for adjuvant chemotherapy after surgery in patients with advanced thoracic ESCC. Yuan et al. [40] suggested the tumor promotion role of Interferon-stimulated gene 15 (ISG15) in ESCC via c-MET/  [57] indicated that insulin-like growth factor 2 mRNA-binding protein 2 (IGF2BP2) serves a major carcinogenic role in ESCC. According to [58], the expression of vascular endothelial growth factor C (VEGF-C) correlates with lymph node metastasis and poor prognosis. Similarly, [59] suggests that VEGF-C expression in ESCC may play a great key role in lymphatic spread. Feng et al. [60] indicated that ras-like without CAAX1 (RIT1) displays tumor-suppressing functions in ESCC, and these functions were carried out by inhibiting MAPK and PI3K/AKT signaling pathway, inhibiting EMT, and downregulating cancer stemness of ESCC cells. Zhou et al. [61] demonstrated that Dehydrogenase/reductase member 2 (DHRS2) had an important part in ESCC development and progression. Li et al. [62] suggested a tumor suppression function for RNA Binding Motif Single Stranded Interacting Protein 3 (RBMS3) gene in ESCC. According to Wang et al. [63], UBE2T is involved in the development of ESCC, and gene signatures derived from UBE2T-associated genes are predictive of prognosis in ESCC. Gao et al. [64] demonstrated that High Mobility Group Box 3 (HMBG3) may be a potential molecular marker for predicting the prognosis of ESCC patients. According to Huang et al. [65], cyclin-dependent kinase 4 (CDK4) amplification was identified as an independent prognostic factor for survival, which could be incorporated into the tumornode-metastasis staging system to refine risk stratification of patients with esophageal squamous cell carcinoma. Ling et al. [66] suggested that MutS Homolog 2 (MSH2) methylation in the plasma would be a good predictor of DFS for these ESCC patients before oesophagectomy. Xu et al. [67] identified Estrogen-related receptor gamma (ESRRG ) as one of four molecular markers that may be helpful  :0007165 signal transduction  PRKACB, BCR, AR, LYN, APPL1, STAT1, SHC1, NFKB2, PIK3CD, PIK3CB, PIK3R2,  PIK3R1, EXT2, RAC2, PPFIA1, RAF1, GNB5, GSK3B, MAPK10, KRAS, RARA,  MAP2K1, TXNRD1, FLT3LG, FADD,

Discussion
In Table 11, we give a detailed summary of all DEGs that have been identified by our method as candidates for potential biomarkers for ESCC. In our method, we consider strong literature evidence for association with ESCC and five other SCCs related to ESCC as the necessary criterion for a candidate gene to be a potential biomarker, and the findings from literature are summarized in Table 11. In the table, we also highlight the enriched GO terms and pathways to which the candidate genes has been annotated. Furthermore, it also details whether the same is a hub gene, a transcription factor (TF) or whether it is upregulated or down-regulated. A DEG is upregulated if logFC > 0 and downregulated when logFC < 0 . We take into consideration logFC values calculated by limma for the microarray datasets, and edgeR in the RNA-Seq dataset. The biological relevance of a candidate to its respective dataset is considered based on three criteria: (a) Annotated to at least one GO term in 2 of 3 GO databases with p value ≤ 0.05, (b) Annotated to at least one KEGG pathway with p value ≤ 0.05 , and (c) It's a TF and thus exhibits regulatory behavior towards other DEGs in the network.
For a candidate gene to be considered a potential biomarker, we consider following four cases.
In Table 12, we put forward a comparison between our work and two recent works presented by Patowary et al. [166] and Hu. et al. [44] that perform DE analysis by employing approaches and methods similar to our work.

Conclusion
All six methods, three microarray and three RNA-sequencing, employed by the proposed integrative approach based differential expression (DE) Analysis framework were found effective in extracting differentially expressed genes (DEGs) with a p-value of 0.05. We further proposed a consensus function that takes into account the information loss due to the DEGs common to all three respective methods and  BCR, AR, APPL1, CKS2, RARA, STAT1, MAP2K1, NFKB2, PIK3CD, PIK3CB,  TXNRD1, FLT3LG, PIK3R2, PIK3R1, FADD, RAC2, FAS, RAF1, TGFB2, GNAQ,  GNB5, GSK3B, MAPK1, MAPK10, KRAS  hsa04010:MAPK signaling pathway  PRKACB, FLT3LG, EPHA2, RAC2, FAS, MAP2K1, RAF1, MAP3K20, TGFB2,  NFKB2, Table 11 Summary of potential biomarkers identified by our proposed framework (Fig. 1 further employs local fdr (for microarray) and q-value (for RNA-Seq). Through differential co-expression (DCE) and preservation analysis, we studied the behavioral changes among the DEGs under normal and disease circumstances. All non-preserved modules of reasonable sizes are considered modules of interest and analyzed further down the pipeline. DEGs are considered candidates for potential biomarkers for ESCC when they are either: (a) hub genes in the modules of interest, or (b) Top Enriched DEG (TED), i.e., a DEG annotated to the most enriched GO term in all three GO databases as well as the most enriched KEGG pathway in their respective datasets. Our proposed framework was validated on two microarray datasets (GSE20347 and GSE23400) and one RNA-Sequencing dataset (GSE130078). From, 7, 3 and 8 modules of interest in GSE20347, GSE23400 and GSE130078 respectively, 124, 59 and 160 hub genes were identified. The consideration of 22, 18 and 16 TEDs detected by GSE20347, GSE23400 and GSE130078 respectively results in 146, 77 and 176 as candidates for potential biomarkers of ESCC. Biological relevance for each candidate to their respective datasets is analysed based on (a) annotation to enriched GO terms in the GO databases, (b) annotation to enriched KEGG pathways, and (c) if the candidate gene is a transcription factor in a gene regulatory network. Another very important criterion that we considered for a candidate gene to be a potential biomarker is previous literature that has either (a) established them as potential biomarkers for ESCC itself, or (b) established them as potential biomarkers for 5 other SCCs related to ESCC, namely, Oral SCC, Tongue SCC, Lung SCC, Head and Neck SCC and Laryngeal SCC. Our method identified 4 candidate genes, STAT1, HIF1A , DNMT3B and MCM7, that are transcription factors (TFs), have strong biological relevance to their respective datasets and have previous literature works that establish their role as potential biomarkers in ESCC. Our method identified GSK3B , detected as DEG by both microarray datasets (GSE20347 and GSE23400), as a TED and has both strong literature evidence as potential biomarker of ESCC and we established its strong biological relevance to both microarray datasets. Similarly, nine (HOMER3 , PSMD4, PSTAT1, TFRC, MCL1, EPHA2, KPNA2, CKS2 and PRMT1), seven (HLA-F, HLA-G, CXCL10, ISG15, PSMA3, FCGR2A and C3AR1) and four (CAV1, VEGFC, CDK4 and MSH2 ) candidates genes in GSE20347, GSE23400 and GSE130078 are established as potential biomarkers for ESCC. We further identified 3 (PTMA, VGLL4 and NONO), 1 (PLEK2) and 2 (HMGB3 and ESRRG ) TFs in GSE20347, GSE23400 and GSE130078 respectively, that have strong literature evidence as potential biomarkers of ESCC, but have moderate evidence for biological relevance to their respective datasets, and thus can be regarded as probable potential biomarkers for ESCC. On the other side of the spectrum, the transcription factor AR, a TED that is identified as a DEG in both microarray datasets and PSMC2 have strong biological relevance to their datasets, but have been identified as potential biomarkers for other SCC related to ESCC. They can also be considered probable potential biomarkers for ESCC, but need further in-depth analysis. For future work, there is scope for improvement in the framework in general and consensus function specifically, towards detection of a smaller number of DEGs with minimum loss of relevant information.

Declarations
Conflict of interest On the behalf of all authors, the corresponding authors states that there is no conflict of interest.