Background

Liver cancer is the second most deadly cancer in terms of mortality rate, with a very poor prognosis [60]. It accounted for 9.1% of all cancer deaths, and 83% of the annual new estimated 782,000 liver cancer cases worldwide occur in developing countries [13]. Liver cancer showed the greatest increase in mortality in the last decade for both males (53%) and females (59%) [8]. Liver hepatocellular carcinoma (LIHC) or simply hepatocellular carcinoma (HCC) is the most common type of liver cancer, accounting for nearly 85% of liver cancers. 78% of all reported cases of HCC were due to viral infections (53% Hepatitis B virus and 25% Hepatitis C virus) [38]. There are several non-viral causes of HCC as well, mainly aflatoxins and alcohol [10]. As shown in Fig. 1, all the factors converge to a common mechanism of genetic alterations that lead to the acquisition of cancer hallmarks [20] and the eventual emergence of a cancer cell [11]. Genetic alterations constitute the heart of the problem, and studying changes due to these genetic alterations is paramount to understand HCC. Earlier gene expression studies using EST data detected differential expression in cancer tissue compared to non-cancerous liver and proposed the existence of genetic aberrations and changes in transcriptional regulation in HCC [58]. The Cancer Genome Atlas (TCGA) research network [41] have subtyped and identified many potential targets for HCC based on a comprehensive multi-omics analysis. An independent analysis of TCGA RNA-Seq data encompassing 12 cancer tissues has uncovered liver cancer-specific genes [37]. Zhang et al. [63] have performed mutation analysis of HCC, and Yang et al. [59] combined TCGA expression data and natural language processing techniques to identify cancer-specific markers.

Fig. 1
figure 1

Major causative pathways of hepatocarcinogenesis. All pathways converge to progressive genomic alterations, leading a normal cell to acquire the hallmarks of cancer

The burden of disease and mortality rate are both inversely correlated with the cancer stage. The response rate to therapy is also inversely correlated with stage. To the best of our knowledge, there are no reported research in the literature that have dissected the stage-specific features of HCC. The cancer staging system is based on gross features of cancer anatomical penetration, and one such standard is the American Joint Committee on Cancer (AJCC) Tumor-Node-Metastasis (TNM) staging [2]. It is reasonable to hypothesize that the stage-specific gross changes are associated with signature molecular events, and try to probe such molecular bases of stage-wise progression of cancer. We had earlier published on stage-specific “hub driver” genes in colorectal cancer [36]. A stage-focussed analysis of colorectal cancer transcriptome data yielded negative results vis-a-vis the AJCC staging system [25].

Methods

Data preprocessing

Normalized and log2-transformed Illumina HiSeq RNA-Seq gene expression data processed by the RSEM pipeline [29] were obtained from TCGA via the firebrowse.org portal [6]. The patient barcode (uuid) of each sample encoded in the variable called ‘Hybridization REF’ was parsed and used to annotate the controls and cancer samples (Fig. 2). To annotate the stage information of the cancer samples, we obtained the clinical information dataset for HCC from firebrowse.org (LIHC.Merge_Clinical.Level_1.2016012800.0.0.tar.gz) and merged the clinical data with the expression data by matching the “Hybridization REF” in the expression data with the aliquot barcode identifier in the clinical data. The stage information of each patient was encoded in the clinical variable “pathologic stage”. The pathologic stage is essentially the surgical stage, prior to any treatment received, determined with the tissue obtained at the time of surgery. This interpretation is reinforced in the TCGA HCC sample inclusion criteria as follows: “Surgical resection of biopsy biospecimens were collected from patients diagnosed with hepatocellular carcinoma (HCC), and had not received prior treatment for their disease (ablation, chemotherapy, or radiotherapy)” (The TCGA [41]). The availability of this unequivocal information enables the analysis of cancer stages. The substages (A,B,C) were collapsed into the parent stage, resulting in four stages of interest (I, II, III, IV). We retained a handful of other clinical variables pertaining to demographic features, namely age, sex, height, weight, and vital status. With this merged dataset, we filtered out genes that showed little change in expression across all samples (defined as σ < 1). Finally, we removed cancer samples from our analysis that were missing stage annotation (value ‘NA’ in the “pathologic stage”). The data pre-processing was done using R (www.r-project.org).

Fig. 2
figure 2

TCGA ‘Hybridization REF’ Barcode. The first 10 characters constitute an anonymized unique patient identifier and the following two characters denote whether the sample is tumor or normal

Linear modelling

Linear modelling of expression across cancer stages relative to the baseline expression (i.e, in normal tissue controls) was performed for each gene using the R limma package [42]. The following linear model was fit for each gene’s expression based on the design matrix shown in Fig. 3a:

$$ \mathrm{y}=\upalpha +{\upbeta}_1{\mathrm{x}}_1+{\upbeta}_2{\mathrm{x}}_2+{\upbeta}_3{\mathrm{x}}_3+{\upbeta}_4{\mathrm{x}}_4 $$
(1)
Fig. 3
figure 3

Design matrices. a In the linear modeling, the control samples served as the baseline expression (intercept) of each gene against which the stage-specific expression was estimated. b the design matrix for the contrasts analysis

where the independent variables are indicator variables of the sample’s stage, the intercept α is the baseline expression estimated from the controls, and βi are the estimated stagewise log fold-change (lfc) coefficients relative to controls. The linear model was subjected to empirical Bayes adjustment to obtain moderated t-statistics [34]. To account for multiple hypothesis testing and the false discovery rate, the p-values of the F-statistic of the linear fit were adjusted using the method od Hochberg and Benjamini [22]. The linear trend across cancer stages for the top significant genes were visualized using boxplots to ascertain the regulation status of the gene relative to the control.

Monotonic mean expression

The linear model in eqn. (1) would not be sufficient to identify genes with an ordered monotonic trend of expression across cancer stages. Addressing this question would also help assess whether monotonic changes of gene expression were observed with disease progression. Towards this end, we designed a model of gene expression where the cancer stage was treated as a numeric variable:

$$ \mathrm{y}=\mathrm{aX}+\mathrm{b} $$
(2)

where X takes a value in [0,1,2,3,4] corresponding to the sample stage: [control, I, II, III, IV], respectively. It was noted the mean expression of a gene could show the following monotonic patterns across cancer stages:

  1. (i).

    monotonic upregulation, where mean expression follows: control < I < II < III < IV.

  2. (ii).

    monotonic downregulation, where mean expression follows: control > I > II > III > IV.

The sets of genes conforming to either (i) or (ii) were identified to yield monotonically upregulated and monotonically downregulated genes. These two sets were merged, and the final set of genes with monotonic changes of expression with cancer progression was obtained. This final set was ranked by the adj. p-values from the model estimated by eqn. (2).

Pairwise contrasts

To perform contrasts, a slightly modified design matrix shown in Fig. 3b was used, which would give rise to the following linear model of expression for each gene:

$$ \mathrm{y}={\upbeta}_0{\mathrm{x}}_0+{\upbeta}_1{\mathrm{x}}_1+{\upbeta}_2{\mathrm{x}}_2+{\upbeta}_3{\mathrm{x}}_3+{\upbeta}_4{\mathrm{x}}_4 $$
(3)

where the controls themselves are one of the indicator variables, and the βi are all coefficients estimated only from the corresponding samples. Our first contrast of interest, between each stage and the control, was achieved using the contrast matrix shown in Table 1. Four contrasts were obtained, one for each stage vs control. A threshold of |lfc| > 2 was applied to each such contrast to identify differentially expressed genes (with respect to the control). We used the absolute value of the lfc, since driver genes could be either upregulated or downregulated. Genes could be differentially expressed in any combination of the stages or no stage at all. To analyze the pattern of differential expression (with respect to the control), we constructed a four-bit binary string for each gene, where each bit signified whether the gene was differentially expressed in the corresponding stage. For example, the string ‘1100’ indicates that the gene was differentially expressed in the first and second stages. There are 24 = 16 possible outcomes of the four-bit string for a given gene corresponding to the combination of stages in which it is differentially expressed. This is illustrated in set-theoretic terms in Fig. 4. In our first elimination, we removed genes whose |lfc| < 2 for all stages. For each remaining gene, we identified the stage that showed the highest |lfc| and assigned the gene as specific to that stage for the rest of our analysis.

Table 1 Contrast matrix with control. Each stage (indicated by '1') is contrasted against the control (indicated by '-1') in turn
Fig. 4
figure 4

A Venn representation of the pairwise stages contrasts. A gene could be differentially expressed in any combination of the four stages and this could be represented by a 4-bit string, one bit for each stage. For e.g., ‘1111’ at the overlap of all four stages would be assigned to genes that are differentially expressed in all four stages

Significance analysis

We applied a four-pronged criteria to establish the significance of the stage-specific differentially expressed genes.

  1. (i).

    Adj. p-value of the contrast with respect to the control < 0.001. The expression profile of a driver gene in cancer samples would markedly depart from that for the controls, which motivates the use of a stringent threshold here.

  2. (ii).

    (ii)-(iv) P-value of the contrast with respect to other stages < 0.05. The use of a more relaxed cutoff would improve the sensitivity of stage-specific detection.

To obtain the above p-values (ii) - (iv), we used the contrast matrix shown in Table 2, which was then used an an argument to the contrastsFit function in limma.

Table 2 Contrast matrix for inter-stage contrasts. There are six possible pairwise contrasts between the stages that are essential to identifying stage-specific genes

Further analyses

Principal component analysis (PCA) were performed using prcomp in R. To choose 100 random genes, we used the rand function. Gene set enrichment analysis were performed on KEGG (https://www.genome.jp/kegg/) and Gene Ontology [5] using kegga and goana in limma, respectively. In order to visualize outlier genes that are significant with a large effect size, volcano plots could be obtained by plotting the -log10 transformed p-value vs. the log fold-change of gene expression. Heat maps of significant stage-specific differentially expressed genes were visualized using heatmap and clustered using hclust. Novelty of the identified stage-specific genes was ascertained by screening against the Cancer Gene Census v84 [14].

Results

The TCGA expression data consisted of expression values of 20,532 genes in 423 samples. After the completion of data pre-processing, we obtained a final dataset of expression data for 18,590 genes across 399 samples annotated with the corresponding sample stage (available in Supplementary File S1). The stagewise distribution of TCGA samples along with the corresponding AJCC staging is shown in Table 3. A statistical summary of demographic details including age, sex, height, weight, and vital status is shown in Table 4. The body mass index (BMI) distribution was derived from patient clinical data that had both height and weight (i.e, neither was ‘NA’). The average age of onset of HCC was around 60 years, and the average BMI was about 26, indicating a possible link with ageing-associated pathology and obesity.

Table 3 AJCC Cancer staging. The correspondence between the AJCC staging and the TCGA staging for LIHC is noted, along with the number of LIHC cases in each stage in the TCGA dataset. Control indicates the number of normal tissue control samples, and NA denotes cases where the stage information is unavailable
Table 4 Summary of key demographic features of the dataset. For continuous variables (age,height, weight and BMI), the mean ± standard deviation is given. BMI is calculated only for patients with both height and weight data

The dataset was processed through voom in limma to prepare for linear modelling [28]. At a p-value cutoff of 0.05, 14,843 genes were significant for the linear model given by eqn. (1). Even raising the bar to 1E-5, 9618 genes remained significant in the linear modelling, thus implying a strong linear trend in their expression across cancer stages relative to control. This was not entirely surprising since one of the hallmarks of cancer phenotype is genome-wide instability [20]. The linear modelling highlighted top ranked genes, some upregulated in HCC (GABRD, PLVAP, CDH13) and some downregulated (CLEC4M, CLEC1B, CLEC4G). The lfc for each stage with respect to control of top ten genes (ranked by adjusted p-value) are shown in Table 5, along with their inferred regulation status. Boxplots of the expression of the top 9 genes (Fig. 5) indicated elevated expression across cancer stages relative to control for up-regulated genes, while depressed expression across cancer stages relative to control was indicative of downregulated genes. (Boxplots of all other genes in the top 200 are provided in the Supplementary Fig. S1) It is worthwhile to note that a given gene might have maximal differential expression in any stage (not necessarily stage 4), and the linear trend does not suggest the order of expression across stages (Fig. 6).

Table 5 Top 10 genes of the linear model. The log-fold change expression of the gene in each stage relative to the controls are given, followed by p-value adjusted for the false discovery rate, and the regulation status of the gene in the cancer stages with respect to the control
Fig. 5
figure 5

Boxplots of top 9 linear model genes. For each gene, notice that the trend in expression could be either overexpression or downregulation relative to the control. For e.g., GABRD, PLVAP, CXorf36, CDH13 and UBE2T are overexpressed, while CLEC4M, CLEC1B, BMP10, and CLEC4G are downregulated. It could be seen that a linear trend does not imply maximal |lfc| in stage 4, as illustrated most clearly in the case of UBE2T

Fig. 6
figure 6

Boxplots illustrating stage-specificity of differentially expressed genes. Extremal expression in a stage could be either maximal expression or minimal expression relative to the control and all other stages, and could be termed maximal differential expression. Here we show genes with maximal differential expression in stage-I (WDR72; minimum expression), stage-II (GLI4, maximum expression; COLEC11, minimum expression), stage-III (CKAP2; maximum expression), and stage-IV (MAPK11; maximum expression)

A PCA of the top 100 genes from the linear model was visualized using the top two principal components (Fig. 7a). A clear separation of the controls and the cancer samples could be seen, suggesting the extent of differential expression of these genes in cancer samples. Hence linear modelling yields cancer-specific genes versus normal controls, and the results for the all the genes, including the top 100, are provided in order in Supplementary File S2. For comparison, a PCA plot of 100 randomly sampled genes (Fig. 7b) failed to show any separation of the cancer and control samples.

Fig. 7
figure 7

Principal components analysis of cancer vs control. a The first two principal components of the top 100 genes from linear modeling are plotted. It could be seen that control samples (red) clustered independent of the cancer samples (colored by stage). b The same analysis repeated with 100 random genes failed to effect a clustering of the control samples relative to the cancer samples

To ascertain an ordered trend of expression across cancer stages, the linear model given by eqn. (2) was fit. At a p-value of 0.05, 14,127 genes were significant, and raising the bar to 1E-5 still left 8032 genes significant. A goodness of fit with eqn. (2) does not equate with a monotonic trend of expression; i.e., a a gene with a significant linear fit is not required to follow a monotonic trend of mean expression with cancer stage. Using the definition of monotonicity given in the Methods section, we found 2109 genes showing strictly monotonic expression with the cancer stage and reaching maximum absolute mean expression in stage IV. Each such gene was annotated and ranked with the p-value from eqn. (2). This yielded 1977 genes with significant (i.e, p-val < 0.05) monotonic trends of mean expression across cancer stages, with 1602 upregulated and 375 downregulated. The top 20 such genes are presented in Table 6.

Table 6 Top 20 genes with significant monotonic patterns of expression. Intercept, Coefficient and Adj. p-value are from the linear model given by eqn.(2). Status indicates monotonic upregulation (UP) or monotonic downregulation (DOWN). The genes are sorted by significance (adj.p-value)

The results from the linear modelling were in contrast with those obtained by Huo et al. [25] and were most likely driven by an improved design and the inclusion of 51 controls in our study. These positive results provided the impetus to pursue stage-driven analysis. Given the conventional AJCC staging, gene expression differences would play a major role in driving the cancer progression. To identify the stage-specific differentially expressed genes, we applied the first contrast matrix (Table 2) and constructed the four-bit stage string of each gene. Based on the stage strings, we binned all the genes, and the string-specific gene lists corresponding to all the partitions in the Venn diagram (Fig. 4) is made available in Supplementary File S3. The size of each such partition is illustrated in Fig. 8. We eliminated the 16,135 genes corresponding to the stage string ‘0000’ (|lfc| < 2 in all stages). To establish the significance of the remaining genes, we applied the second contrast (Table 3) and passed each gene through the four filter criteria. The gradual reduction in candidate stage-specific genes as each criterion was applied, is shown in Table 7. Only genes that passed all criteria were retained as significant stage-specific differentially expressed genes. We obtained 2 stage-I specific, 2 stage-II specific, 10 stage-III specific and 35 stage-IV specific genes (Table 8). Figure 9 shows the volcano plot of these 49 stage-specific genes.

Fig. 8
figure 8

Venn illustration of the size of each 4-bit string. The numbers of genes with each pattern of differential expression are shown

Table 7 Number of genes in each step of the significance analysis. Differential expression is defined with respect to a threshold |logFC| = 2. Significance analysis proceeds first by significance (i.e, p-value) with respect to control, followed by p-value in each possible pairwise contrast between the different stages. Exclusive DE genes refer to genes differentially expressed in only one of the four stages (corresponding to the bit strings '1000', '0100', '0010' and '0001')
Table 8 Final set of highlighted genes in each stage. The genes in each stage are ordered by increasing adjusted p-values of the linear modelling analysis. Stage-IV specific genes with monotonic changes of expression correlating with disease progression are highlighted
Fig. 9
figure 9

Volcano plot of the 49 significant stage-specific differentially expressed genes. Stage 1 genes, red; Stage 2, blue; Stage 3, green; and Stage 4, orange. The genes are seen to orient away from the origin and the axes, indicating significance and effect size

In view of the limited sample size for stage-IV and consequent low power for rejecting false-positives, we stipulated that each stage-IV specific gene would display a smooth increasing or decreasing expression trend through cancer progression culminating in maximum differential expression in stage-IV. On this basis, we pruned the 35 stage-IV specific genes to just the top ten by significance in the linear modelling. This yielded a total of 24 stage-specifc genes of interest.

A heatmap of the lfc expression of these stage-specific genes across the stages was generated (Fig. 10a) and revealed a systematic gradient in expression relative to control, involving both downregulation and overexpression. The map was clustered on the basis of differential expression (i.e, |lfc|) both across stages and across features (i.e, genes) (Fig. 10b). It was seen that stage I genes clustered together, stage II genes co-clustered with NCAPG2 and DLG5 from stage-III, all the other stage-III genes clustered together, while the stage-IV genes formed two separate clusters. It was interesting to note that GABRD emerged as an outgroup to all the clusters, demonstrating its uniqueness.

Fig. 10
figure 10

Heatmap plots of the final 24 stage-specific genes. a heatmap generated from the lfc values of all the stage-specific genes (arranged stagewise). The color gradient spans the spectrum from downregulation (blue) to overexpression (red). Log fold changes upto sixfold are seen, indicating 64 times differential expression with respect to control. b Representation of the stagewise gene expression based on clustering of differential expression profiles

To identify the biological processes specific to each stage, we used the genes with maximal |lfc| in each stage and performed a stagewise gene set enrichment analysis on two ontologies, the GO and KEGG pathways. Salient results with respect to KEGG pathways are presented below (Table 9) and the complete KEGG and GO results are available in Supplementary Tables S1 and S2, respectively. In stage I, we found the significant enrichment of cell-cycle signaling pathways (Hippo, Wnt, HIF-1), and viral infection-related pathways (cytokine-cytokine receptor interaction, human papillomavirus infection, HTLV-I infection). In stage II, key signalling pathways (Ras, MAPK) were aberrant. Two liver-specific pathways, alcoholism and cytochrome P450 mediated metabolism of xenobiotics were enriched, as well as standard cancer pathways of bladder, brain, stomach, and skin that might involve generic genetic alterations necessary for cancer cell growth. In stage III, we noticed the significant enrichment of Metabolic pathways that summarize cellular metabolism. This might indicate the metabolic shift needed by the cancer to grow and invade neighboring tissues. Other salient significantly enriched pathways pertained to increased cell cycle progression, DNA replication, chemical carcinogenesis, p53 signaling pathway and cellular senescence, all hallmark processes critical to cancer progression. Stage IV gene set was significantly enriched for bile-related processes (bile secretion, primary bile acid biosynthesis), and ABC transporters (possibly conferring a drug-resistant advanced cancer phenotype). A signaling pathway related to diabetic complications was enriched as well, indicating the role of co-morbidities in driving liver cancer progression. The enrichment analysis of the top 100 genes of the linear model is included in the Supplementary Table S3.

Table 9 Gene set enrichment analysis. Stage-specific gene sets (all the differentially expressed genes, corresponding to row 'DE genes' in Table 6) were analyzed for significant enrichment with respect to KEGG Pathways. Significance was based on p-value <0.05

Discussion

When differentially expressed genes are identified in a two-class cancer vs control manner, the information about stage-specificity of differential expression is lost. By applying our protocol, this information is recovered and available for dissection. The top linear model genes and all the stage-specific differentially expressed genes (Table 10) were analyzed with respect to the existing literature.

Table 10 Stagewise effect sizes and significance of stage specific genes. The stagewise log foldchanges of differential expression of each candidate stage-specific gene in tumor samples relative to normal control samples are shown, along with significance values, and its inferred regulation status. In stage-IV, only the top 10 genes are shown. The stage-specificity of the genes are emphasized

Top genes of linear models

Three C-type lectin domain proteins (CLEC4M, CLEC1B, CLEC4G) were detected in the top ten genes of linear model given by eqn. (1). Interestingly, this identical cluster of three genes was detected as the most significantly downregulated liver cancer-specific genes in a qPCR study of an independent cohort of 65 tumor-normal matched cases [21]. On screening the top 200 linear model (1) genes against cancer driver genes in the Cancer Gene Census, only four genes were found, namely BUB1B, CDKN2A, EZH2, and RECQL4. The top 200 genes of the linear model given by eqn. (2) overlapped with 111 genes of linear model (1) and yielded six genes from the Cancer Gene Census, namely BUB1B, EZH2, CDKN2C, CANT1, POLD1, and STIL. Both CDKN2A and CDKN2C are cyclin-dependent kinase inhibitors. CDKN2A was a member of the gene signatures for HCC prognosis independently proposed by Gillet et al. [16] and Yang et al. [59]. It was remarkable that GABRD stood out as the top gene in both the linear models, and with a monotonic order of expression with the cancer stage. GABRD is discussed further in the section on Stage-IV specific genes. A gene with a monotonicity of expression may be increasingly upregulated as the cancer initiates, progresses and metastasizes, signalling its oncogenic progression; or conversely, it may be increasingly downregulated with the cancer stages, signalling the loss of tumor suppressor activity. Screening the top 200 genes with monotonic expression against the Cancer Gene Census yielded a completely different set of six genes: HSP90AB1, ALDH2, ESR1, PPP2R1A, HIST1H4I, SEPT5. HSP90AB1, a heat shock protein and molecular chaperone, was a key result of Xu et al. [56] where it played a dual role, one in the set of 50 hub genes correlated with Barcelona Clinic Liver Cancer (BCLC) staging of HCC patients, and another, in the set of 13 hub genes correlated with overall survival of HCC patients. HSP90AB1 might have a significant role in the aetiology of HCC, given that its expression is known to be upregulated by hepatitis B virus encoded X protein [31]. The monotonic changes in HSP90AB1 might further facilitate its known roles in angiogenesis [19]. The top 200 genes with monotonic expression had 15 genes in common with the top 200 of linear model (1) and 16 genes in common with the top 200 of linear model (2). However, only six genes were common to the top 200 of all three (namely GABRD, PIGU, NDUFA4L2, CRHBP, FLVCR1, TTC13; Fig. 11). NDUFA4L2 has been identified as a target gene of HIF-1 (hypoxia-inducible transcription factor-1), and a key factor driving the metabolic reprogramming in hypoxic micro-environments [46]. Our findings established that not only was NDUFA4L2 significantly overexpressed in HCC (as noted in [27]), but its overexpression follows a significant monotonic pattern across cancer stages, a much stronger statement that would support the role of NDUFA4L2 in driving HCC progression. Similarly, the expression of CRHBP has been recently shown to be negatively associated with the tumor size in HCC [55]. Our study provides a more quantitative account of the significant monotonic downregulation of CRHBP with the HCC stage. Two proteins of the glycosylphosphatidylinositol (GPI) anchoring system, PIGU and PIGC, were top genes with respect to significant monotonic expression (Table 6); of these, PIGU is a known bladder cancer oncogene [18].

Fig. 11
figure 11

Boxplot of top genes with monotonic expression. These six genes (GABRD, PIGU, NDUFA4L2, CRHBP, FLVCR1, TTC13) showed monotonic trends of expression across the cancer stages, and were topranked in both the linear models given by eqns. (1) and (2)

Stage-I specific DEGs (Fig. 12)

CA9 is a member of carbonic anhydrases, which are a large family of zinc metalloenzymes that catalyse the reversible hydration of carbon dioxide. Its expression in clear cell Renal carcinoma, but not in functional kidney cells has gained attention for its use as a pre-operative biomarker [30]. The WNT7B protein is part of the Wnt family, a family of secreted signalling proteins. Elevated WNT7B in pancreatic adenocarcinoma has been found to mediate anchorage independent growth [4]. Surprisingly, both CA9 and WNT7B are downregulated in HCC, most so in stage-I, contrary to their role in other cancers. A concrete interpretation of the role of these genes in HCC awaits appropriately designed experimental studies.

Fig. 12
figure 12

Boxplot of stage-I specific genes. It is seen that CA9 and WNT7B are both maximally downregulated in stage-I

It is pertinent to ask the following question here: which genes are essential for the initiation of HCC? Clearly these genes would be differentially expressed in stage I relative to control. All significantly differentially expressed genes with maximal |lfc| in stage-I would be the best candidates for genes involved in the initiation of HCC. These 122 genes are provided in the Supplementary File S3.

Stage-II specific DEGs (Fig. 13)

APOBEC3B, a DNA cytidine deaminase, is a known cancer driver gene in the Cancer Gene Census, but there are no literature reports of its stage-specificity in any cancer. It is known to account for half the mutational load in breast carcinoma, and its target sequence context was found to be highly mutated in Bladder, lung, cervix, neck, and head cancers as well [7]. Further studies have attributed specific hypermutation signatures across all cancers to the APOBEC family, including APOBEC3B [1]. Here APOBEC3B is upregulated, increasing its capacity to inflict the hypermutator phenotype, and highlighting an intriguing stage-specificity in its action. FAM186A polymorphisms have been reported in GWAS and SNP studies on colorectal cancer patients and shown to have a significant odds ratio in risk heritability [48]. FAM163A was a component of the 8-gene signature used for the risk stratification of HCC patients [39].

Fig. 13
figure 13

Boxplot of stage-II specific genes. It is seen that both APOBEC3B and FAM186A are maximally overexpressed in stage-II, the trend following an inverted U-shape

Stage-III specific DEGs (Fig. 14)

C12orf48, also known as PARI, participates in the homologous recombination pathway of DNA repair, and its overexpression has been reported in pancreatic cancer[35]. Further PARI was recently identified as a transcriptional target of FOXM1 [62], which is a well-validated upregulated gene in HCC [21]. DLG5 is a cell polarity gene and its downregulation has been implicated in the malignancy of breast [32], prostate [49] and bladder cancers [65]. It has been recently found that lower DLG5 expression is correlated with advanced stages of HCC and essential for invadopodium formation, an event critical to cancer metastasis [26]. It is surprising that our study has identified a stage-III specific upregulation in DLG5. Interestingly, evidence is emerging to lend support to our finding that DLG5 might be tumor-promoting. In a very recent review, Saito et al. [43] reinterpreted published results on cell polarity and cancer, and advanced an alternative perspective on the role of polarity regulators in cancer biology. They argued that both cellular and subcellular polarity would be regulated by DLG5 and related polarity proteins. Subcellular polarity might improve the cellular fitness for proliferation and stemness, thereby causing tumor promotion. Hence cell polarity regulation is anti-tumorigenic and subcellular polarity regulation is pro-tumorigenic, and our analysis has uncovered the pro-tumorigenic upregulated activity of DLG5. ECT2 encodes a guanine nucleotide exchange factor that remains elevated during the G2 and M phase in cellular mitosis. ECT2 is found to be upregulated in lung adenocarcinoma and lung squamous cell carcinoma [66], as well as in invasive breast cancer [52]. NCAPG2 is a component of the condensing II complex and involved in chromosome segregation during mitosis. NCAPG2 level were found to be increased in non-small cell lung cancer, and its over-expression was found to be correlated with lymph node metastasis, thus enabling the use of NCAPG2 as a poor prognostic biomarker in lung adenocarcinoma [61]. GNMT is a methyltransferase that catalyses conversion of S-adenosine methionine to s-adenosyl cysteine. In the absence of GNMT, S-adenosine methionine causes hypermethylation of DNA, which represses GNMT levels and is found in HCC samples [24]. This is an epigenetic mechanism for loss of function of tumor suppressors and our study here confirmed the downregulation of GNMT expression. PRR11 is found to be over-expressed in lungs, and its silencing using siRNA resulted in cell cycle arrest and apoptotic cell death, followed by decreased cell growth and viability [64]. A similar knock out experiment of PRR11 in hilar cholangiocarcinoma cell lines resulted in decreased cellular proliferation, migration, and tumor growth [9]. WDHD1 is a key post-transcriptional regulator of centromeric, and consequently genomic, integrity [23] and its overexpression has been identified as biomarker of acute myeloid leukemia [53], and lung and esophageal carcinomas [44]. C15orf42 has been implicated in nasopharyngeal carcinoma [3]. ORC6L overexpression has been identified as a prognostic biomarker of colorectal cancer possibly by enhancing chromosomal instability [54]. XRCC2 was found to increase locally advanced rectal cancer radioresistance by repairing DNA double-strand breaks and preventing cancer cell apoptosis [40]. XRCC2 was also highlighted in the gene signature for HCC prognosis advanced by Gillet et al. [16].

Fig. 14
figure 14

Boxplot of stage-III specific genes. Except for GNMT, the expression of stage-III specific genes show a peak in stage-III, with the expression trend following an inverted U-shape across the stages. The expression trend is convex and reversed for the downregulated GNMT, with minimum expression in stage-III

Stage-IV specific DEGs (Fig. 15)

GABRD, which was the top gene in the linear models as well, encodes for the delta subunit of the gamma-amino butyric acid receptor. The GABA receptor family was found to be frequently downregulated in cancers, except for GABRD, which was found to be up-regulated. Gross et al. [17] proposed that the GABA receptor gene family might play a role in the proliferation independent differentiation of cancer cells. GBX2 is part of the GBX gene family, which are homeobox containing DNA binding transcription factors. GBX2 is overexpressed in prostate cancer and studies show that expression of GBX2 is required for malignant growth of human prostate cancer [15]. PECAM1 overexpression has been linked to peritoneal recurrence of stage II/III gastric cancer patients [47]. CEND1 has been identified as a cell-cycle protein [50]. PGAM2 is a glycolytic enzyme whose upregulation is essential for tumor cell proliferation [57]. NR1I2 downregulation has been used in constructing a prognostic 9-genes expression signature of gastric cancer [51]. GDF5 has been shown to be a downstream target of the TGF-beta signaling pathway [33], stimulating angiogenesis required for the growth and spread of the cancer. GPR1 has been reported to be involved in promoting cutaneous squamous cell carcinoma migration [12]. Two other stage-IV specific genes, namely the downregulated CXCR2P1, which is a C-X-C motif chemokine receptor 2 pseudogene 1, and LOC25845, are minimally documented in the literature in the context of HCC, other cancers or any other condition. It is worth mentioning however that CXCR2, a member of the GPCR protein family binding the interleukin IL8, has been reported as an effective non-invasive blood based biomarker for HCC [45]. It is notable that ARHGAP42, a Rho GTPase activating protein, was another key result of Xu et al. [56], finding a place both in their set of 50 hub genes correlated with the BCLC staging of HCC patients, and in the set of 13 hub genes correlated with overall survival of HCC patients.Most of the stage-IV specific genes show contra-regulation (i.e, no clear trend) across cancer stages, and only 15 of the 35 genes revealed a monotonic pattern of expression (highlighted in Table 8). The other 20 genes could be unique to the hallmarks of stage-IV cancer, e.g., processes related to lymph node involvement and/or metastasis.

Fig. 15
figure 15

Boxplot of top 10 stage-IV specific genes. All genes, except NR1I2 and CXCR2P1, show a smooth increasing expression trend reaching peak expression in stage-IV. In the case of NR1I2 and CXC2RP1, the trend is reversed, with the expression decreasing smoothly to touch the minimum in stage-IV

Conclusion

We have developed an original protocol for the stagewise dissection of the HCC transcriptome. We were able to successfully fit a linear model across cancer stages and detected genes with a strong linear expression trend in the cancer phenotype. These genes were found to effectively separate the control and cancer samples. We were able to assign 2455 differentially expressed genes into one of four stages and visualized their stage specific expression using boxplots. Using a multi-layered approach, we were able to assess the significance of each stage-specific DEG and narrowed down to a handful of candidate significant stage-specific DEG’s. Our analysis yielded two stage-I specific genes (CA9, WNT7B), two stage-II specific genes (APOBEC3B, FAM186A), ten stage-III specific genes (including DLG5, NCAPG2, GNMT and XRCC2) and 35 stage-IV specific genes (including GABRD and CXCR2P1). Though most of these genes constituted novel findings in the context of HCC, a comprehensive literature search indicated connections with other cancer conditions. The analysis of monotonicity of expression has uncovered two genes with documented HCC connection, namely NDUFA4L2 and CRHBP. Correlation of our analysis with gene signatures based on the BCLC staging system revealed two common genes, namely HSP90AB1 and ARHGAP42. Our study might deepen our understanding of the mechanistic basis of HCC progression, and lay the foundation for the development of HCC diagnosis and treatment strategies. Translational research could transform our results into a panel of biomarkers for early clinical decision-making and rational drug development. It is straightforward to extend our computational methodology to the stage-based analysis of other cancers to obtain a fuller view of disease initiation, progression, and metastasis.