Background

Hepatocellular carcinoma (HCC) is a highly prevalent, treatment-resistant malignancy with a multifaceted molecular pathogenesis and is also one of the most common cancers and the third most common cause of death by cancer worldwide [1]. Geographic areas with the highest prevalence are located in Africa and Eastern Asia, likely due to the increasing prevalence of chronic hepatitis B or C. The incidence of HCC is also rising in the US. According to the American Cancer Society, an estimated 39,230 people were diagnosed with HCC in the United States during 2016 and about 27,170 people will die of the cancer [2].

Although significant effort has been directed toward the improvement of surgical and medical treatment, the prognosis for patients with advanced stages of HCC remains poor [3]. To improve diagnosis and treatment efficiency, a further understanding of molecular mechanisms of HCC progression is urgently needed.

Several research strategies, high-throughput genomic microarray in particular, have been used to investigate the molecular characteristics of HCC. Many molecular biomarkers with aberrant expression have been identified in HCC tissues, including NDRG1, Importin-α1, FOXP1, and PTPN12. Although these findings have greatly improved diagnostic and therapeutic strategies, some limitations remain. First, because of the large number of mRNA and proteins in the profiles, specific targets are difficult to identify. Second, changes at the mRNA level are not always consistent with those at the protein level, suggesting that a high level of background noise may exist. Third, traditional microarray methods identify mRNAs that are differentially expressed between normal vs cancer, but don’t consider the relationship between coding biomarkers and non-coding RNA, which are likely to be etiologically important.

Long noncoding RNAs (lncRNAs) are a class of noncoding RNA transcripts longer than 200 nucleotides that were previously believed to represent transcriptional noise. However, they have recently been identified as functional molecules. Emerging studies show that lncRNA play important roles in basic biology, ranging from transcriptional and post-transcriptional regulation to the control of cell cycle distribution, imprinting control, cell differentiation and tumorigenesis. Several lncRNAs have been reported to regulate the development of HCC and targets include HOXA transcription at the distal tip (HOTTIP) [4, 5], MEG3 maternally expressed 3 (MEG3) [6,7,8], highly upregulated in liver cancer (HULC) [9, 10], and ZNFX1 Antisense RNA 1 (ZFAS1) [11,12,13].

We hypothesized that combining mRNA expression profiles from traditional microarray methods and lncRNA expression profiles from lncRNA arrays may help identify a subset of candidate lncRNA-related and -coexpressed mRNA biomarkers with higher confidence and reliability. We defined lncRNA-related and -coexpressed mRNA biomarkers as mRNA biomarkers that are coexpressed with and have positional relationships with adjacent lncRNAs. We selected mRNA biomarkers based on two criteria: 1) differential expression between normal and cancer tissue samples; 2) a positional relationship to lncRNAs that are differentially expressed between normal and cancer samples.

In this paper, we first identify 3543 mRNA biomarkers differentially expressed between normal and cancer samples using t statistics and qvalues. Then we overlapped these transcripts with the results from lncRNA arrays to identify 41 lncRNA-related and -coexpressed mRNA biomarkers. Next, we performed network analysis, functional analysis and pathway analysis for the 41 lncRNA-associated mRNA biomarkers. Finally, we used an independent dataset and a SVM model to validate the prediction performance of the 41 identified lncRNA-related and -coexpressed mRNA biomarkers. Our results showed that combined microarray and lncRNA analysis improved biomarker discovery for the treatment of HCC.

Methods

lncRNA and mRNA microarray expression profiling

We downloaded Arraystar human lncRNA and mRNA microarray data for a cohort of 16 Human HCC samples and 16 adjacent non-tumor samples from GEO (GSE58043, GSE89186, GSE64631, and GSE55191, http://www.ncbi.nlm.nih.gov/geo) [14,15,16]. The Arraystar human lncRNA and mRNA microarray data included five positional relationship between lncRNA and mRNA: 1) “exon sense-overlapping”: the lncRNA’s exon is overlapping a coding transcript exon on the same genomic strand; 2) “intron sense-overlapping”: the lncRNA is overlapping the intron of a coding transcript on the same genomic strand; 3) “intronic antisense”: the lncRNA is overlapping the intron of a coding transcript on the antisense strand; 4) “natural antisense”: the lncRNA is transcribed from the antisense strand and overlapping with a coding transcript; and 5) “bidirectional”: the lncRNA is oriented head to head to a coding transcript within 1000 bp. If the Arraystar human lncRNA and mRNA microarray data showed no overlapping or bidirectional coding transcripts nearby the lncRNA, we defined the relationship between mRNA and lncRNA as “intergenic”.

Statistical analysis

We first performed a Box-Cox Power Transformation [17] using a powerTransform function (car package in R 3.4.0) to make the distribution of each mRNA and lncRNA in each sample approximately normal.

We used a two-sample, two-sided t-test [18, 19] to determine whether there was no difference between the mean of gene expression in HCC samples and that in normal samples. The null hypotheses was

$$ {H}_0:{\mu}_T={\mu}_N $$
(1)

where μT is the mean of gene expression in HCC samples, and μN is the mean of gene expression in normal samples. This null hypothesis was tested against the following alternative hypothesis:

$$ {H}_1:{\mu}_T\ne {\mu}_N $$
(2)

P values were determined by Welch’s t-test. Qvalues were adjusted for false discovery rate control using qvalue package from Bioconductor.

Pathway analysis

Network analyses were generated through the use of Ingenuity Pathway Analysis (IPA, Redwood City, CA, USA). The top scoring network of interactions was presented for the concurrent under-expressed and the concurrent over-expressed genes. This software analyzes molecular signatures in the context of known biological response and regulatory networks as well as canonical pathways.

DAVID functional analyses were used to identify biological functions that were most significantly enriched with expression changes [20].

Interpretation of biological pathways was conducted with the database we developed: Integrated Pathway Analysis Database (IPAD) (http://fzhang.w3.uvm.edu/ipad/) [21]. The enrichment scores used to select significant pathways were defined by p-value.

Performance measurement

We used the following five measurements for our evaluation: (1) Sensitivity (2) Specificity, (3) Precision, (4) Accuracy, and (5) Area Under the Curve.

$$ Sensitivity=\frac{true\ positive}{true\ positive+ false\ negative} $$
$$ Specificity=\frac{true\ negative}{true\ positive+ false\ positive} $$
$$ Precision=\frac{true\ positive}{true\ positive+ false\ positive} $$
$$ Accuracy=\frac{true\ positive+ true\ negative}{true\ positive+ true\ negative+ false\ positive+ false\ negative} $$

Results

The Arraystar Human lncRNA and mRNA microarray profiles contained two states (tumor vs. non-tumor) with 16 samples corresponding to each state. We obtained 3543 significantly differentially expressed mRNA biomarkers (mapped to 1932 genes) with qvalue < 0.05, among which 2066 (946 genes) were over-expressed and 1477 (986 genes) were under-expressed in HCC tumor.

Forty one mRNA biomarkers (Table 1) met our criteria of (1) being differentially expressed between 16 normal and 16 cancer samples with qvalue < 0.05; and 2) being positionally related to lncRNA which were differentially expressed between 16 normal and 16 cancer samples with qvalue < 0.05. The information corresponding to the positional relationships of mRNA/miRNA and lncRNAs were identified to predict the role of lncRNAs in regulating nearby genes. The positional relationship included exon sense-overlapping (7), natural antisense (16), bidirectional (9), intronic antisense(6), intron sense-overlapping (2), and sense overlap (1) (Table 2). Network analyses were performed with Ingenuity Pathway Analysis and the top four networks were identified (Table 3 and Fig. 1). We identified four networks: 1) Endocrine System Development and Function, Molecular Transport, Small Molecule Biochemistry; 2) Immunological Disease, Inflammatory Disease, Inflammatory Response; 3) Amino Acid Metabolism, Molecular Transport, Small Molecule Biochemistry; and 4) Metabolic Disease, Developmental Disorder, Hereditary Disorder. Gene ontology analysis with DAVID described the biological processes of the 41 mRNA biomarkers (Fig. 2a and b). Pathway analysis were generated using the IPAD [21] (Table 4). Pathways linked with the 41 mRNA biomarkers included Metabolism, Hemostasis, Cell Cycle, Signaling, Disease, Immune system, and Gene Expression, which are consistent with previous results we found [22,23,24,25].

Table 1 Forty one mRNA biomarkers with relationship to coexpressed lncRNAs (qvalue< 0.05 for mRNA and qvalue < 0.05 for lncRNA)
Table 2 Statistics for relationships
Table 3 Top networks involved
Fig. 1
figure 1

The 41 mRNAs biomarkers involved in Endocrine System Development and Function, Molecular Transport, Small Molecule Biochemistry. Red stands for over-expressed and green for under-expressed

Fig. 2
figure 2

Gene ontology biological processes enrichment analysis for 41 mRNA biomarkers. The GO terms were categorized into (a) biological processes at level 2 and (b) biological processes at level 3

Table 4 Pathway analysis for the 41 mRNA biomarkers

In order to validate the 41 mRNA biomarkers, we built a five-fold cross-validation Support Vector Machine (SVM) model based on all the 32 samples using a radius basis function kernels function. We achieved high prediction performance (AUC = 0.996, precision = 100%, accuracy = 96.9%, sensitivity = 93.8%, specificity = 100%). Further we randomly divided the 16 HCC samples and 16 Normal samples into two groups: training set and testing set. Each group contains 8 HCC samples and 8 Normal samples. We used the testing test to assess the performance of the trained 41 mRNA biomarkers. The testing set was blind and no data from the testing set were used for identification of the 41 mRNA biomarkers and development of the SVM model. We obtained high performances: for the training set (AUC = 1.0, precision = 100%, accuracy = 93.8%, sensitivity = 87.5%, specificity = 100%) and for testing set (AUC = 0.984, precision = 100%, accuracy = 93.8%, sensitivity = 87.5%, specificity = 100%) (Table 5).

Table 5 Validation with SVM for the 41 mRNA biomarkers

In order to validate the prediction performance of the 41 mRNA biomarkers for cross-platform dataset, we downloaded from GEO a RNA-seq dataset (GSE94660 [26]) which contains 21 HCC samples and 21 Non-neoplastic liver samples. We used the 38 genes as variables. The gene expression data in training microarray first was averaged by the 38 genes and then was normalized to 0–1 range by a min-max transformation function: y = (x-min)/(max-min). After normalization, a SVM model with five-fold cross-validation was used for learning the training set. The Reads Per Kilobase Million (RPKM) data in testing RNA-seq containing the 38 genes was normalized using the min-max transformation and used as blind testing set. The SVM model achieved high performances (AUC = 0.824, precision = 83.3%, accuracy = 78.6%, sensitivity = 71.4%, specificity = 85.7%). The results showed that lncRNA-related and -coexpressed mRNA biomarkers had high prediction accuracy within the training and testing sets.

Discussion

The top four networks we identified were similar to previously reported results [27,28,29]. For example, De et al. used high-density oligoarrays to identify consistent differences in gene-expression between HCC and normal liver tissue. Their network analysis of differentially expressed genes classified cellular and biological functions related to regulation of gene expression and post-translational modification in HCV-related primary HCC. These included Cellular Growth and Proliferation and Cell-To-Cell Signaling and Interaction in HCV-related non HCC samples; Cellular Growth and Proliferation and Cell Cycle in metastasis [30]. Xu et al. combined Chromatin immunoprecipitation (ChIP) on chip along with gene expression microarrays to create a genome-wide scale map of TFCP2 targets as well as the molecular function and pathways regulated by TFCP2 in HCC. They found that TFCP2-ChIP targets in SK-HEP-1 were functionally associated with cancer, cell movement, cell cycle, cell-to-cell signaling and interaction, cellular growth and proliferation [28]. Das et al. performed gene expression profiling between two groups of patients with HCV: one with HCC recurrence and second without recurrent HCC and revealed 194 differentially regulated genes between the two groups. They found that under-expressed genes were associated not only with HCC recurrence, but also with regulation of the innate immune response, cell-to-cell signaling and interaction, and the inflammatory response [29].

The Signaling, Disease, Metabolism, Cell Cycle, Immune system, and Gene Expression pathways linked with the 41 mRNA biomarkers were also reported in previous findings [22,23,24,25]. For example, two main pathogenic mechanisms were involved during hepatocarcinogenesis: (1) cirrhosis associated with hepatic regeneration after tissue damage caused by hepatitis infection, toxins or metabolic influences, and (2) mutations occurring in single or multiple oncogenes or tumor suppressor genes. Both mechanisms were linked in several important cellular signaling pathways. These signal pathways are of interest from a therapeutic perspective, because targeting them might help to reverse, delay or prevent tumorigenesis [24]. Numerous signaling modules including some related to growth factor signaling (e.g., IGF, EGF, PDGF, FGF, HGF), cell differentiation (WNT, Hedgehog, Notch), and angiogenesis (VEGF) have become a major source of targets for novel therapies in HCC. Different molecular mechanisms have been shown to induce aberrant pathway activation, such as point mutations, chromosomal aberrations, and epigenetically driven down-regulation [25]. Huang et al. investigated the role of EGF-EGFR signaling pathway in the development of human hepatocellular carcinoma (HCC) inflammatory environment by measuring the gene profiles of inflammatory cytokines from HCC. They found that HCC proliferation, metastasis and production of inflammatory cytokines were regulated via EGF-EGFR signaling pathways, which represent potential therapeutic targets for HCC [23].

Some biomarkers and their association with HCC already have been reported. For example, CDK4 (Cyclin Dependent Kinase 4) has been implicated in a number of cancer types. Jin et al. demonstrated the activation of cdk4 triggers and inhibitors of cdk4 for the prevention/treatment of Non-alcoholic Fatty Liver Disease [31]. Secreted phosphoprotein-1 (SPP1) was found to be overexpressed in metastatic hepatocellular carcinoma (HCC), and had potential to act as both a diagnostic marker and a therapeutic target for HCC [32]. Li et al. found that downregulation of Superoxide Dismutase 1 (SOD1) was correlated with histopathological grading and might be a good candidate gene for HCC [33].

LncRNA/mRNA expression profiling has been widely used for biomarker discovery of cancers, for example, liver cancer, gastric cancer, bladder cancer, colon cancer, pancreatic cancer, laryngeal cancer, and colorectal cancer. This is because LncRNAs is important in identifying biomarkers for various human cancers. Unraveling the co-expression pattern between mRNAs and lncRNAs can further help researchers better understand the mechanism of various human cancers. The lncRNA-related and -coexpressed method to detect mRNA biomarkers we presented in the study can work not only for HCC but also for all other human cancers.

The advantage of the lncRNA-related and -coexpressed method to detect mRNA biomarkers is that we consider the co-expression between mRNA and lncRNA, filter out some unimportant mRNAs and lncRNAs by setting significant threshold, and focus on the most important mRNAs and lncRNAs and their coexpressed networks. Compared with the original findings from the datasets (GSE58043, GSE89186, GSE64631, and GSE55191, http://www.ncbi.nlm.nih.gov/geo) [14,15,16], it shows that we found the Endocrine System Development and Function as top network associated with liver cancer. This finding is consistent with the fact that liver abnormalities have strong association with endocrine diseases [34, 35]. It would be difficult to discover the Endocrine System Development and Function as top network without using lncRNAs’ coexpression as one kind of filter. As shown in the result section, there are 3543 significantly differentially expressed mRNA biomarkers at first. It is the lncRNA coexpression filter in our method that shortens the number of biomarker candidates and unravels the final 41 mRNAs (38 genes) which functionally link to the Endocrine System Development and Function.

There is a limitation of the study with small sample size. We have tried our best to collect all the four available LncRNA/mRNA expression profiling related to HCC [14,15,16]. In the future, we will continue to collect HCC LncRNA/mRNA expression profiling data with our collaborators.

Conclusion

We developed a lncRNA-related and -coexpressed method to detect mRNA biomarkers with HCC. Top networks such as “Endocrine System Development and Function, Molecular Transport, Small Molecule Biochemistry” and enriched pathways such as Cell Cycle, Signaling, Metabolism, and Immune System were also discovered. Unraveling these intricate networks and pathways is essential to understanding the biological mechanisms of HCC development and progression. Our method has the potential to provide a basis for biomarker identification in HCC or other diseases.