Background

Lung cancer remains the leading cause of cancer death worldwide [1]. Non-small cell lung cancer (NSCLC) is the most frequent type of lung cancer and includes lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large cell carcinoma (LCC) [2]. LUSC represents a major public health issue, accounting for 27% of all lung cancers, and exhibits distinct epidemiological, clinicopathological and molecular characteristics [3]. However, effective biomarkers for early detection, for identifying patients at high risk of recurrence or death, and for selecting targeted therapies are still lacking. Thus, identifying effective prognostic biomarkers is critical for the diagnosis and treatment of LUSC patients.

Long non-coding RNAs (lncRNAs) regulate gene transcription and are implicated in diverse biological processes. As lncRNAs have been investigated in cancer research, alterations of the lncRNA landscape [4] and roles of lncRNAs as drivers of tumor suppression and oncogenesis have been identified [5]. Moreover, circulating lncRNAs have been detected in patient blood samples and proposed as novel plasma biomarkers for predicting NSCLC [6, 7]. This suggests that lncRNAs may serve as non-invasive biomarkers for lung cancer.

Although a number of lncRNAs have been identified for predicting outcomes in NSCLC [5], the prognostic value of a single candidate lncRNA biomarker is limited. This may be due to small sample sizes as well as inconsistent sample collection and detection methods in previous studies. Identifying lncRNA expression signatures that are associated with patient survival in standard clinical samples may lead to the discovery of molecular drug subclasses and potential drug targets. Several prognostic gene expression signatures have been published for NSCLC [8,9,10], but none of these studies included lncRNAs in a large cohort to identify and assess the prognostic value of lncRNA biomarkers for LUSC patients. Moreover, the molecular characteristics [11] and prognosis patterns differ between LUSC and LUAD. We therefore focused on an lncRNA survival signature for LUSC, which has not previously been reported.

We applied a survival-associated risk-score formula to identify a novel 7-lncRNA prognostic signature from the TCGA dataset of 388 LUSC patient samples. To show the robustness of this signature, the specificity and sensitivity of our model were tested by area under the ROC curve (AUROC) analysis.

Methods

Datasets

LncRNA RNA-seq data (HTSeq-FPKM-UQ) for 504 LUSC patients were obtained from the publicly available Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov/). Corresponding clinical data, including age, gender, smoking history and TNM stage, were also obtained and assessed. Patients with incomplete clinical data or an overall survival (OS) of less than 1 month were excluded from this analysis. After exclusions, a total of 388 LUSC patients were enrolled in the development of our model. The 388 LUSC patients were randomly divided into a training set (n = 194) and a testing set (n = 194). Patient IDs in both training and testing sets are shown in Additional file 1: Table S1. The training set was used to identify the lncRNA expression signature, and the testing set was used for further validation.
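The random 1:1 split described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the placeholder patient IDs, the seed and the function name `split_cohort` are hypothetical, since the text does not report how the randomization was implemented.

```python
import random

def split_cohort(patient_ids, n_train, seed=0):
    """Randomly split patient IDs into a training set and a testing set."""
    ids = sorted(patient_ids)   # fixed starting order so the split is reproducible
    rng = random.Random(seed)   # hypothetical seed; none is reported in the text
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# Placeholder IDs standing in for the 388 retained LUSC patients.
patients = [f"PATIENT-{i:03d}" for i in range(388)]
train_ids, test_ids = split_cohort(patients, n_train=194)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the partition reproducible, which matters here because the median cutoff derived from the training set is later reused on the testing set.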

Identification of differentially expressed lncRNAs in LUSC

All analyses were performed using R version 3.3.0. To identify lncRNAs suitable for subsequent survival analyses, we utilized the trimmed mean of M values method for normalization and differential expression analysis using the edgeR package from Bioconductor [12, 13]. The parameters for screening the expression difference of lncRNAs were padj <0.01 and |log2FoldChange| > 2.
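The screening thresholds above amount to a simple filter on the differential expression results. The sketch below assumes each result row is a dictionary with `lncRNA`, `log2FoldChange` and `padj` fields; these field names, the example rows and the function name `select_de_lncrnas` are hypothetical and do not reproduce the edgeR output format.

```python
def select_de_lncrnas(results, padj_max=0.01, lfc_min=2.0):
    """Keep lncRNAs with padj < padj_max and |log2FoldChange| > lfc_min."""
    return [
        row["lncRNA"]
        for row in results
        if row["padj"] < padj_max and abs(row["log2FoldChange"]) > lfc_min
    ]

# Hypothetical rows; real values would come from the differential expression step.
example = [
    {"lncRNA": "lnc_A", "log2FoldChange": 3.1, "padj": 0.0004},
    {"lncRNA": "lnc_B", "log2FoldChange": -2.6, "padj": 0.002},
    {"lncRNA": "lnc_C", "log2FoldChange": 1.4, "padj": 0.0001},  # fold change too small
    {"lncRNA": "lnc_D", "log2FoldChange": -4.0, "padj": 0.03},   # not significant
]
selected = select_de_lncrnas(example)
print(selected)
```

Taking the absolute value of the fold change retains both up-regulated and down-regulated lncRNAs.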

Cox regression analysis

First, the RNA-seq expression values were log2-transformed to normalize the data. The association between lncRNA expression and patient survival was determined by univariate Cox regression analysis using the survival R package from CRAN [14]. LncRNAs with a p-value < 0.01 in the univariate analysis were retained as candidate lncRNAs associated with OS. The Cox proportional hazards model was then applied in a multivariate analysis to identify covariates with independent prognostic value. The best model was selected based on the Akaike Information Criterion (AIC) [15], which allowed for the determination of the best trade-off between the complexity of a model and its goodness of fit.
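For reference, the standard definition of the AIC is given below. The symbols are introduced here for illustration and do not appear in the original text: k is the number of estimated coefficients (here, the number of lncRNAs in the model) and \hat{L} is the maximized likelihood, for which a Cox model substitutes the partial likelihood.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Among candidate models, the one with the lowest AIC is preferred: each additional lncRNA raises the penalty term 2k and is retained only if it improves the fit by more than that penalty.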

Risk score and survival curve

A mathematical formula (Risk score = 0.052*LINC01412 - 0.047*RP11-277P12.9 - 0.051*RP11-60H5.1 + 0.066*RP11-697M17.2 + 0.034*RP11-897M7.1 + 0.050*CTB-43E15.2 + 0.036*RP11-366H4.1) was developed to predict the risk score for each patient based on the multivariate Cox regression analysis. Patients were divided into low-risk and high-risk groups according to the median risk score. Subsequently, the log-rank test was used to determine the differences in survival. A Kaplan-Meier overall survival curve of the two groups was plotted and the hazard ratio was calculated. Cox multivariate analysis was also used to test whether the risk score was independent of clinical parameters such as age, gender, smoking history and tumor stage. The prognostic performance was measured using the survivalROC R package from CRAN [16].
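The risk score is a weighted sum of expression values, and the grouping is a threshold on that sum. The sketch below uses the seven coefficients as stated in the Results; the patient expression values are hypothetical, and treating a score exactly equal to the cutoff as low-risk is an assumption, since the text does not say how ties were handled.

```python
from statistics import median

# Coefficients of the 7-lncRNA model as stated in the Results.
COEFFICIENTS = {
    "LINC01412": 0.052,
    "RP11-277P12.9": -0.047,
    "RP11-60H5.1": -0.051,
    "RP11-697M17.2": 0.066,
    "RP11-897M7.1": 0.034,
    "CTB-43E15.2": 0.050,
    "RP11-366H4.1": 0.036,
}

def risk_score(expression):
    """Weighted sum of (log2) expression values using the Cox coefficients."""
    return sum(coef * expression[name] for name, coef in COEFFICIENTS.items())

def assign_groups(scores, cutoff=None):
    """Label patients high- or low-risk; the cutoff defaults to the cohort median."""
    if cutoff is None:
        cutoff = median(scores.values())
    # Assumption: scores equal to the cutoff are labeled low-risk.
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Hypothetical patients whose seven lncRNAs all share one expression value.
cohort = {
    "P1": {name: 8.0 for name in COEFFICIENTS},
    "P2": {name: 4.0 for name in COEFFICIENTS},
    "P3": {name: 12.0 for name in COEFFICIENTS},
    "P4": {name: 6.0 for name in COEFFICIENTS},
}
scores = {pid: risk_score(expr) for pid, expr in cohort.items()}
groups = assign_groups(scores)
print(groups)
```

Passing an explicit `cutoff` allows a threshold learned on one cohort to be applied unchanged to another, which is how a training-set median can be reused for validation.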

In silico functional pathways analysis

We examined the correlation between the expression level of the seven lncRNAs and each protein coding gene (PCG) using two-sided Pearson correlation coefficients and the z-test [17]. The PCGs positively or negatively correlated with the seven lncRNAs were considered as lncRNA-related PCGs (|Pearson correlation coefficient| > 0.4 and P-value < 0.01). Gene ontology (GO) enrichment analysis of lncRNA-related PCGs was performed using the DAVID online tool (https://david.ncifcrf.gov/) [18]. GO terms with P-values < 0.05 were considered significantly enriched functions of the prognostic lncRNAs. Significant GO terms with similar functions were organized into an interaction network and visualized using the Enrichment Map plugin for Cytoscape 3.2.1 (http://baderlab.org/Software/EnrichmentMap/) [19].
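The co-expression criterion can be sketched as a Pearson correlation followed by a significance test. The text does not specify the exact form of the z-test, so the sketch below assumes the common Fisher z-transformation; the function names and the example expression vectors are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_z_pvalue(r, n):
    """Two-sided p-value for H0: no correlation, via Fisher's z (assumed test)."""
    if abs(r) >= 1.0:
        return 0.0  # atanh is undefined at +/-1; a perfect correlation gives p = 0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def is_related(lnc_expr, pcg_expr, r_min=0.4, p_max=0.01):
    """Apply the |r| > 0.4 and P < 0.01 thresholds from the text."""
    r = pearson_r(lnc_expr, pcg_expr)
    return abs(r) > r_min and fisher_z_pvalue(r, len(lnc_expr)) < p_max

# Hypothetical expression vectors across ten samples.
lnc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gene_coexpressed = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1, 9.2, 9.9]
gene_unrelated = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
print(is_related(lnc, gene_coexpressed), is_related(lnc, gene_unrelated))
```

Both thresholds are needed: with several hundred samples, even a weak correlation can reach P < 0.01, so the |r| > 0.4 cutoff is what restricts the list to strongly co-expressed genes.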

Results

Patient characteristics

According to the defined criteria, RNA-seq expression profiles and clinical data [20] for a total of 388 LUSC patients were downloaded from the GDC data portal. The clinical covariates of the patients and tumors in both the training and testing sets are shown in Table 1. Of the 388 patients, 183 had Stage I disease, 130 had Stage II, 69 had Stage III and 6 had Stage IV disease. For subsequent model development, we randomly divided all the patients into a training set (n = 194) and a testing set (n = 194) as previously reported [21, 22]. There was no significant difference in the clinical covariates between the two sets (P > 0.05) (Table 1).

Table 1 Clinical covariates in the training and testing sets

Differentially expressed lncRNAs in LUSC patients

A total of 1414 lncRNAs were found to be differentially expressed between LUSC and normal lung tissues, and were used for survival analyses (Additional file 2: Table S2). To identify the lncRNAs associated with patient survival in LUSC, univariate Cox regression analysis was performed on the expression data of all of these lncRNAs [23]. With a significance cutoff of 0.01, a set of 16 lncRNAs was selected (Additional file 3: Table S3). These lncRNAs were used in stepwise multivariate Cox regression analysis and, finally, seven lncRNAs (LINC01412, RP11-277P12.9, RP11-60H5.1, RP11-697M17.2, RP11-897M7.1, CTB-43E15.2 and RP11-366H4.1) were identified (Fig. 1). We then calculated the risk score of each patient from the seven lncRNAs [24]. The risk score formula for our model is listed in Table 2 (Risk score = 0.052*LINC01412 - 0.047*RP11-277P12.9 - 0.051*RP11-60H5.1 + 0.066*RP11-697M17.2 + 0.034*RP11-897M7.1 + 0.050*CTB-43E15.2 + 0.036*RP11-366H4.1). Of these seven lncRNAs, five were associated with high risk (LINC01412, RP11-697M17.2, RP11-897M7.1, CTB-43E15.2, RP11-366H4.1, Coef > 0) and two were shown to be protective (RP11-277P12.9, RP11-60H5.1, Coef < 0) (Fig. 1).

Fig. 1 The expression heatmap of the seven prognostic lncRNAs. The expression pattern of the seven prognostic lncRNAs is correlated with patient risk scores

Table 2 7-lncRNA risk score model

The development of the 7-lncRNA prognostic model

We divided the patients into high-risk and low-risk groups according to the median risk score (value = 0.909) calculated from the expression levels of the seven lncRNAs. The log-rank test was used to determine the survival differences. As depicted in Fig. 2a, Kaplan-Meier curves indicated that the high-risk group was correlated with poor prognosis in the training set (p < 0.0001). ROC curves indicated that the AUC of the 7-lncRNA signature was 0.694 in the training set (Fig. 2b), which showed that the 7-lncRNA signature had a high specificity and sensitivity in predicting the overall survival time of LUSC patients.
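The survival curves compared here are Kaplan-Meier estimates, which can be sketched as follows. This is an illustrative pure-Python implementation, not the routine from the R survival package used in the study; the follow-up times and event indicators in the example are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability) steps.

    times  : follow-up time of each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical follow-up in months for five patients (1 = death, 0 = censored).
curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
print(curve)
```

Censored patients never trigger a drop in the curve, but they shrink the risk set, so later deaths produce larger steps. Computing one such curve per risk group gives the two curves that the log-rank test compares.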

Fig. 2 Kaplan-Meier and ROC curves for the 7-lncRNA signature in the training set. a The differences between the high-risk (n = 97) and low-risk (n = 97) groups were determined by the log-rank test (p < 0.0001). Five-year overall survival was 36.4% (95% CI: 25.5%-52.1%) and 65.3% (95% CI: 53.7%-79.4%) for the high-risk and low-risk groups, respectively. b The area under the ROC curve of the 7-lncRNA model was 0.694

To validate the prognostic power of the 7-lncRNA model, the log-rank test was performed in the testing set. Patients in the validation set were divided into low-risk and high-risk groups according to the median risk score of the training set (value = 0.909). As in the training set, statistically significant differences (P < 0.05) between the low-risk group and the high-risk group were observed (Fig. 3a), indicating that our 7-lncRNA signature is suitable for the prediction of LUSC patient survival.

Fig. 3 Kaplan-Meier and ROC curves for the 7-lncRNA signature in the validation set. a The differences between the high-risk (n = 103) and low-risk (n = 91) groups were determined by the log-rank test (p < 0.0001). Five-year overall survival was 36.8% (95% CI: 26.1%-51.8%) and 61.9% (95% CI: 51.4%-74.6%) for the high-risk and low-risk groups, respectively. b The area under the ROC curve of the 7-lncRNA model was 0.685

To verify whether the 7-lncRNA model could stratify risk across all LUSC patients when other potential prognostic factors were taken into account, a multivariate analysis was performed to evaluate the independent prognostic value of the model. The results indicated that, alongside clinical covariates such as age, gender and TNM stage, the 7-lncRNA signature served as a strong independent predictor of LUSC overall survival (high-risk, HR: 2.822, 95% CI 2.026–3.929, p < 0.0001, Table 3).

Table 3 Multivariable Cox proportional hazards analyses

Functional enrichment analysis of pathways correlated with the prognostic lncRNAs in LUSC

After measuring the correlation between the expression of the lncRNAs in our model and that of the PCGs, we found 444 genes co-expressed with at least one of the seven lncRNAs (|Pearson correlation coefficient| > 0.4 and P-value < 0.01). In the GO enrichment analysis (Additional file 4: Table S4), the 444 PCGs clustered most significantly in the categories major histocompatibility complex (GO:0042613, GO:0032395, GO:0023026) and membrane (GO:0005886, GO:0016021, GO:0009897, GO:0030666) (Fig. 4). These results suggest that the lncRNAs of the signature may regulate genes that affect the adaptive immune system and the function of the cell membrane.

Fig. 4 Functional enrichment analysis of PCGs correlated with prognostic lncRNAs. The PCGs correlated with the lncRNAs of the signature were enriched in immune response and cell membrane function

Discussion

Increasing evidence reveals that lncRNAs play crucial roles in the tumorigenesis and progression of lung cancer. Although several studies have identified a number of lncRNAs with prognostic value in NSCLC, no studies have focused on and analyzed the expression of lncRNAs in LUSC. Moreover, because LUSC has distinct molecular characteristics [11], single lncRNA expression patterns are not sufficient for accurate prediction of LUSC outcomes. Therefore, we focused on the prognostic lncRNA expression patterns in lung squamous cell carcinoma.

In the current study, seven of the 1414 differentially expressed lncRNAs were identified as associated with the overall survival of LUSC patients. Using univariate Cox regression analysis and stepwise multivariate Cox regression analysis, a novel seven-lncRNA (LINC01412, RP11-277P12.9, RP11-60H5.1, RP11-697M17.2, RP11-897M7.1, CTB-43E15.2, RP11-366H4.1) signature was established and validated to demonstrate high specificity and sensitivity in predicting the overall survival time of LUSC patients.

In order to gain a further insight into the functional roles of the seven lncRNAs, the correlation between their expression levels and the co-expressed protein coding genes was analyzed. Bioinformatic analysis revealed that 444 co-expressed protein coding genes clustered most significantly in the major histocompatibility complex (MHC) and membrane proteins in GO enrichment categories (Additional file 4: Table S4). MHC molecules exert their role in the immunological recognition and participate in destruction of tumor cells. Végh et al. reported that the loss frequency of MHC class I molecules was 36% (5 of 14 cases) in primary lung carcinomas [25]. In addition, the loss of MHC class I and MHC-encoded transporter TAP-1, which is necessary in antigenic peptide transportation, has been observed frequently in lung cancer, although no relationship between the loss of these molecules and patient survival was determined [26]. It is possible this relationship was not found due to the small sample size employed in the study. Recently, MHC II NSCLC vaccines have been reported as potential immunotherapies for a range of NSCLC patients, including LUSC [27]. Passlick et al. found that immunologically relevant cell surface molecules are frequently expressed in primary NSCLC, which is consistent with our results. However, no evidence showed how MHC molecules impacted the course of cancer [28]. Since the MHC and membrane proteins play an important role in vaccine and immune therapy target design, understanding how lncRNAs epigenetically regulate adaptive immune function through MHC and membrane proteins, subsequently affecting LUSC survival, is crucial.

Conclusion

In summary, our study identified a novel seven-lncRNA prognostic signature as a specific predictor for LUSC patients. In addition to TNM staging and qualified sampling methods to avoid bias and intratumor heterogeneity, further molecular investigations, such as exploring the underlying mechanisms of these lncRNAs in LUSC development and using independent cohorts of large sample sizes from multiple institutions, are necessary in order to confirm these predictions.