A panel of transcription factors identified by data mining can predict the prognosis of head and neck squamous cell carcinoma
Transcription factors (TFs) regulate many cancer-related activities, including cell proliferation, invasion, and migration. Measuring TF levels is thought to assist in developing strategies for cancer diagnosis and prognosis. However, owing to the lack of effective genome-wide tests, this cannot yet be carried out in clinical settings.
A complete assessment of RNA-seq data in samples of a head and neck squamous cell carcinoma (HNSCC) cohort in The Cancer Genome Atlas (TCGA) database was carried out. From the expression data of six TFs, a risk score model was developed and further validated in the GSE41613 and GSE65858 series. Potential functional roles were identified for the six TFs via gene set enrichment analysis.
Based on our multi-TF signature, patients were stratified into high- and low-risk groups with significantly different overall survival (OS) (median survival 2.416 vs. 5.934 years, log-rank test P < 0.001). The sensitivity and specificity evaluation of our multi-TF signature for 3-year OS in TCGA, GSE41613 and GSE65858 yielded values of 0.707, 0.679 and 0.605, respectively, demonstrating good reproducibility and robustness for predicting overall survival of HNSCC patients. Through multivariate Cox regression analyses (MCRA) and stratified analyses, we confirmed that the predictive capability of this risk score (RS) was independent of other factors such as clinicopathological parameters.
Using an RS derived from the expression signature of a panel of TFs, the OS of HNSCC patients can be effectively predicted and patients can be stratified accordingly.
Keywords: Head and neck squamous cell carcinoma; Transcription factors; Overall survival; The Cancer Genome Atlas
HNSCC: head and neck squamous cell carcinoma
TCGA: The Cancer Genome Atlas database
GSEA: gene set enrichment analysis
USA: univariate survival analysis
MCRA: multivariate Cox regression analyses
MCSR: multivariate Cox stepwise regression
ROC: receiver operating characteristic
Head and neck squamous cell carcinoma (HNSCC) is a solid malignancy and the sixth most common human cancer, with an annual incidence of more than 600,000. A combination of chemotherapy, radiotherapy, and adequate surgical resection has transformed HNSCC from a universally deadly disease into a potentially curable one; nevertheless, fewer than half of all patients are saved, with a 5-year survival rate < 50%. Traditional stratification schemes based on multiple clinicopathological parameters, such as the American Joint Committee on Cancer (AJCC) TNM staging system, have been recognized as the primary criteria providing prognostic guidance for the management of patients with HNSCC [3, 4]. Despite its ease of implementation and wide application, TNM staging is insufficient for forecasting prognosis in subsets of HNSCC patients, and individual variation in survival times within the same stage is considerable [5, 6]. Risk scores (RS) that capture such individual variation might guide better therapeutic strategies. An increasing body of evidence suggests that molecular risk assignments could be used to improve prognostic assessment and the identification of potential high-risk HNSCC patients [6, 7, 8, 9].
Proteins that bind to specific DNA sequences and control the rate of transcription of genetic information from DNA to mRNA are called transcription factors (TFs). Their role is to turn genes on and off and to ensure that they are expressed in the required cells, at the appropriate time, and in the required quantities. Increasing evidence suggests that deregulation of TFs characterizes the majority of human cancers, and some TFs have been associated with cancer diagnosis and prognosis [8, 9]. For example, p53 is a tumor suppressor protein, and mutations of its gene can be detected in more than half of all human cancers; c-Myc is an important oncogene that is overexpressed in some malignant cancer cells and has been associated with tumor progression and poor clinical outcome. Because of the significance of TFs in many biological processes and their aberrant activity in human cancer, we hypothesized that the expression patterns of TFs may act as potential prognostic biomarkers of cancer.
The cancer sample datasets currently accessible via the TCGA and similar resources are an abundant data source that can assist in identifying biomarker signatures and predicting disease outcomes [12, 13]. In our study, the RNA-seq data of a cohort of 502 HNSCC patients were extensively evaluated using the available TCGA datasets. Using a univariate survival analysis (USA) and a multivariate Cox stepwise regression (MCSR) algorithm, we identified six prognosis-related TFs. Based on their expression in the TCGA series, a prognostic model was built and validated in two independent series (GSE41613 and GSE65858). Further MCRA and stratified analyses were used to test whether the multi-TF signature was an independent prognostic indicator in HNSCC. Our investigation offers new insights into methods of overall survival (OS) prediction in patients with HNSCC.
Patient data extraction
Gene expression data for HNSCC were downloaded from the TCGA (https://cancergenome.nih.gov/) database. The HNSCC cohort comprised 502 tumor tissues and 44 adjacent normal tissues. The probe IDs in these datasets were converted to gene symbols based on their Ensembl gene IDs, generating a dataset containing the expression values for each gene. The corresponding patient clinical data, which include gender, age, alcohol consumption, histologic tumor grade, lymph node dissection, HPV status, TNM stage, PNI, ENE, and LVI, are displayed in Additional file 1. The GSE41613 and GSE65858 datasets were downloaded from the GEO database as external validation series. The microarray data of GSE41613 and GSE65858 were based on the Affymetrix Human Genome U133 Plus 2.0 Array platform and the Illumina HumanHT-12 V4.0 expression beadchip, respectively. Probes were matched to gene symbols with the manufacturer-provided annotation file.
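The probe-to-symbol matching described above can be illustrated with a minimal Python sketch. The text does not state how several probes mapping to the same gene were combined, so averaging them is an assumption made here for illustration only; the probe IDs, values, and the function name `collapse_probes` are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_values, probe_to_symbol):
    """Average probe-level values into one value per gene symbol.

    probe_values: dict of probe ID -> list of expression values (one per sample)
    probe_to_symbol: dict of probe ID -> gene symbol (from an annotation file)
    Probes without a gene symbol are dropped.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:  # skip unannotated probes
            grouped[symbol].append(values)
    # zip(*rows) walks sample by sample across all probes of one gene
    return {
        symbol: [mean(sample_vals) for sample_vals in zip(*rows)]
        for symbol, rows in grouped.items()
    }


# Hypothetical probe IDs and expression values, for illustration only
probe_values = {
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 6.0],
    "probe_c": [1.0, 1.5],
    "probe_d": [9.0, 9.0],  # has no annotation and is dropped
}
probe_to_symbol = {"probe_a": "HOXA1", "probe_b": "HOXA1", "probe_c": "MEIS1"}

gene_matrix = collapse_probes(probe_values, probe_to_symbol)
print(gene_matrix)
```

Other collapsing rules (for example keeping the probe with the highest mean signal) are equally common; the choice can affect downstream results and should be reported.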
Identification of predictive TFs
The RNA-seq data of HNSCC covered 18,101 coding genes, including 1639 TFs. The DESeq package in Bioconductor was used to screen for differentially expressed TFs in HNSCC (Padj < 0.05 and absolute log2FC > 1). TF expression values were transformed as log2(x + 1) of the normalized expression values for further analysis. After excluding patients without clinical survival information, 498 patients were retained for the USA, which was performed with the R survival package; TFs with a P-value < 0.01 in the USA were selected. TFs that passed this filter were further analyzed with a multivariate Cox stepwise regression (MCSR) algorithm, as described previously [14, 15]. At each step, the deletion of each variable was tested using a chosen model fit criterion. A variable was deleted if its removal caused a statistically insignificant deterioration of the model fit (F test), and deletion continued until any further removal produced a statistically significant loss of fit. Based on the selected TFs and their estimated regression coefficients in the MCRA, a risk score was then developed that combines the expression levels of six TFs (HOXA1, ZNF662, LHX1, ZBTB32, MEIS1 and HOXB8) in HNSCC specimens. In this study, the six-TF signature is referred to as the multi-TF signature.
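As a minimal sketch of the screening threshold, the log2(x + 1) transform, and a risk score built as a coefficient-weighted sum of expression values, the following Python example may be helpful. The linear form of the score is the usual one for Cox-based signatures and is assumed here; the coefficients, expression values, and the names `TF_X` and `TF_Y` are hypothetical and are not the values estimated in this study.

```python
import math


def select_de_tfs(de_results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return TF names with Padj < padj_cutoff and |log2FC| > lfc_cutoff."""
    return [
        name
        for name, (padj, log2fc) in de_results.items()
        if padj < padj_cutoff and abs(log2fc) > lfc_cutoff
    ]


def log2_plus_one(x):
    """Transform a normalized expression value as log2(x + 1)."""
    return math.log2(x + 1)


def risk_score(expression, coefficients):
    """Sum of (Cox coefficient * log2(x + 1) expression) over the signature TFs."""
    return sum(
        beta * log2_plus_one(expression[tf]) for tf, beta in coefficients.items()
    )


# Hypothetical differential expression results: name -> (Padj, log2FC)
de_results = {
    "HOXA1": (0.001, 2.3),
    "ZNF662": (0.01, -1.8),
    "TF_X": (0.20, 3.0),    # fails the Padj threshold
    "TF_Y": (0.001, 0.4),   # fails the fold-change threshold
}
selected = select_de_tfs(de_results)

# Hypothetical coefficients and normalized expression for one patient
coefficients = {"HOXA1": 0.5, "ZNF662": -0.25}
patient = {"HOXA1": 7.0, "ZNF662": 3.0}
score = risk_score(patient, coefficients)
print(selected, score)
```

A positive coefficient means that higher expression raises the score, and a negative coefficient means that higher expression lowers it.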
According to the MCSR algorithm, the RS of each patient was estimated, and patients were split into high- and low-risk subgroups using the median RS as the cutoff. This RS formula was further validated in the GSE41613 and GSE65858 datasets. Univariate Cox proportional hazards regression analyses were used to determine the predictive value of our multi-TF signature and of other traditionally evaluated clinical parameters, yielding hazard ratios and 95% confidence intervals. Multivariable Cox regression analyses were used to determine whether the RS was an independent predictor in HNSCC patients. In stratified analysis, the prognostic power of our multi-TF signature in various clinical subtypes was determined by Kaplan–Meier analysis with log-rank tests. The sensitivity and specificity of the RS were analyzed using receiver operating characteristic (ROC) analyses. For the log-rank tests, univariate survival analyses and multivariable Cox regression analyses, P < 0.05 was considered statistically significant. All statistical analyses were performed with SPSS 24.0 (IBM, Armonk, NY, USA) and R 3.5.1.
GSEA was performed using a Java program (http://software.broadinstitute.org/gsea/index.jsp) with the MSigDB C2 CP: Canonical pathways gene set collection. Cytoscape (version 3.6.0) was employed to visualize the GSEA results. In this way, we could investigate the relationship between particular gene sets and risk scores across all genes, and identify the gene sets most positively and negatively associated with the risk score according to their enrichment scores. In total, 1000 random sample permutations were carried out, with a significance threshold of FDR < 0.1 and P < 0.05.
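The median split, the Kaplan–Meier estimate, and the ROC evaluation can be sketched in plain Python as follows. This is an illustrative sketch on a hypothetical toy cohort, not the SPSS or R workflow used in the study. In particular, the AUC shown treats 3-year death as a simple binary label, which is only valid when no patient is censored before 3 years; time-dependent ROC methods are needed otherwise.

```python
from statistics import median


def split_by_median(scores):
    """Label each risk score 'high' if above the cohort median, else 'low'."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time per patient; events: 1 = death, 0 = censored.
    """
    curve, surv = [], 1.0
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def auc(scores, labels):
    """ROC AUC: probability that a case outscores a non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort (risk score, follow-up in years, death indicator)
scores = [0.2, 1.4, 1.6, 2.1, 0.5, 1.8]
times = [6.0, 1.5, 4.0, 1.0, 5.0, 2.5]
events = [0, 1, 1, 1, 0, 1]

groups = split_by_median(scores)
high = [i for i, g in enumerate(groups) if g == "high"]
km_high = kaplan_meier([times[i] for i in high], [events[i] for i in high])

# Binary 3-year outcome; no patient in this toy cohort is censored before 3 years
died_3yr = [1 if (t <= 3 and e == 1) else 0 for t, e in zip(times, events)]
auc_3yr = auc(scores, died_3yr)
print(groups, km_high, auc_3yr)
```

Comparing the Kaplan–Meier curves of the two groups with a log-rank test would complete the analysis; that test is omitted here for brevity.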
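To make the notion of an enrichment score concrete, the following Python sketch computes an unweighted running-sum statistic over a ranked gene list. It is a simplified illustration only: the GSEA software also offers weighted statistics and estimates significance by permutation, neither of which is reproduced here, and the gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov style).

    ranked_genes: genes ordered from most positively to most negatively
    associated with the phenotype (here, the risk score).
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # step up on a gene-set member, step down otherwise
        running += 1 / n_hits if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list and gene sets
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})      # members at the top
es_bottom = enrichment_score(ranked, {"g5", "g6"})   # members at the bottom
print(es_top, es_bottom)
```

A gene set concentrated at the top of the ranking gives a positive score, and one concentrated at the bottom gives a negative score, which corresponds to the positively and negatively associated gene sets described above.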
Development of a multi-TF predictive model in the TCGA series
Six TFs that were significantly correlated with overall survival in HNSCC patients
Validation of the multi-TF signature
Determination of independent predictive activity of the multi-TF signature
Table: Univariate and multivariate Cox regression analyses in TCGA cohort
Columns: variable; univariate analysis, HR (95% CI); multivariate analysis, HR (95% CI)
Rows: Age (≤ 60 vs. > 60); Lymph node neck dissection (yes/no); Histologic grade (G1/G2/G3/G4); TNM stage (I/II/III/IV); Lympho-vascular invasion (no/yes); Perineural invasion present (no/yes); Extranodal extension (no/yes)
Identification of 6-TF signature correlated with biological pathways and processes
Currently, TNM staging is one of the main parameters that help clinicians determine patient outcomes and plan treatments; nevertheless, variation in outcomes suggests that clinical features cannot fully account for the phenotypes of different potential subtypes [3, 4, 16]. Oncogenesis proceeds through several stages that require modifications in gene expression programs. TFs play important roles in controlling these programs, so their dysregulation is one reason for the acquisition of tumor-associated properties. Previous studies [19, 20] reported that the expression patterns of TFs may be an effective means of grading tumor subtypes. However, to date, TF-based expression profiles in HNSCC have not been clarified.
Our study aimed to identify a TF expression signature that could predict outcomes for individual HNSCC patients. To this end, we evaluated the prognostic significance of all differentially expressed TFs in HNSCC, which were chosen on the basis of the USA of the RNA-seq data retrieved from TCGA. Unfortunately, the need to measure a large number of genes reduces the efficiency of prognostic biomarkers in clinical applications. Therefore, a multi-TF signature was identified using an MCSR algorithm. This was more effective than individual TFs because predictive potential was maximized while the number of predictors was reduced [14, 15, 21, 22]. The results of MCSR led us to construct a model consisting of six TFs that forecasts the survival time of HNSCC patients.
Among these TFs, HOXA1 was previously reported as an oncogene in HNSCC. Upregulation of HOXA1 promoted the migration and invasion of HNSCC cells via the EMT pathway. More importantly, high levels of HOXA1 were found to be linked with poor prognosis of HNSCC. This finding accords with our results. Another candidate, HOXB8, is, like HOXA1, a member of the HOX family and has been significantly linked with tumor metastasis and shorter overall survival in many human cancers [24, 25, 26]. Further investigation revealed that HOXB8 was a predictor of the effects of FOLFOX4 chemotherapy in metastatic colorectal cancer. Therefore, we hypothesize that HOXB8 may act as an oncogene in HNSCC progression; further investigation of this hypothesis is needed. Aberrant expression of ZNF662 caused by epigenetic changes via DNA hypermethylation has been reported as a valuable biomarker of tumorigenesis and advanced HNSCC. In our study, ZNF662 was expressed at low levels in HNSCC and was associated with shortened survival. Down-regulation of MEIS1 modulated the leukemic cell response to chemotherapy-induced apoptosis. This suggests that MEIS1 may participate in the regulation of chemoresistance in HNSCC and may be a potential target for anti-HNSCC drugs in the future. Additionally, LHX1 was reported as a driver gene in clear cell renal cell carcinoma, affecting proliferation and apoptosis and promoting tumor growth. In the present study, upregulation of LHX1 was an indicator of poor prognosis in HNSCC. A recent study showed that ZBTB32 facilitated targeting of the transcriptional repressor Zpo2 to the GATA3 promoter to downregulate GATA3 expression and activity. Modulation of GATA3 by ZBTB32 in turn caused the development of aggressive breast cancers. In our study, loss of ZBTB32 was associated with shortened survival time in HNSCC.
Taken together, the Kaplan–Meier and ROC analyses demonstrated that the expression of these TFs was a powerful predictor of prognosis in HNSCC, suggesting its potential research value in the context of HNSCC.
Previous simulations have shown that prognostic models that are significantly linked with survival times in the training data set can also be developed when using an entirely independent dataset. In this study, the usefulness of this multi-TF signature was validated in the non-overlapping GSE41613 and GSE65858 cohorts, indicating good reproducibility of this multi-TF signature in HNSCC.
Multivariate analysis showed that PNI and ENE were independent clinicopathological factors for predicting the risk of HNSCC. Perineural growth is an unusual mode of tumor cell growth that does not simply follow the path of least resistance; it indicates a high risk of postoperative recurrence and is an important factor for poor prognosis in HNSCC. ENE is defined as tumor cells infiltrating extranodal tissues beyond the capsule of affected lymph nodes. It is a characteristic of more aggressive cancer and is associated with shortened survival. In stratified analysis, we found that the multi-TF signature remained a powerful predictor of prognosis within these subsets, suggesting that our multi-TF signature was independent of these important clinicopathological parameters. This result implies that our multi-TF signature has the potential to enhance clinical prognostic tests, which would assist in improving patient stratification and treatment planning in future trials.
As with all research, our study has its limitations. First, owing to limited data, we could obtain gene expression profiles for only 1639 of the thousands of known and predicted TFs. In addition, some clinical information was incomplete, which made our study susceptible to inherent biases. Finally, while GSEA was used to investigate the biological processes associated with the identified TFs, further studies are required to investigate their specific roles in cancer.
In summary, by combining RNA-seq data with patient outcomes, we generated a powerful prognostic signature based on the expression patterns of 6 TFs. This multi-TF signature can predict the prognosis of patients with HNSCC in the TCGA dataset and was further validated in independent datasets. More importantly, our 6-TF signature retained its predictive ability in tumor subtypes with varying clinicopathological parameters. Therefore, we show that the 6-TF signature is a potential method for predicting outcomes in HNSCC patients. It could also help with patient stratification on the basis of predicted therapeutic responses.
XZ designed the study. BZ and ZG participated in data downloading and preliminary analysis. BZ planned and wrote the manuscript with the help of HW. XZ and HW critically revised the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 27. Li ST, Lu XG, Chi P, Pan J. Identification of HOXB8 and KLK11 expression levels as potential biomarkers to predict the effects of FOLFOX4 chemotherapy. Future Oncol. 2013;9:5727–36.
- 28. Zhao C, Zou H, Zhang J, Wang J, Liu H. An integrated methylation and gene expression microarray analysis reveals significant prognostic biomarkers in oral squamous cell carcinoma. Oncol Rep. 2018;40:52637–47.
- 29. Rosales-Avina JA, Torres-Flores J, Aguilar-Lemarroy A, Gurrola-Díaz C, Hernández-Flores G, Ortiz-Lazareno PC, Bravo-Cuellar A. MEIS1, PREP1, and PBX4 are differentially expressed in acute lymphoblastic leukemia: association of MEIS1 expression with higher proliferation and chemotherapy resistance. J Exp Clin Cancer Res. 2011;30:112.
- 34. Mascitti M, Rubini C, De Michele F, Balercia P, Girotto R, Troiano G, Santarelli A, et al. American Joint Committee on Cancer staging system 7th edition versus 8th edition: any improvement for patients with squamous cell carcinoma of the tongue? Oral Surg Oral Med Oral Pathol Oral Radiol. 2018;126:415–23.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.