Background

Lung cancer represents 12.7% of all new cancer cases and is also the most frequent cause of cancer-related deaths worldwide accounting for 18.2% (Ferlay et al. 2010). Tobacco smoke exposure accounts for the majority of lung cancer cases; however, almost 25% of worldwide cases occur in lifelong never smokers (Lee et al. 2011). Never smokers (NS) are defined as individuals who have smoked less than 100 cigarettes during their lifetime (Couraud et al. 2015). Consequently carcinogens in cigarette smoke are presumed to induce carcinogenesis in smokers; however, no such affront exists in NS implying an alternative mechanism. Thus, lung cancer in lifelong never smokers (LCINS) is considered a separate entity with adenocarcinoma predominance (Yano et al. 2008), over squamous cell carcinoma (SCC) (Bhopal et al. 2019). The UK Million Women Study found three significant risk factors out of 34 potential ones, to be associated with an increased incidence of LCINS and included non-white ethnicity, asthma requiring treatment and tall stature (Pirie et al. 2016).

Major contributors to lung cancer include secondhand tobacco smoking, especially with the changes in cigarette manufacturing practices and smoking habits allowing deeper aspiration (Gridelli et al. 2015), occupational carcinogen exposure, pollution, X-ray radiation and radon exposure (Field and Withers 2012). In populations with lung cancer, the 5-year survival varies from 4 to 17% depending on the stage and regional differences (Hirsch et al. 2017). Three studies have supported the finding that smoking duration rather than intensity serves as a predictor for lung cancer and the differences between the histological types. Additionally, the association of lung cancer with tobacco smoking differs by histological type according to the exposure of the respiratory tract site to tobacco smoke particles, with lower exposure to the peripheral and subpleural regions. In small cell and squamous cell carcinoma, a rapid increase in risk was observed with the number of cigarettes smoked following smoking cessation, whereas to a lesser extent in adenocarcinoma (Kenfield 2008; Doll and Peto 1978; Flanders et al. 2003). Nicotine-mediated tumour promotion is made possible through the activation of the α7 subunit of the nicotine acetylcholine receptors (α7-nAChR) or β-adrenergic receptors activating the signalling cascade. In NSCLC, the expression of (α7-nAChR) is higher in squamous carcinoma smokers than in adenocarcinoma (Pezzuto et al. 2019).

In NSCLC, local progression and metastasis has been associated with epithelial to mesenchymal transition (EMT). In adenocarcinoma and SCC, there is a strong correlation between the expression of EMT biomarkers and an advanced stage and poor differentiation (Mahmood et al. 2017). Little is known regarding the epigenetic changes of the molecular mechanisms involved in EMT in NSCLC. Cigarette smoking plays a vital role in promoting EMT and is associated with poor survival, cell migration and invasion in NSCLC through the deregulation of E-cadherin. Cigarette smoke condensation (CSC) induces EMT by downregulating E-cadherins, which are considered tumour suppressors and important molecular markers in smoking patients (Nagathihalli et al. 2012).

Robust sex differences in terms of higher incidence of DNA damage in men compared to women exist with a male cancer incidence predominance (Dorak and Karpuzoglu 2012). While these differences might be mediated through acute sex hormonal action paralleling the variation in circulating sex hormone concentrations, factors independent from acute hormone action underlie this discrepancy. However, there is insufficient data to make a clear determination regarding the impact of sex, primarily due to the fact that sex differences have not been exclusively analysed. With LCINS being the seventh most prevalent in the world, a higher proportion also seems to be amongst women in comparison to men (Bhopal et al. 2019). Oestrogen receptor α and β are more frequently expressed in lung tissue compared to normal tissue, whereas progesterone receptors are less frequently expressed in cancerous tissue, with the receptor’s expression profile possibly correlating with the outcome (Couraud et al. 2012). In brain tumour glioblastoma, sex differences play a role in patient outcomes beyond hormonal influences, appearing to be intrinsic to the tumour cells. Yang et al. proposed that patient sex correlates with prognosis as well as treatment (Yang et al. 2019).

Recent work identifying further mechanisms of acquired resistance in NSCLC necessitates additional detailed understanding of the biological underpinnings of this tumour phenotype. An improved understanding of LUAD tumour biology accounting for smoking status and sex will allow a better understanding of available well-known modulators of treatment response and toxicity. FDA has been recommending systematic analysis of sex differences that will allow precision oncology rather than just exploring molecular markers. This paper presents a few major genetic changes occurring in current and non-smokers LUAD tumour samples and examining the effect of smoking on both sexes and EMT. Our approach to a pathway-specific gene association analysis will help form a better view of the accumulative effect of group functionally related genes aiding in revealing the transcriptional program accounting for the variability in cancer phenotype.

Materials and methods

Data collection

The expression profile data of lung adenocarcinoma level 3 normalised (illuminahiseq_rnaseq) including 522 samples with their corresponding clinical information were manually retrieved from The Cancer Genome Atlas (TCGA) database (http://gdac.broadinstitute.org/ data status of Jan 28, 2016). For validation, the gene chip GSE10072 of LUAD with its clinical manifestation data was also downloaded from the Genome Expression Omnibus (GEO) database for 180 patients (https://www.ncbi.nlm.nih.gov/geo/). Raw data has been processed, and patients with missing clinical/expression values were excluded from the analysis. Finally, the study included 146 patients from TCGA and 45 probes from GEO databases. Further grouping according to smoking status included a permissible value of 1 for lifelong non-smokers (NS) and 2 for current smokers (CS). We excluded reformed smokers to adjust for smoking because residual confounding could lead to some of the excess hazards amongst the groups. Our study included 54 NS of which 13 were males and 41 females and 92 CS of which 59 and 33 were male and female LUAD patients, respectively. A series of clinical traits for TCGA patients are shown in Supplementary Table S1.

In silico analysis

To further explore the optimal molecular subgroups in LUAD, unsupervised hierarchical clustering was performed on the patient tumour samples (T), with Pearson correlation and the pairwise complete-linkage method, separately on the subgroups (NS and CS). ExpressCluster software (CBDM Bioinformatics, Scott Davis n.d) was used to find common and unique expression profiles of genes showing differentiation between the patient T subgroups. Profiles indicating contrast between CS and NS subgroups were considered significant. Clustering was also performed by applying K-means++ algorithm, Z-norm signal transformation and rank correlation distance metric. p values were corrected using Benjamini and Hochberg adjustment to minimise errors, and adjusted p values < 0.05 were considered significant. Profiles indicating contrast between CS and NS were considered significant. To investigate the molecular differences amongst the groups, associated clinical manifestations including history of smoking, treatment response, sex, age, stage and vital status were analysed.

Multivariate and principal component analysis

Multivariate analysis (MFA) was used to estimate the relationships amongst gene expression patient subgroup correlates, in order to identify partitioning of patient subgroups. MFA standardises variables in predefined blocks of data calculating the global axes, which are the linear combination of the original parameters, thus maximising the global data variance. MFA was computed using the R packages FactoMineR and factoextra (Lê et al. 2008). Principal component analysis (PCA) was used to reduce the dimensionality of the subsets, creating uncorrelated variables and maximising variance (Jolliffe and Cadima 2016) between subgroups.

Gene set enrichment analysis

Gene set enrichment analysis (GSEA) (Subramanian et al. 2005) was carried out to determine whether the identified set of genes showed significant molecular functions between the patient subgroups. Datasets and phenotype label files were created (defined either as CS or NS and males or females) and loaded onto the GSEA software (v3.0). Enrichment analysis was performed according to the expression data for NS and CS, separately, to all 20,501 genes in terms of canonical pathway GeneSet (CP), cellular components (CC), molecular function (MF) and transcription factor targets (TFT) obtained from the Molecular Signatures Database (MSigDB). The number of permutations was set to 1000, and a ranked list metric was generated using t test yielding the functions involved in both the high and low expression groups. The nominal p value of enrichment results cut-off was 0.05, while gene clusters with false discovery rate, FDR < 0.25, were considered statistically significant. FDR < 0.25 was chosen rather than 0.05 as according to GSEA, an FDR < 0.25 is likely to generate interesting hypotheses for further research (Subramanian et al. 2005). Random seed for permutation to achieve repeatability was 779,948,241, with the parameter of permutations set as a 1000.

Weighted gene correlation network analysis

WGCNA R package was used to construct a scale-free network from gene expression data (Langfelder and Horvath 2008). Its algorithm was used to screen out the power value in the construction of modules. The appropriate power (9) was used following the use of the gradient method testing the independence and the average connectivity degree of different modules. We made sure to select the power when the degree of independence R2 was greater than 0.8 (R2 = 0.94 and slope = − 1.48). We then proceeded to construct the modules with minModuleSize set as 30 to achieve high reliability of the results. Following that, the correlation was transformed into adjacency matrix then into a topological overlap measure (TOMplot). Heatmap tools package (Babicki et al. 2016) was used to analyse the strength of the interactions. Module-trait associations were estimated using the correlation between the module eigengene and the smoking status (clinical trait), which allowed identification of module (expression set) highly correlated to the phenotype. Gene significance (GS) for each expression profile was calculated as the absolute value of the correlation between the expression profile and the trait; module membership (MM) was defined as the correlation of expression profile and each module eigengene.

Functional enrichment analysis of the co-expression modules

The constructed modules had different numbers of genes, and functional enrichment analysis was performed on the significantly distinct modules. Gene ontology (GO) analysis results from GOenr (enrichmentAnalysis) in WGCNA were extracted out. p value ≤ 0.05 after correction was used as a threshold. GO terms that were significantly overrepresented in a set of genes were then visualised in Cytoscape 3.5 (Shannon 2003), to further explore functions and pathways related to DEGs in the subgroups. All metabolic pathways generated were further subjected to enrichment and topological analysis using Java Enrichment of Pathways Extended to Topology (JEPETTO) plugin, and co-expressed modules were constructed by the Cytoscape software to identify hub genes. JEPETTO allows identifying novel and significant relationships using protein-protein interaction networks based on KEGG database (Winterhalter et al. 2014). It helped us analyse the functional association of the top-ranked underlying networks between DEGs and known cellular pathways. ViSEAGO R package (Brionne et al. 2019) was also used for biological interpretations of biological processes (BP) category using the GO public database. Associated gene terms were retrieved from EntrezGene database (Nov 2019) for Homosapiens ID 9606, and experimental GO terms from orthologous genes were added. Enrichment tests were performed using Fisher test, subsequently with classic, elim, weight and weight01 algorithms. We used several algorithms to validate the robustness of enrichment. Enriched GO terms (p < 0.01) were grouped into functional clusters using hierarchical clustering with the help of GOSemSim (Yu et al. 2010) based on Wang’s semantic similarity between GO terms respecting GO graph topology and Ward’s criterion. Further, these functional clusters were grouped using hierarchical clustering based on BMA distance between sets of GO terms and Ward’s criterion.

Statistical analysis

Mutation and the frequency of copy number alterations (CNA) were analysed for association between smoking groups. Mutation and CNA were downloaded from cbioportal website (https://www.cbioportal.org/) for lung adenocarcinoma (TCGA, Firehose Legacy); data was merged according to the patient ID. The Pearson’s chi-squared test was used to compare the mutation and CNA frequency between the subgroups. A Bonferroni-corrected p value of < 0.05 was used to determine the significance threshold for all association tests. Differences in the EMT marker gene expression signatures according to patient category (both sex and smoking) were examined using the Kruskal-Wallis analysis of variance, followed by the Dunn post hoc with Bonferroni correction for multiple comparisons. p < 0.05 was considered statistically significant. Computation was performed using the dunn.test R package, as well as GraphPad 5.01 (GraphPad Prism7 Software n.d, CA, USA).

Results

Unsupervised cluster analysis identifies six subgroups for LUAD current and non-smoker patients

Unsupervised hierarchical clustering analyses of 120 CS and 75 NS tumour samples were conducted with 20,501 genes downloaded from TCGA. These lung tumour samples were then divided into six subgroups, three each, based on molecular signature differences as shown in Supplementary Fig. S1. CS subgroup 1 (C1) contained 54 samples, subgroup 2 (C2) contained 38 samples and subgroup 3 (C3) contained 28 samples, whereas for NS, subgroup 1 (N1) contained 17 samples, subgroup 2 (N2) 37 samples and subgroup 3 (N3) 21 samples.

Classification of the patient subgroups and identification of DEGs

Gene set enrichment analysis (GSEA) allowed identification of genes contributing to the differentiation of the smoking status of the subgroup tumour samples. The 100 best genes (obtained from GSEA heatmaps) differentiating the compared CS vs NS subgroups were extracted following 9 different group combinations (Supplementary Fig. S2). PCA was used to explore the 6 patient subgroups, assessing the variation across each category (CS or NS) according to smoking status (Supplementary Fig. S3). Subgroups C1, C2 combined verses N1 and N2 also combined were mutually independent and highly clustered in intra-subgroups, thus dividing them based on their molecular differences. Subgroups N3 and C3 were eliminated from further analysis, rendering four subgroups best suitable to discriminate LUAD tumour samples according to their smoking status (Supplementary Fig. S4). We then identified from ExpressCluster 12 clusters showing unique expression profiles differentiating C1, C2, N1 and N2 subgroups yielding 2454 DEGs. Multivariate analysis of 2454 gene expression revealed clear distinction between C1, C2, N1 and N2 (Supplementary Fig. S5).

Gene co-expression network reveals 1150 DEGs corresponding to smoking status

Expression values of the 2454 genes were used to construct the co-expression modules by WGCNA package tool. To identify modules of these genes, the topological overlap-based dissimilarity was determined and used as a distance measure with a constant height cut-off value (6 × 105) (Supplementary Fig. S6). Soft threshold power was set to 9 in which R2 was 0.94 to ensure a scale-free network (Supplementary Fig. S7). This resulted in merging of highly correlated eigengenes, and the 9 best modules differentiating between CS and NS were chosen (Supplementary Figs. S8 and S9). We then combined the module genes and removed duplicates generating 1150, permitting the identification of distinct transcription modules for subgroups (CS and NS) and considered as our candidate genes.

Functional annotation analysis of DEGs

To gain insights into the biological functions that may be affected due to tobacco exposure, we performed functional analysis based on gene ontology (GO) using GSEA, JEPETTO plugin in Cytoscape and ViSEAGO package in R. GSEA enrichment of gene sets according to GO functional terms associated with cellular process led to the identification of processes mostly relating to ECM regulation and replication fork (Supplementary Table 2 and Supplementary Fig. 8), while cell cycle, DNA repair, DNA replication and proliferation and E2F pathway and transcription family targets for canonical pathways (Supplementary Tables 3 and 4 and Supplementary Fig. 10). The biological process controlled by the genes in the “turquoise” module was found to be mainly enriched in cell cycle regulation, division, DNA replication, repair and E2F pathway, while “grey” module genes were associated with proliferation, transcription, cell to cell signalling and P53 downstream signalling pathway and other modules such as “magenta” with cytoskeletal organisation, “dark orange” with skeletal development and “black” with immune responses (Supplementary File 1). Pathway analysis using JEPETTO plugin in Cytoscape revealed that cell cycle, DNA replication and repair pathways (MMR, BER, NER), HR, ErbB GnRH, TGF-β, T cell receptor and VEGF signalling pathway as major pathways varying with different constitution for CS and NS analyses and asthma (Table 1). However, when analysed separately, overexpressed genes in CS seem to correlate with mismatch repair, cell cycle, NHEJ and ABC transporters, while in NS, only asthma seemed to be significantly enriched (Supplementary Table 5).

Table 1 Genes involved in major cancer-related pathways coloured according to increased expression, CS in red and NS in blue with their corresponding NCBI Reference Sequence accession number

Smoking dysregulates cell cycle

Of our 1150 genes JEPETTO analysis for cell cycle related genes (CD22, BTK, PIK3R5, PIK3CG, CD19, PRKCB, INPP5D, NFATC2, PIK3R1, PIK3AP1, FCGR2B and RAC2) showed that their expression was elevated in NS compared to CS. The expression levels of these genes PIK3R5, PIK3CG, PIK3R1 NFATC2 and RAC2 in the NS cohort also seemed to be elevated in mismatch repair, homologous recombination, DNA replication, VEGF signalling pathway, T cell receptor signalling pathway, ErbB signalling pathway and GnRH signalling pathway (Table 1). GSEA analysis of the 1150 genes brought up REACTOME_E2F_MEDIATED_REGULATION_OF_DNA_REPLICATION (Table 2).

Table 2 Reactome pathways from functional enrichment STRING for 1150 DEGs, cell cycle and E2f related genes (CS in red, NS in blue)

We also found 113 genes relating to cell cycle (Table 3) to be overexpressed in our CS cohort; 28 were identified with a log fold change over 1.5. Further GO analysis of the 113 genes using ViSEAGO R tool returned 46 clusters, relating mostly in to DNA replication, cell cycle checkpoint, regulation and response to chemotherapy drugs (Fig. 1), in particular GO:0051726, regulation of cell cycle; GO:0010564, regulation of cell cycle process; GO:0051726, regulation of cell cycle; and GO:1901987, regulation of cell cycle phase transition (Supplementary File 2). Additionally, BLM and RAD51 were statistically significant and enriched with overexpression in CS for GO:1901563, response to camptothecin; GO:0097329, response to antimetabolite; and GO:0072710, response to hydroxyurea (Supplementary File 2) (Supplementary Table 14).

Table 3 113 cell cycle-related genes with log fold change values LogFC < 1.5
Fig. 1
figure 1

Functional analysis of 113 genes. The clustering heatmap plot of functional sets of gene ontology (GO) terms was obtained using ViSEAGO showing the major biological processes, cluster number and number of genes in each cluster. Based on BMA semantic similarity distance and Ward’s clustering criterion

Analysis of multiple omics data

Of the 1150 DEGs, we identified 130 genes with log fold change − 1.5 < log2FC < 1.5 and reduced them to 25 genes with a log fold change of − 2 < log2FC < 2 (Supplementary Table 6), indicating distinct molecular signatures between CS and NS LUAD patients with biological differences amid them between (Table 4). The log2 ratio is interpreted as the average log fold change in gene expression between the smoker groups. Fifteen genes seem to be overexpressed in NS and 10 in CS. Enrichment analysis of those 25 genes did not bring up significant metabolic pathways, but a few of them seem to be reported in the literature as tumour suppressors or involved in EMT process. SCGB1A1 seems to be an important defence molecule of the lung and one of the top hypermethylated genes in smokers. PRH2 has not been reported in the literature regarding any inflammatory diseases, however showed a log2FC of 2.76 in CS.

Table 4 1150 DEGs with log fold change values − 2 < LogFC < 2

Characteristics of LUAD patients related to different EMT profiles in association with smoking status

We performed dimensional grouping through MFA of LUAD patients according to the smoking status of the different subgroups. Our results show a contradictory pattern for EMT members between NS and CS. ITGA5, TCF3, SMAD3, KRT5, SNAI1, KRT18, DSP and SMAD2 are overexpressed in CS and underexpressed in NS, whereas VIM, SDC1, CDH11, MUC1, MMP2, ZEB2, ZEB1, LAMA2, TCF4, DDR2, CDH1, LAMA4, MMP3 ITGB6 and OCLN are underexpressed in CS and overexpressed in NS (Fig. 2). Kruskal-Wallis analysis on all 43 EMT genes identified 25 to be statistically significant. None showed significance across all 6 group combinations (C1C2, C1N1, C2N1, C1N2, C2N2, N1N2). However of the statistically significant EMT genes, the Dunn post hoc test for both smoking subgroups C1 and C2 showed significant difference in the expression of KRT8 (p < 0.0108) and TCF3 (p < 0.0134), while LAMA2 (p < 0.0281), LAMA3(p < 0.010), LAMA5 (p < 0.05) and TJP1(p < 0.0423) differed for both never smoker subgroups (N1, N2) (Supplementary Table 13).

Fig. 2
figure 2

The heatmap for EMT markers in current and never smoker LUAD patients, showing contrasting profiles of expression between the smoking groups for some genes

Gene expression analysis with special reference to EMT markers according to smoking status and sex

Following analysis of 144 EMT markers, hierarchical clustering identified 57 genes with differing expression in relation to smoking status and gender in our patient groups: female never smokers (FN), male never smokers (MN), female current smokers (FC) and male current smokers (MC) (Fig. 3). We found 4 genes to be statistically significant between all four groups using Kruskal-Wallis and Dunn post hoc test (FN, FC, MN, MC): OLFM1 (p < 0.0001), HEY2 (p < 0.0378), SFRP1 (p < 0.0109) and STRAP (p < 0.0001). HEY2, OLFM1 and STRAP showed statistically significant difference between FN and FC groups, OLFM1 and STRAP between groups FN and MC and, finally, SFRP1 for MN and MC (Supplementary Table 7).

Fig. 3
figure 3

The heatmap for EMT marker genes showing contrasting profiles of expression in relation to smoking status and gender in our patient groups: female never smokers (FN), male never smokers (MN), female current smokers (FC) and male current smokers (MC)

Gender-related differences in relation to smoking status

We also performed GSEA enrichment analysis for male and female LUAD patients. In males of the significantly enriched datasets included “calcium mediated signalling”, “MYD88 dependent TLR receptor signalling pathway”, “Negative regulation of IL-6 production”, “Regulation of PI3K”, “Inositol phosphate biosynthetic process”, “Response to wounding”, “Respiratory burst”, “Sialyation”, “TLR4 signalling pathway” and “Positive regulation of stem cell differentiation”, whereas within female patients of the upregulated gene sets seemed to be focused around cell division, cell cycle regulation, nuclear division, chromatin remodelling and segregation and extracellular matrix organisation (Supplementary File 3). The log fold change between both sexes was also calculated, with PLA2GA and PRG4 presenting with a 1.5 < log2FC < − 1.5 in both NS and CS groups (Supplementary Table 8). Additionally DNA repair pathways BER, MMR and HR and DNA replication had a higher gene expression in males compared to females (Supplementary Table 9).

Mutation analysis in different smoking subgroups

The total mutation count was found as expected to be higher in CS with a total of 10,471 in the 92 patients, when compared to a total of 1662 mutations in the 55 NS patients (data not included). We also found 7 significant gene mutation in all 4 subgroups EGFR p value = 0.002245, LRP1B p value = 0.006958, ZFHX4 p value = 0.006958, MUC16 p value = 0.007124, RYR2 p value = 0.02769, NAV3 p value = 0.03284 and TP53 p value = 0.04229 (Supplementary Table 10). BRAF mutation was also only found in CS with none at all in our NS cohort, in parallel with a few other studies (Paik et al. 2011). There are several other studies (Chapman et al. 2016; Midha et al. 2015; Dias et al. 2017; Takamochi et al. 2013) supporting our result with the higher incidence of EGFR mutation in NS groups compared to current/heavy smokers.

Discussion

The goal of this study was to examine the consequences of tobacco smoking on lung adenocarcinoma (LUAD) transcriptome differentiation. Here we highlight that smoking promotes gene expression changes critical to tumour differentiation rather than initiation. Lung cancer is the global cause of death and 80–90% of these cases attributable to tobacco exposure. With over 60 carcinogens in cigarette smoke, the effects of tobacco on lung cancer genome are well established, making it a major influence on mutational burden driven by the exogenous carcinogens and endogenous DNA damage (Yoshida et al. 2020). Smoking lung cancer patients have the highest somatic alterations, approximately 10 times as many as those from non-smokers (Vogelstein et al. 2013), with G to T transversion being a molecular signature of tobacco smoke (TS) mutagens in lung cancer due to PAH exposure (Pfeifer et al. 2002). Additionally, TS increases driver mutations, cell-to-cell heterogeneity and other indirect effects such as inflammation, immune suppression and infection (Yoshida et al. 2020).

This is the first study, to our best of knowledge, to show systemic differences in gene expression of major processes involved in carcinogenesis including: cell cycle; DNA repair pathways (MMR, HR, BER, NER); DNA replication; and ErbB, GnRH, VEGF, TGF-β and T cell receptor signalling pathway between LUAD tumour samples in relation to smoking status. We also found previously reported asthma associated genes amongst our DEGs with elevated expression in non-smokers (NS). Asthma is suggested to increase lung cancer risk perhaps through damage caused by trauma or inflammation. Even non-smoking asthmatic patients seem to have an increased risk for lung cancer (Pirie et al. 2016; Qu et al. 2017). Damage caused by smoking amongst other factors affects DNA repair pathways and thus is linked to the genetic predisposition to cancer in several different tissues. For instance, inherited mutations in DNA mismatch repair (MMR) genes lead to autosomal dominant inheritance of hereditary nonpolyposis colorectal cancer (HNPCC) or Lynch syndrome (Arora et al. 2017) and homologous recombination (HR) in Fanconi anaemia, Nijmegen syndrome and Bloom syndrome. These syndromes all result from chromosomal instability with the genes identified involved in aspects of DNA damage repair (Taylor 2001). Moreover, HR contributes to the repair of DSB in mammalian cells and is necessary for cell viability when exposed to ionising radiation which occurs during late radioresistant S phase of the cell cycle and the rather not radioresistant G2 phase of the cell cycle (Zhong et al. 2016). We share the opinion of other researches that partial or temporary inhibition of recombination processes merged with the radiotherapy may be an approach to cancer treatment in the nearby future.

In lung cancer malignancy, disruption of the normal cell cycle regulation is amongst the critical altered pathways due to TS exposure. TS components induce nicotine acetyl-choline receptor (nAChR) signalling, activating cell surface receptors such as β-adrenergic and EGF receptors stimulating tumour promoting cascade. This interaction facilitates tobacco-induced cancer progression, an important genomic alteration in lung cancer. It has also been evident that TS represses negative regulators of cell cycle progression including cyclin-dependent kinase (CDK) inhibitors. Two classes of CDK inhibitors are capable of arresting cells in G1 phase halting cell cycle progression by preventing their tumour suppressor retinoblastoma (RB) phosphorylation. Functional inactivation of RB protein induced by nAChR leads to deregulated E2F activity, corresponding with aberrant cell proliferation driving cell cycle progression and inhibiting components that arrest cell cycle (Schaal and Chellappan 2014). Inactivation of RB1 is involved in the development of NSCLC (Imai et al. 2004). Genome replication and progression through each cell cycle division is dictated by the cyclin-dependent kinase (CDK)-RB-E2F axis, the cell cycle progression transcriptional machinery. Disruptions in any of (CDK)-RB-E2F axis components lead to a heightened mitogenic E2F activity and uncontrolled proliferation (Kent and Leone 2019).

Our GSEA analysis showed significantly elevated expression of E2F genes family and its targets in LUAD tumour samples associated with TS exposure. Thus, we analysed CDK-RB-E2F genes expression, including but not limited to AURKB, BUB1, CDCA2 and DLGAP5. Our findings confirmed previous reports showing that AURKB gene is frequently overexpressed in tumour samples of lung cancer patients, a key regulator of mitosis associated with poor prognosis (Yu et al. 2018), possibly reflecting the requirement for increased mitotic spindle genes expression for increased replication maintaining a tumorigenic phenotype (Al-Khafaji et al. 2017). Our analysis showed that AURKB expression is three times higher in tumour samples from smokers than never smokers. Interestingly, AURKB expression in newborns was found to be related to the mother’s smoking status, with observed decreased expression in offsprings of mothers who quit the use of tobacco during pregnancy (Nguyen et al. 2019). We also found the expression of BUB1 to be 2.5 times higher in CS tumour samples. BUB1 codes the mitotic checkpoint protein serine/threonine kinase which plays an important role in chromosome segregation (Han et al. 2015). Bidkhori et al. established that the overexpression of BUB1—involved in cell division—amongst other genes leads to tumour progression in LUAD despite of patients’ smoking status (Bidkhori et al. 2013). Both proteins AURKB and BUB1 are linked in literature with the poor prognosis of different cancers (Davidson et al. 2014; Ricke and van Deursen 2011). Another cell cycle associated gene which we found to be 2.5 times overexpressed in CS tumour samples is CDCA2. Shi et al. observed upregulated levels of CDCA2 gene in vivo and in vitro in lung carcinoma cells with adjacent normal tissue. Furthermore, they noticed that knockdown of CDCA2 in LUAD tissue inhibits tumour cells by G1 phase arrest, therefore CDCA2 seems to play a significant role in adenocarcinoma progression (Shi et al. 2017, 2). All four (AURKB, BUB1, DLGAP5 and CDCA2) described genes emerging from our analysis are regulated by the p53-DREAM pathway. DREAM is a transcriptional repressor that binds to E2F and participates in the control of all checkpoints from DNA synthesis to cytokinesis including G1/S, G2/M and spindle assembly checkpoints. According to numerous studies, downregulation of DREAM target genes promotes general loss of checkpoint control, chromosomal instability and aneuploidy of cancer cells (Engeland 2018).

Landi et al. showed that mitotic genes involved in cancer development are deregulated by TS; this applies specifically to genes that regulate the mitotic spindle formation (Landi et al. 2008). Our results run in parallel with theirs regarding PRC1, MAD2L1, ASPM, RACGAP1, CCNA2, MKI67, KIF2C, TPX2, BIRC5, TTK, NEK2 and CENPF, where they were all found to be overexpressed in CS-related tumours (Supplementary Table 6). These data shed a light on characterising the molecular profile differentiating tumours in the context of TS in LUAD. It has already been shown that TS, nicotine in particular, upregulates BIRC5 expression in NSCLC possibly inducing an early developmental stage in adenocarcinoma CS compared to NS (Hirano et al. 2015). We identified higher expression in CS of 1.5 log fold change for BIRC5. TTK, a key component of the spindle assembly checkpoint, is linked to mitosis through EGFR gene which is frequently altered in lung cancer (Landi et al. 2008). ESPL1 was previously found to be overexpressed in lung adenocarcinoma smoking patients (He et al. 2018), and Zhang et al. suggested that it may be combined with other genes including BUB1 and E2F1 to form a regulatory network causing abnormalities of cell proliferation, transport and metabolism hence facilitating tumour progression (Zhang et al. 2016). The loss of mitotic checkpoint, chromosomal instability and aneuploidy are not only mechanisms supporting cancer development but, additionally, proliferation and metastasis. All these processes undergo the control of resistin coding by RETN gene. Our results show RETN to be upregulated in NS. Its expression also seemed to increase as the size of the tumour and clinical stage progressed and correlated with poor clinicopathological status. Zhao et al. (2018) demonstrated no correlation of resistin expression with sex, point of diagnosis, smoking or blood type in LUAD unlike in other forms of cancer (Zhao et al. 2018).

Another gene of interest to us is CDCA5 which has also been linked to cell cycle progression and its overexpression in lung adenocarcinoma leading to tumour progression. The CDCA5 gene encoding the sororin protein joins the cohesin complex regulating sister chromatid segregation. In mitosis, sororin undergoes phosphorylation, and protein kinases such as Cdk1/cyclinB and ERK2 regulate its dynamic localisation and function (Bidkhori et al. 2013). Thus, it interacts with the key regulatory factors ERK and cyclin E1 of G1/S mitotic checkpoint and in the S and G2/M phase ensuring accurate separation of sister chromatids by interacting with coherents and CDK1 (Wu et al. 2019). We found the expression of this gene to be 1.5 times higher in CS tumour samples compared to NS tumour samples. The E2 ubiquitin-conjugating enzyme family member UBE2C is overexpressed in 27 human cancers, possibly acting as a proto-oncogene. UBE2C is involved in mitotic cyclin B degradation, promoting cell cycle transition from M to G1 phase. Its overexpression leads to changes in ubiquitination; thus, it could possibly be involved in uncontrolled cell proliferation; additionally it has been identified to be associated with smoking in LUAD (Dastsooz et al. 2019), confirming our results, and drinking habits in PAAD. Oliviero et al. also identified that genes with strong and very strong correlations with UBE2C are involved in the cell cycle process. Of the genes they identified, we here also show that over expression of some of them are associated with smoking status in our study including KIF18B, KIFC1, KIF4A, AURKB, TPX2, PLK1, CDK1, CENPA, CDC20, MYBL2, BUB1B, CCNB1, NCAPG, SKA3, E2F1 and RAD51. Of the related genes include several different roles involved in mitosis (KIF18B, AURKB, TPX2, NCAPG); regulation of cell division and DNA damage response (PLK1); cell division and driving cells into S phase (CDK1); cell cycle progression, survival and differentiation (MYBL2); mitotic checkpoint complex (BUB1B); and controlling the correct exit from meiosis, migration of meiotic spindle (SKA3). They also showed co-expression of transcription factor RAD51 involved in HR and DNA repair, and E2F1 activating genes involved in the G1/S phase, which was already stated, were identified to be differentially expressed in our LUAD cohort and overexpressed in our CS group. When expressed and activated in late G1 and S phase, MYBL2 binds directly to the promoters and transactivates genes expressed in G2/M phase including CCNB1, CDK1 and CCNA2 (Musa et al. 2017), all genes identified in our study (Supplementary Table 6).

SNHG6 is a novel oncogene promoting LUAD cell proliferation, migration, invasion, EMT and cell cycle progression in vitro. Findings by Liang et al. suggest that SNHG6-miR-26a-5p-E2F7 axis is critical for LUAD cell metastasis and EMT and that E2F7 could rescue the inhibitory effects of sh-SNHG6 or miR-26a-5p (Liang et al. 2018). Our results for E2F7 show a 1.5 log fold change, with a higher expression in CS. E2F7 has been shown to repress the expression of genes involved in maintaining genomic stability throughout the cell cycle as well as upon induction of DNA lesions interfering with replication fork progression. It has also been shown to restrict HR through transcriptional repression of RAD51, also a gene identified to be overexpressed in our CS cohort. Briefly, E2F7 DNA damage response function is through transcriptional-dependent regulation of DNA repair genes (Mitxelena et al. 2018). SCGB1A1 seems to be an important defence molecule of the lung, with anti-inflammatory function protecting the lungs from excessive inflammation. It is hypothesised that chronic cigarette smoke exposure shuts down SCGB1A1 through an epigenetic mechanism (Zhu et al. 2015). Our results with a lower expression of 2.7-folds in CS compared to NS, run in parallel with it being one of the top hypermethylated genes in CS compared to NS in their epithelial cells (Zhu et al. 2015). We found PRG4 to be overexpressed in NS almost 6 folds with its role widely studied in modulating inflammatory responses of OA; in cancer, however, its anti-invasive effect has only been investigated in breast. In breast cancer, PRG4 inhibitor demonstrated a higher fold change difference between the early stage of breast cancer compared to controls, suppressing cell migration and invasiveness (Lee et al. 2016).

Multidrug resistance (MDR) in cancer cells could notably attenuate chemotherapy response and increase the likelihood of mortality. The overexpression of ATP-binding cassette (ABC) transporters is a major mechanism presenting MDR, resulting in increased efflux of drugs from cancer cells, thereby decreasing intracellular drug concentration (Sun et al. 2012). Additionally, it is considered that ABC transporters might also be involved in early steps of carcinogenesis (Andersen et al. 2015). Interestingly, adenoma cells present a higher level of ABCC2 gene expression than carcinoma tissues. Furthermore, upregulation of ATP-binding cassette (ABC) transporter genes is overexpressed in bronchial tissue of chronic smokers in comparison to patients following smoking cessation (van der Deen et al. 2005). The present study identified ABCC2 gene expression to be significantly higher, four times, in CS tumour samples than NS. It is worthy of mentioning that the expression of all genes included in ABC transporters cellular pathway were enriched in our study amongst CS tumour samples (Supplementary Table 5). Cancer treatment success is not only associated with MDR mechanisms, but specific genetic mutations may also result in increased drug sensitivity or therapeutic resistance. For instance, EGFR tyrosine kinase inhibitors (TKIs) being the first line treatment of choice for adenocarcinoma patients, T790M mutation is the major mechanism of acquired resistance (Hochmair et al. 2019). Therefore, identifying new potential cancer mutations could contribute to the development of new therapeutic strategies. Here we analysed the top mutated genes in adenocarcinoma tissue and found higher mutation count in CS population overall, with six being statistically significant (Supplementary Table 10). TP53 and EGFR mutations are common and have been identified to alter clinical outcomes, substantially if they were concurrent, activating alternative proliferative pathways that bypass EGFR as a target (Vyse and Huang 2019). The mutation of some of these genes according to literature, EGFR, TP53, ZFHX4 and MUC16, is characterised and correlated with potential cancer therapeutics (Qiu et al. 2019; Aithal et al. 2018, 16; Goodwin 2018), whereas others, RYR2, LRP1B and NAV3, are yet to be investigated (Wang et al. 2019; Carlsson et al. 2012). We identified a key component of DNA homologous repair-RAD51 to be overexpressed in CS. Radiation can induce RAD51 to express and translocate between cytoplasm and nucleus (Zhong et al. 2016). A recent study in NSCLC cell lines established evidence that EGFR status demonstrates distinct radiosensitivity and DSB repair function associated with the expression and subcellular distribution of RAD51. Consequently, defective RAD51-mediated DNA repair could be a consequence of aberrant signal transduction from hypersensitive NSCLCs with EGFR mutation (Zhong et al. 2016), hence there, shedding light on possible effective role as a therapeutic in radiation resistance EGFR cells.

EMT plays an important role in tumour progression and metastatic invasion. Additionally, it has been reported to cause acquired resistance to EGFR-tyrosine kinase inhibitors, and several signalling pathways including Wnt, Notch and integrin are known to activate EMT through transcriptional repression of E-cadherins (Loh et al. 2019). A recent study by Yu et al. (2019) investigated in vitro the role of asbestos-related microenvironment on lung cancer progression, revealing an increase in proliferation and migration of lung cancer cells exposed to media from asbestos-exposed lung fibroblasts, thus highlighting the importance of the interaction of lung fibroblasts with asbestos and lung cancer metastasis (Yu et al. 2019), hence there supporting our hypothesis that the local environment is of importance for the development of specific cancer natures and imprints on the tumour’s local EMT-related features, prompting aggressiveness. Additionally Zhou et al. demonstrated that during endochondral ossification, oestrogen blocked the EMT process from the resting zone (RZ) to the proliferative zone (PZ), whereas androgen promoted MET from PZ to hypertrophic (HZ) (Zhou et al. 2015). Our results regarding EMT analysis indicate the need for further investigation of the role of androgen and oestrogen on the process. Cigarette smoking plays a vital role in promoting EMT and is associated with poor survival, cell migration and invasion in NSCLC through the deregulation of E-cadherin. Our results also show VEGF and TFG-β signalling pathway to be significantly enriched. TGF-β signalling pathway plays a quite important role in EMT (Xu et al. 2009). Fantozzi et al. found that VEGF-A is required for the increased tumorigenicity of cells undergoing EMT, yet not sufficient on its own. An additional angiogenic factor regulated in their expression during EMT is critical for an effective tumour initiation. Their results seem to give perception as to the mechanisms underlying increased tumourigenicity of cancer stem cells (CSC) and cells undergoing EMT in line with similar results in skin and brain tumours (Fantozzi et al. 2014). We identified collagen type X alpha 1 (COL10A1), a member of the collagen family, to be enriched in skeletal development. In gastric cancer (GC), COL10A1 was found to possibly act as a potent co-stimulator of TGF-β1-induced EMT, where its expression was vastly elevated during GC development and progression. It also promoted cell migration and invasion via upregulation of the TGF- β1-SOX9 axis, while its knockdown slowed TGFβ1-induced EMT (Li et al. 2018). Thus similar to GC, it is worthy of elucidating the role of COL10A1 in LUAD as it could possibly serve as a biomarker and therapeutic target.

TS is an important risk for lung and head and neck squamous carcinoma (HNSCC), where benzo[a]pyrene (B[a]P) in cigarette smoke shown evident to induce EMT-related gene expression, impacting multiple transcriptional changes in lung cancer cells. Of our statistically significant EMT genes, TCF3 was significant for both smoking subgroups. TCF3 is involved in the regulation of Wnt signalling pathway in embryogenesis and development (Li et al. 2017). In females, it was indeed identified to show response to smoking stimulus along with ICOSLG, the vascular adhesion molecule 1 VCAM1 in B cells, emphasising the significance of B cells in the aetiology of smoking-induced diseases involving interactions between immune cells and vascular endothelium (Pan et al. 2010). However, in their study, they found its expression to be downregulated in female smokers contrary to our results where we did not separated both sexes. This is very much likely attributed to sex as well as the effect of smoking, accentuating the influence of sex on future exploratory analysis.

Yoshino et al. showed that TS promoted EMT by upregulating TWIST (Yoshino et al. 2007). Further studies by Zhu et al. investigating TWIST interactions with gender and smoking in patient survival revealed that both smoking and gender could modify the effects on the risk of death in HNSCC patients. Their analysis revealed that TWIST1 links to poor survival obvious in either males or smokers than in either female or non-smokers (Zhu et al. 2017). Additionally, Stoyianni et al. used cancers of unknown primary (CUP) as a model for metastatic dissemination to study the prognostic significance of EMT. The presence of EMT phenotype significantly correlated with male gender, anaplasia and visceral metastases with strong adverse prognostic significance on patient survival. They also showed as others already did a possible role for NOTCH2 and 3 activation in regard to the induction of EMT phenotype (Stoyianni et al. 2012). Since there is almost no information regarding smoking and sex on the expression of EMT-related genes (except for TWIST1/2), we wanted to observe gender and smoking-induced differences correlating to the expression of EMT-related genes that could possibly allow unravelling of potential biomarkers for personalised therapy. Joint analysis of 144 EMT marker genes accounting for sex and smoking status allowed us to identify 4 statistically significant genes in all four groups (FN, MN, FC, MC) of which one was related to the NOTCH pathway (Fig. 3). Statistically significant differences between FN and FC groups included HEY2 (p = 0.048), OLFM1 (p < 0.05) and STRAP (p = 0.0031); for FN vs MC included OLFM1 (p < 0.05) and STRAP (p = 0.0001); and, finally, for MN and MC groups was SFRP1 (p = 0.0391). Overexpression of OLFM1 was found to result in smaller metastatic and pulmonary nodules and reported as a potential target for the diagnosis and treatment of LUAD (W. Shi et al. 2016). SFRP1 which functions as modulators of Wnt signalling is a favourable predictive and prognostic biomarker for prostate cancer (Zheng et al. 2015). While both STRAP and HEY2 are involved with NOTCH signalling pathway, STRAP promoted the stemness of HCC through epigenetic regulation (Jin et al. 2017). In NSCLC, lncRNA PRNCR1 upregulates HEY2 promoting tumour progression by competitively binding miR-448 (Cheng et al. 2018).

The National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines in Oncology proposed that a patient’s sex to be a prognostic factor in LUSC, LUAD, HNSC, KIRC and KIRP. This allows to provide an overview of molecular differences between cancer patients of both sexes with distinct patterns of the effect of sex on the patient group in each cancer type. Sex-biased genes include EGFR—the key therapeutic target in LUAD—with female-biased mRNA expression possibly contributing to a higher response rate in female patients to their counterparts (Yuan et al. 2016). Our results for GSEA analysis (Supplementary file 3) for female and male LUAD patients regardless of smoking status seem to contrast those of Yang et al. in two glioblastoma (GBM) studies, where they found cell cycle signalling pathways to be critical determinants of survival in males and integrin signalling for females (Yang et al. 2019, 2017) in both molecular function and biological processes gene sets. It is interesting to find out whether these differences seem to be attributed to the effect of tobacco smoking or due to an altered mechanism in lung adenocarcinoma metabolism.

Also in relation to sex differences, we found a few of our DEGs to be overexpressed in female NS including PLA2G2A and PRG4 while REEP1 and PLA2G2A in male CS (Supplementary Table 8). Underexpressed genes also include GINS4 in female NS and WIF1 and PRG4 in female CS. PLA2G2A a member of the subfamily of group II sPLA2 was overexpressed in our NS samples as part of the GnRH signalling pathway members, in both CS and NS seems to have a higher log fold change for females to males. Proteoglycan 4 (PRG4) appears to have an inversely proportional relationship between females and males in our CS and NS cohort. PRG4 also seems to have a unique ability to block NFkB activation downstream from two different receptor families (CD44 and TLRs), where it might possibly be an anti-inflammatory agent (Alquraini et al. 2015). This could possibly explain its higher expression in male CS tumour samples as according to their habits they seem to use all tobacco products more often compared to females (Higgins et al. 2015). Also of interest to us were GINS4 and WIF1 being overexpressed in males CS compared to females. GINS4 has an essential role in DNA replication initiation and progression of DNA replication forks and the DSB repair via break induced replication (Kamada et al. 2007). While WIF1 is an inhibitor of the Wnt/β-catenin pathways, its overexpression in male CS could possibly play a role in the regulation of cell proliferation, metastasis and differentiation. That is because in osteosarcoma, WIF1 knockdown rescued proliferation, migration and invasion of osteosarcoma cells (Cai et al. 2019). REEP1/2 has been shown to be selectively upregulated in response to DNA damage, being downstream targets of p53. Zheng el al. showed that p53 indirectly regulates endoplasmic (ER)-mitochondria contacts and Ca2+ via upregulating REEP1/2, EI24 and VDAC2. REEP1/2 are accessory elements helping ER to expand and facilitate EI24-mediated apoptosis and overexpression of REEP1 but not REEP2, promoting DNA-damage induced apoptosis (Zheng et al. 2018). This possibly explains the increased expression levels of REEP1 in our female cohort corresponding with the hypothesised better overall survival compared to males.

Differences in the immune system and genetic architecture, whether being sex-specific or due to smoking status, will allow unravelling of epigenetic marks and shedding light on the mechanisms potentially providing scientific evidence for interventions, cancer surveillance and treatment. Pharmacokinetics and pharmacodynamics differ in individuals in terms of toxicity and efficacy; therefore, sex influences pathophysiology, clinical signs and treatment and response in cancer, making it vital to further investigate outcome implications based on potential sex differences in cancer, enhancing and personalising therapeutics. In the future, pharmacogenomic differences between sexes could have a noteworthy role in chemotherapy. An understanding of the significance of sex-specific genetics and epigenetic architecture of human disease could help reveal the existence of sex-specific protective mechanisms that could be utilised in novel treatments (Gabory et al. 2013).

Nevertheless, it should be mentioned that one of the caveats of our study was the lack of control and normal samples from LUAD patients, to help advance our understanding in determining biomarkers that are differentially expressed based on the aspects of tumour and sex of the patient due to their unavailability. Another limitation includes the inability to divide the patients into subgroups according to their age/stage due to the small number of cases meaning that prospective study would have had an insufficient number for the analysis to allow a reliable investigation. We believe that it is unfortunate that patients of certain subgroups including the elderly and those with poor performance are not considered in advanced personalised therapy as they do require specific treatment considerations in terms of targeted therapies.

Conclusion

Here we constructed a gene co-expression network and identified 9 modules with distinct expression profiles of LUAD in CS and NS patients. One thousand one hundred fifty genes showed a visible differentiation between the subgroups. These genes were primarily involved in cell cycle, replication and proliferation. Twenty-five genes were also identified with a twofold or higher between the smoking subgroups proving some clues on the effect of smoking prevalence on LUAD. We found a contradictory pattern for EMT members between NS and CS that could be a result from direct nicotine exposure or possibly other gene interactions giving rise to a different outcome due to mutation, treatment or gender differences. Our study highlights significant differences in male and female patients, identifying new candidate genes such as PLA2G2A and PRG4 that could enable a better understanding of sex-specific differences in the course of lung carcinogenesis. A thorough analysis of the biochemical processes associated with pathogenesis at the molecular level of LUAD assessing the impact of tobacco smoke or the lack of will enable a better search for “driver” genes contributing to its initiation or progression. An early on identification of prognostic molecular markers relating to tumour aggressiveness is useful since following resection, tumour metastasis is a fundamental barrier to long-term survival. Prospective studies need to include larger numbers of patients to accrue sufficient numbers of cases reliably investigating the risks associated with other exposures.