Introduction

Welding processes join rigid pieces of material (usually metal) at their contact interface by using high temperatures to cause fusion. The process can be hazardous because it exposes the operator to highly toxic fumes and to radiant energy1. The International Agency for Research on Cancer (IARC) has classified welding fumes (WFs) and UV radiation from welding as Group 1 carcinogens2. WFs are mainly composed of metallic oxides, silicates and fluorides, including those of magnesium, manganese, zinc, aluminum, beryllium, copper, chromium, cadmium, lead, iron, nickel and vanadium3.

Welders who inhale WFs in large quantities over a long period run a significantly elevated risk of developing certain types of cancer1,2. These diseases involve uncontrolled, neoplastic growth of cancer cells that arises after the accumulation of genomic mutations, but cancer behaviour and growth are also strongly affected by genetic factors and by the environmental factors to which the sufferer is exposed4. Environmental factors include inhaled toxic fumes that affect the lungs and enter the circulation to reach many tissues, where they can alter gene expression in cancer cells and thereby their behaviour, survival, growth and invasiveness. Thus, influences such as WF inhalation affect the progression of many types of cancer, including those examined in this study, namely colorectal cancer (CC), prostate cancer (PC), lung cancer (LC) and gastric cancer (GC), which are among the cancers most commonly linked with WF exposure5,6,7. The aim of this study is therefore to identify mechanisms through which WFs may increase cancer incidence.

LC is one of the most lethal types of cancer and is a leading cause of death globally1,2,8. WFs contain toxic metallic oxides and silicates that directly affect the sensitive tissues of the lung when inhaled; this route of exposure makes LC the cancer with the highest risk for welders9. CC arises in the colon and the rectum and has a typical 5-year survival rate of about 60%. It damages the colon or rectum through uncontrolled and invasive cell growth10. Iron, aluminum and magnesium oxides in WFs are known to affect the incidence of CC9, although the mechanism is not well understood. PC affects the prostate, the gland that produces seminal fluid and controls the transport of sperm11. Nitrogen oxides, carbon dioxide and phosgene, all of which are found in WFs, are risk factors for prostate neoplasms9. GC (gastric or stomach cancer)12 is linked to exposure to nickel, beryllium and cobalt oxides, which are all present in WFs9.

In this study, we developed a systematic and quantitative network-based approach to investigate the effects of WFs on gene expression and to explore how these effects may promote the incidence and progression of cancers by affecting pathways and pathway genes that are also altered in these cancers. We therefore compared the gene expression effects of WF exposure with the altered patterns of gene expression seen in CC, PC, LC and GC. This involved first analyzing differentially expressed gene profiles, then filtering these genes through gene-disease association networks, signaling and ontological pathways, and protein-protein interaction networks. We also assessed the importance of the genes and pathways thus identified by using the gold benchmark databases dbGaP and OMIM to find evidence supporting the involvement of these genes in pathological processes such as cancer development. Moreover, we analysed patient survival and its association with the genes that are dysregulated in both the WF-exposed tissue and the four types of cancer. The influence of these identified genes on cancer patient survival provides evidence for their involvement in the effects of WFs on cancer progression.

Methods and Materials

Overview of the analytical approach

We applied an analytical approach to identify links between WF exposure and the incidence of the cancers by employing selected microarray datasets, as shown in the block diagram in Fig. 1. This quantitative approach takes the genes differentially expressed in WF exposure and identifies those that are also among the differentially expressed genes observed in each cancer study. These shared differentially expressed genes were then used to construct a gene-disease (diseasome) association network, to identify signaling and ontological pathways, to build a protein-protein interaction (PPI) network and to perform survival function analysis. The approach also used the gold benchmark databases OMIM and dbGaP to validate the genes and pathways identified in our study as showing possible disease associations.

Figure 1

Flow-diagram of the analytical approach used in this study.

Datasets employed in this study

To identify the gene expression dysregulation that is common to WFs and the four types of cancer under investigation, we analyzed gene expression microarray datasets from the National Center for Biotechnology Information (NCBI). We examined five different microarray datasets with accession numbers GSE62384, GSE25071, GSE55945, GSE10072 and GSE2685 13,14,15,16,17. Dataset GSE62384 was produced using human upper airway epithelial cells (RPMI 2650) exposed to spark-generated WFs. These data were generated from cells exposed to WFs for 6 hours continuously at low (85 μg/m³) and high (760 μg/m³) concentrations. The CC dataset (GSE25071) consists of microarray data taken from 17 patients with late-onset CC (mean age 79 years) and 24 patients with early-onset CC (mean age 43 years). The PC dataset (GSE55945) contains microarray data on RNA taken from radical prostatectomy tissue from prostate cancer patients at the Beth Israel Deaconess Medical Center, comparing tissue from PC sufferers (Gleason score 6 or 7) with normal prostate tissue. The LC dataset (GSE10072) contains microarray data comparing normal lung tissue and lung adenocarcinoma tissue collected from 26 former smokers, 20 non-smokers (who never smoked) and 28 current smokers; gene expression data are reported by comparing 49 non-tumor and 58 tumor lung tissues. The GC dataset (GSE2685) contains microarray data from 22 gastric cancer and 8 non-cancerous gastric tissues.

To analyze the association between patient survival and the altered genes that are common to WFs and the four types of cancer under investigation, we retrieved clinical and RNAseq data for CC, PC, LC and GC from cBioPortal 18,19. The clinical dataset of CC (Colorectal Adenocarcinoma, TCGA, Nature 2012) contains 585 samples with 24 features. The RNAseq gene expression data of CC include 224 cases with 224 mutated genes20. The clinical dataset of PC (Prostate Adenocarcinoma, TCGA, Cell 2015) includes 333 samples with 86 features. The RNAseq gene expression data of PC include 333 cases with 333 mutated genes21. The LC clinical dataset (Lung Adenocarcinoma, TCGA, PanCancer Atlas) consists of 566 samples with 81 features. The RNAseq gene expression data of LC include 510 cases with 566 genes22. The clinical dataset of GC (Stomach Adenocarcinoma, TCGA, Nature 2014) contains 295 samples with 52 features. The RNAseq gene expression data of GC include 265 cases with 295 mutated genes23. We employed six clinical factors (ethnicity, anatomical site of cancer, histological grade of cancer, primary tumour site, and neoplasm status with tumour) in the survival analysis of the altered genes that are common to WFs and the four types of cancer under investigation. A summary of the datasets is given in Tables 1 and 2.

Table 1 Summarized description of the datasets used for gene expression and enrichment analysis.
Table 2 Summarized description of the datasets used for survival prediction.

Analysis methods

Microarray-based gene expression analysis is a global and sensitive method to identify and quantify possible molecular mechanisms that underlie human disorders24. We used this approach to analyze the gene expression profiles of CC, PC, LC and GC in order to find the genetic effects of WFs that may influence the development of these cancers. To allow comparison of mRNA expression data generated on different platforms, and to avoid complications arising from the different experimental systems employed in the original studies, we normalized the gene expression data by Z-score transformation (Zij) for each cancer tissue gene expression profile using \({Z}_{ij}=\frac{{g}_{ij}-mean({g}_{i})}{SD({g}_{i})}\), where SD denotes the standard deviation and gij denotes the expression value of gene i in sample j. After this transformation, gene expression values of different diseases measured on different platforms can be compared directly. We applied unpaired t-tests to find the genes differentially expressed in each disease relative to control data and selected the significantly dysregulated genes, using as thresholds an absolute log2 fold change of at least 1 and a t-test p-value \(\le 1\times {10}^{-2}\).

We employed the neighborhood-based benchmark and the multilayer topological methods to find gene-disease associations. From these associations we constructed a gene-disease network (GDN) in which each node represents either a gene or a disease; this network can also be regarded as a bipartite graph. Two diseases are connected in the GDN only if they share at least one significantly dysregulated gene. Let \(D\) be a specific set of diseases and G a set of dysregulated genes; gene-disease association analysis attempts to determine whether gene \(g\in G\) is associated with disease \(d\in D\). If \({G}_{i}\) and \({G}_{j}\) are the sets of significantly dysregulated genes associated with diseases Di and Dj respectively, then the number of shared dysregulated genes \(({n}_{ij}^{g})\) associated with both Di and Dj is as follows25:

$${n}_{ij}^{g}=N({G}_{i}\cap {G}_{j})$$
(1)

The common-neighbour score is based on the Jaccard coefficient method, where the edge prediction score for the node pair (i, j) is given by26:

$$E(i,j)=\frac{N({G}_{i}\cap {G}_{j})}{N({G}_{i}\cup {G}_{j})}$$
(2)

where G is the set of nodes and E is the set of all edges. We used the R software packages “comoR”27 and “POGO”28 to cross-check the gene-disease associations.
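Equations (1) and (2) can be illustrated with a minimal Python sketch. It is only an illustration of the two formulas, not the comoR or POGO implementation; the function names and the toy gene identifiers are hypothetical placeholders and are not taken from the study's gene lists.

```python
def shared_gene_count(genes_i, genes_j):
    """Eq. (1): number of dysregulated genes shared by two conditions."""
    return len(set(genes_i) & set(genes_j))


def jaccard_score(genes_i, genes_j):
    """Eq. (2): shared genes divided by the size of the union of both sets."""
    a, b = set(genes_i), set(genes_j)
    union = a | b
    if not union:
        # Two empty gene sets share nothing; return 0 rather than divide by 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy gene sets (placeholders, not results from the study).
wf_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
cancer_genes = {"GENE_C", "GENE_D", "GENE_E"}

n_shared = shared_gene_count(wf_genes, cancer_genes)  # 2 shared genes
score = jaccard_score(wf_genes, cancer_genes)         # 2 / 5
```

In a gene-disease network built this way, an edge between two conditions would be drawn only when `n_shared` is at least 1, and the Jaccard score could serve as the edge weight.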

To investigate how molecular determinants from the WF-exposed tissues relate to gene expression alterations in the cancers, we performed pathway and gene ontology analysis using Enrichr 29,30. We used the KEGG, WikiPathways, Reactome and BioCarta databases for signaling pathway analysis31,32,33,34, and the GO Biological Process and Human Phenotype Ontology databases for ontological analysis35,36. We also constructed a protein-protein interaction sub-network for each cancer type using the STRING database, a biological database and web resource of known and predicted protein-protein interactions37. Furthermore, we examined the validity of our findings by employing two gold benchmark databases, OMIM and dbGaP.

To determine the association between patient survival and the altered genes that are common to WFs and the four types of cancer under investigation, we employed the Cox proportional hazards (PH) model for univariate and multivariate analysis38,39. The Cox PH model can be written as follows:

$$h(t|{X}_{i})={h}_{0}(t)exp({\beta }^{T}{X}_{i})$$
(3)

Here \(h(t|{X}_{i})\) is the hazard function for subject \(i\) conditioned on the covariate vector \({X}_{i}\), \({h}_{0}(t)\) is the baseline hazard function, which is independent of the covariates, and β is the vector of regression coefficients corresponding to the covariates. We calculated the hazard ratio (HR) from the estimated regression coefficients of the fitted Cox PH model to determine whether a specific covariate affects patient survival. The HR for a covariate \({x}_{r}\) is given by exp \(({\beta }_{r})\); that is, the HR for any covariate is obtained by applying the exponential function to the corresponding coefficient \(({\beta }_{r})\).
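As a small worked example of the relation HR = exp(β), the sketch below converts a coefficient into a hazard ratio. The coefficient and standard error are hypothetical values chosen for illustration, and the approximate 95% Wald interval exp(β ± 1.96·SE) is a standard addition that is not described in the text above.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio for one covariate: HR = exp(beta)."""
    return math.exp(beta)


def hazard_ratio_ci(beta, se, z=1.96):
    """Approximate 95% Wald interval for the HR: exp(beta -/+ z * se)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical fitted values (not taken from Tables 7-10).
beta_r = 0.693  # estimated Cox coefficient for an altered-expression indicator
se_r = 0.25     # its standard error

hr = hazard_ratio(beta_r)                  # about 2: roughly doubled hazard
low, high = hazard_ratio_ci(beta_r, se_r)  # interval around the HR
```

An HR above 1 indicates a higher hazard (poorer survival) in the group coded 1 for that covariate, an HR below 1 a lower hazard, and β = 0 corresponds to HR = 1 (no effect).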

The survival function can be estimated with the product-limit (PL) estimator40, which is defined as follows:

$$\hat{S}({t}_{j})=\mathop{\prod }\limits_{i=1}^{j}\,\left(1-\frac{{d}_{i}}{{n}_{i}}\right)$$
(4)

Here \(\hat{S}({t}_{j})\) is the estimated survival function at time tj, di is the number of events that occurred at time ti, and ni is the number of subjects at risk at ti. After the survival function has been estimated, two or more groups can be compared using a log-rank test. We used log-rank tests to identify the genes for which patient survival time differs most significantly between the altered and normal (non-altered) gene expression groups. The null and alternative hypotheses for this test can be written as follows:

$${H}_{0}:{S}_{altered}(t)={S}_{normal}(t)$$
(5)
$${H}_{A}:{S}_{altered}(t)\ne {S}_{normal}(t)$$
(6)

Here H0 states that the survival functions are the same for the altered and normal groups of a gene, and HA states that the survival functions differ between these two groups.
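The product-limit estimator of Eq. (4) can be sketched in a few lines of plain Python. This is a minimal illustration under simple assumptions (right-censored data, event indicator 1 for an observed event and 0 for censoring); the function name and the toy follow-up times are hypothetical.

```python
def product_limit(times, events):
    """Return [(t_j, S_hat(t_j))] at each distinct observed event time.

    times  : follow-up time of each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # n_i: subjects still under observation just before time t
        n_at_risk = sum(1 for u in times if u >= t)
        # d_i: events observed exactly at time t
        d_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - d_events / n_at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: six subjects, two of them censored.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = product_limit(times, events)
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is why the estimate only changes at observed event times.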

If the survival function of a specific gene differs between the altered and normal groups, we include that gene in the combined Cox PH model. This approach is an efficient way to learn the effect of a specific gene of interest on patient survival in the presence of the clinical factors.
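The screening step described above can be sketched as a two-group log-rank test written with the Python standard library only. This is an illustrative sketch, not the software used in the study: the function name and toy data are hypothetical, and the 0.05 cut-off used to decide inclusion in the combined model is an assumption made for this example. The p-value uses the chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x/2)).

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e == 1}
        | {t for t, e in zip(times_b, events_b) if e == 1}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)  # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)  # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected events in A under H0
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    statistic = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical toy data: the "altered" group fails earlier than the "normal" group.
altered_times, altered_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
normal_times, normal_events = [6, 7, 8, 9, 10], [1, 1, 1, 1, 1]

stat, p = logrank_test(altered_times, altered_events, normal_times, normal_events)
include_in_combined_model = p <= 0.05  # assumed cut-off for this sketch
```

A gene passing this screen would then be added as a covariate alongside the clinical factors in the combined Cox PH model.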

Results

Gene expression analysis

To identify and investigate the gene expression effects of WFs that may influence the behaviour of various types of cancer, we analyzed the gene expression microarray data collected from the National Center for Biotechnology Information (NCBI). With thresholds of adjusted \(p\le 0.01\) and \(|logFC|\ge 1\), we observed 903 differentially expressed genes in WF exposure, of which 392 were up-regulated and 511 down-regulated relative to controls. Similarly, the statistical analysis identified the most significant genes with altered expression in each cancer type. The number of differentially expressed genes we identified was 939 (503 up and 436 down) in CC, 553 (323 up and 230 down) in PC, 890 (673 up and 217 down) in LC and 691 (463 up and 228 down) in GC. We also performed a cross-comparative analysis to find the genes with altered expression that are common to WFs and each cancer type. We found that WF-treated cells share differentially expressed genes with CC (36 dysregulated genes), PC (13 genes), LC (25 genes) and GC (17 genes). To identify significant associations between these cancer types and the effects of WF exposure, we constructed two separate gene-disease association networks, one for up-regulated and one for down-regulated genes, using Cytoscape plugins41 and centered on the WF data, as shown in Fig. 2(a,b). Two diseases are considered associated only if they have at least one differentially expressed gene in common. Notably, two significant genes, C2orf88 and IGFBP5, were differentially expressed in WF exposure, CC and PC; three significant genes, FCGBP, IQGAP2 and HPGD, are common to WF exposure, CC and GC; and one gene, FGFR3, is commonly dysregulated in WF exposure, CC and LC.
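The normalisation and thresholding behind these gene counts can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, gene identifiers and numbers are hypothetical, the use of the sample standard deviation in the Z-score is an assumption, and the log2 fold changes and adjusted p-values are taken as precomputed inputs (for example from unpaired t-tests).

```python
import statistics


def zscore_gene(values):
    """Z-score transform of one gene's expression values across samples."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption of this sketch
    return [(v - mean) / sd for v in values]


def select_degs(stats, lfc_cutoff=1.0, p_cutoff=0.01):
    """Split genes into up- and down-regulated lists by the two thresholds.

    stats maps gene -> (log2 fold change, adjusted p-value).
    """
    up, down = [], []
    for gene, (log2_fc, adj_p) in stats.items():
        if adj_p <= p_cutoff and abs(log2_fc) >= lfc_cutoff:
            (up if log2_fc > 0 else down).append(gene)
    return sorted(up), sorted(down)


# Hypothetical expression values of one gene in four samples.
expression = [5.1, 6.3, 4.8, 7.0]
z = zscore_gene(expression)

# Hypothetical per-gene statistics (placeholders, not study results).
gene_stats = {
    "GENE_A": (1.8, 0.002),   # passes both thresholds, up-regulated
    "GENE_B": (-1.2, 0.009),  # passes both thresholds, down-regulated
    "GENE_C": (0.4, 0.001),   # significant but fold change too small
    "GENE_D": (2.5, 0.2),     # large fold change but not significant
}
up_genes, down_genes = select_degs(gene_stats)
```

Intersecting the resulting up- and down-regulated lists for WF exposure with those of each cancer type gives the shared genes that populate the two gene-disease networks.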

Figure 2

(a) Up-regulated gene-disease association network of welding fumes (WFs) exposure with colorectal cancer (CC), prostate cancer (PC), lung cancer (LC) and gastric cancer (GC). Octagon-shaped red-colored nodes represent different cancer types and sky-blue colored round-shaped nodes represent commonly up-regulated genes for WFs with the cancers examined. (b) Down-regulated gene-disease association network of welding fumes (WFs) exposure with colorectal cancer (CC), prostate cancer (PC), lung cancer (LC) and gastric cancer (GC). Octagon-shaped red-colored nodes represent different cancer types and dark-cyan colored round-shaped nodes represent commonly down-regulated genes for WFs exposure with the different types of cancer examined. (c) Diseasome network showing validation of our study. Red-colored octagon-shaped nodes represent different cancer types, pink-colored octagon-shaped nodes represent the four selected cancer types and round-shaped sky-blue colored nodes represent differentially expressed genes for WFs exposure. A link is placed between a disease and a gene if mutations in that gene lead to the specific disease.

Pathway and functional association analysis

Pathways consist of a series of molecular-level interactions in a cell and are key to understanding the internal changes of an organism. Pathway-based analysis can be used to identify molecular or biological mechanisms that underlie the development of complex diseases42,43. We analyzed pathways of the genes with commonly altered expression in WF exposure and in the cancers using Enrichr, a comprehensive web-based gene set enrichment analysis tool29,44. Signaling pathways of the genes commonly altered in WF exposure and each type of cancer examined were analyzed using four globally recognized databases: KEGG, WikiPathways, Reactome and BioCarta. We considered signaling pathways from these four databases and identified the most significant signaling pathways for each cancer type after applying several statistical analyses. Notably, we found 6, 7, 5 and 7 signaling pathways associated with CC, PC, LC and GC, respectively, as shown in Fig. 3.
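To give a sense of the kind of statistical test that underlies gene set enrichment, the sketch below computes a one-sided over-representation p-value from the hypergeometric distribution, which is equivalent to a one-sided Fisher's exact test. It is an illustration only and not Enrichr's implementation; the function name, background size and gene counts are hypothetical assumptions.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn from n_background genes,
    of which n_pathway belong to the pathway (hypergeometric upper tail)."""
    total = comb(n_background, n_query)
    upper = min(n_pathway, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 36 shared genes, a pathway of 100 genes,
# a background of 20,000 genes, and 3 shared genes in the pathway.
p_enrich = enrichment_p_value(20000, 100, 36, 3)

# Small case that can be checked by hand: (36 + 4) / 120 = 1/3.
p_small = enrichment_p_value(10, 4, 3, 2)
```

In practice such p-values are computed for every pathway in a database and then corrected for multiple testing before the most significant pathways are reported.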

Figure 3

Pathway analysis for identifying the most significant signaling pathways common to the WF exposed cells and the cancer types revealed by the common differentially expressed genes. These include significant signaling pathways common to WFs exposed cells and (a) CC (b) PC (c) LC and (d) GC.

Gene ontological analysis

The Gene Ontology (GO) is a universal conceptual model for representing gene functions and their relationships in the domain of gene regulation. It is constantly expanded as biological knowledge accumulates, so that it covers the regulation of gene functions and the relationships between these functions in terms of ontology classes and semantic relations between classes45. We analyzed ontological pathways of the genes with commonly altered expression in WF-exposed cells and each cancer type using two recognized databases, GO Biological Process and Human Phenotype Ontology. We considered ontological pathways from these two databases and identified the most significant ontological pathways for each cancer type after applying several statistical analyses. We found 10, 11, 14 and 14 ontological pathways associated with CC, PC, LC and GC, respectively, as shown in Tables 3–6.

Table 3 The most significant ontological pathways common to the WFs exposed cells and CC.
Table 4 The most significant ontological pathways common to the WFs exposed cells and PC.
Table 5 The most significant ontological pathways common to the WFs exposed cells and LC.
Table 6 The most significant ontological pathways common to the WFs exposed cells and GC.

Protein-protein interaction analysis

A protein-protein interaction network describes the binding of proteins in the cell that arises from biochemical or complex biological functions. Protein-protein interactions are essential for understanding cell physiology in health and disease states. We constructed and analyzed protein-protein interaction networks of the genes with significantly altered expression in each cancer type using the STRING database. We clustered the protein-protein interactions of the cancer types into four different groups, as shown in Fig. 4.

Figure 4

Protein-protein interaction network of the four types of cancer using STRING.

Survival analysis

Patient survival analysis using both gene expression and clinical data is widely used in research to predict and characterize gene signatures in cancer46. In this study, we estimated the survival function for the altered and normal groups of the significant genes that are common to WFs and the four types of cancer under investigation by employing the Cox PH model and the PL estimator. We fitted the Cox PH regression model in both univariate and multivariate analyses. The significant genes of the four selected cancers, with estimated coefficients (β), hazard ratios (HR) and p-values from those analyses, are shown in Tables 7–10. After these analyses we selected the most significant genes for the four types of cancer using a p-value threshold of \(p\le 0.05\). The survival curves of the most significant genes, comparing altered and normal groups, were obtained using the PL estimator, as shown in Fig. 5. Note that, in Fig. 5, patients with altered expression of these genes show lower survival than the normal group.

Table 7 β coefficient, hazard ratio and p-values in univariate, multivariate and combined models of the identified genes that are common between WFs and CC.
Table 8 β coefficient, hazard ratio and p-values in univariate, multivariate and combined models of the identified genes that are common between WFs and PC.
Table 9 β coefficient, hazard ratio and p-values in univariate, multivariate and combined models of the identified genes that are common between WFs and LC.
Table 10 β coefficient, hazard ratio and p-values in univariate, multivariate and combined models of the identified genes that are common between WFs and GC.
Figure 5

Survival functions for the altered and normal groups of the most significant genes that are common to WFs and the four types of cancer under investigation. These include significant genes common to WFs exposed cells and CC (a–e), PC (f,g), LC (h,i) and GC (j–l). Here, the cyan colored line in the survival graphs indicates the altered and the red indicates the normal gene expression group.

Discussion

In this study we investigated how WF exposure may influence a number of types of cancer whose development and growth are greater with exposure to WFs or the components of WFs. We compared the gene expression alterations that result from WF exposure in cells with those of the genes that have dysregulated expression in several cancer types. The idea behind this is similar to studies of comorbidities, where dysregulated genes (or, more usually, gene pathways) that are common to two diseases give clues to how those diseases interact when they co-occur in the same individual, even if the reason for the altered expression of individual genes or pathways is unclear. Thus, genes or gene pathways altered both in response to WF exposure and in the cancers of interest may be means by which WF exposure encourages those cancers to develop. Note that WFs include components such as metal fumes that are absorbed by the lungs into the bloodstream and so reach many tissues around the body. Many of these fumes are carcinogenic, but cancer initiation is only one of a number of stages of cancer development and progression, and welders commonly have regular exposure to fumes over long periods. Unlike in other morbidities, some altered gene expression may arise in individual cancer cells due to mutations that affect the survival of those cells; if such altered expression is then detected in whole cancer tissue across many individuals (as in our studies), the alteration may be affecting pathways that encourage survival and growth. We have therefore applied a systematic approach to identify pathways through which WFs may affect cancer behaviour.

For our analysis we employed gene regulation analysis, gene-disease association networks, signaling and ontological pathways, and protein-protein interaction networks. To identify pathways and genes that are important in WF interactions with the cellular processes that influence cancer progression, we examined gene expression microarray data from WF-exposed cells, CC, PC, LC and GC, each with control datasets. Simple gene expression comparisons identified a large number of significant genes that were commonly dysregulated between WF exposure and the cancer profiles. These shared dysregulated genes suggest that WF exposure may cause gene expression changes that could affect the behaviour of cancers. It should be noted that cancer transcriptome datasets, such as those employed here, contain transcripts from both cancer cells and the supporting stromal cells found in the tumors themselves. WFs may therefore exert their effects on cancers either indirectly (through the tumor stroma) or directly on the cancer cells themselves.

The two separate gene-disease association networks that we constructed for up- and down-regulated genes showed strong evidence that WFs may indeed influence these cancers, as indicated in Fig. 2(a,b). Pathway-based analysis is a technique for better understanding the molecular or biological mechanisms underlying complex diseases by determining the common pathways through which a stimulus (such as WFs) may influence cells of interest. We identified significant signaling and ontological pathways of the commonly dysregulated genes of each cancer. These pathways indicate how WFs may affect these cancer types. Similarly, the protein-protein interaction sub-networks of the commonly altered genes suggest that WFs affect several types of cancer. Note that if a pathway is a conduit for the effects of an important risk factor for a disease, this points to that pathway being particularly important to the pathogenesis of the disease, and reducing that pathway's effects could be a way to attack disease progression itself. It should be noted that these findings only point to possible ways in which WF exposure may affect the cancers and cannot prove causation. However, when we investigated whether the gene expression patterns that we observed could be associated with reduced patient survival (pointing to the importance of those gene expression levels, either directly or indirectly), this is what we found for several of the significant genes that are common to WF exposure and the cancer profiles under investigation, as shown in Fig. 5.

It should be noted that the datasets employ a number of different cell types, which is commonly the case in this type of study. While gene expression patterns are, by definition, different in different cell types, here we were only concerned with expression alterations; certain responses to WFs may not occur in all cells so, while our approach cannot identify all pathways affected by WFs in nascent tumour cells, it will find some. Indeed, our data provide evidence to suggest the involvement of a number of genes in cancer behaviours that are linked to the noxious effects of WFs on cancer.

We used the gold benchmark databases OMIM and dbGaP to cross-check the validity of our findings and found that there were some shared genes between WF exposure and the cancer types, as shown in Fig. 2(c). For validation purposes, we collected diseases and their associated genes from the dbGaP, OMIM Disease and OMIM Expanded databases using the differentially expressed genes of WFs. After several steps of statistical analysis we selected only cancer-related diseases. Interestingly, we found our four selected cancers among the list of cancers collected from these databases, as shown in Fig. 2(c).

Moreover, we found that the genes we identified in Fig. 2(c) had been shown in other studies to be associated with disease progression in cancers. Specifically, Vázquez-Arreguín K. et al., Cybulski C. et al. and Wang L. et al. showed RAB4B, CHEK2 and FOS to be associated with CC incidence47,48,49,50; Biswas S. et al. found a link between TGFBR2 and CC51. Lijovic M. et al. showed CD82 to be linked to PC incidence52; Wang Y. et al. showed an association between CHEK2 and PC progression53; Ouyang X. et al. identified a link between FOS and PC54; Gruosso T. et al. showed MAP3K8 to be associated with LC55; Vallejo A. et al. found a link between FOS and LC incidence56; Yuan S. et al. showed an association between GPC5 and LC progression57. Kim CJ. et al. found MUTYH to be associated with GC incidence58; Myllykangas S. found an association between FOS and GC59; Teodorczyk U. et al. found CHEK2 to be linked to GC progression60. These findings therefore suggest that WFs may have a strong interaction with CC, PC, LC and GC.

Conclusions

In this study, we considered gene expression microarray data from WF exposure, CC, PC, LC, GC and control datasets to investigate the genetic links between WF exposure and its effects on cancers. We analyzed gene expression, constructed gene-disease association networks, identified signaling and ontological pathways, and analyzed protein-protein interaction networks and survival functions for WF-exposed cells and the cancers. The outcome of our study indicates that WFs can exert a strong influence on cancers. This kind of study will be useful for making more accurate disease predictions and for identifying potentially better therapeutic approaches. It will also be useful for assessing the dangerous effects of welding on the human body.