Background

Lung cancer is a malignant tumor with high morbidity and mortality [1]. In China, it has the highest mortality rate among malignant tumors, accounting for about 25% of cancer-related deaths in the world [2]. Many risk factors for lung cancer have been identified. Among them, smoking is strongly associated with lung cancer risk [3]. However, recent research has shown that, worldwide, 15–20% of men with lung cancer and over 50% of women with lung cancer are non-smokers [4], indicating the importance of other risk factors such as exogenous air pollution and other environmental and genetic factors. Genetic factors have been reported to be robustly associated with lung cancer [5]. A family history of lung cancer in a first-degree relative increases the risk 2–4-fold, even after controlling for smoking history [6].

Lung cancer is the leading cause of cancer mortality worldwide, although women are less than half as likely as men to die of it [1]. Among non-smokers, lung cancer tends to be more common in females [4]. These findings have prompted investigation of the effects of estrogen on lung cancer risk. Both estrogen receptor and aromatase have been reported to be present in human lung tumors [7, 8], suggesting that estrogen may play a role in the biological behavior of human lung cancer.

Cytochrome P450 (CYP450) enzymes are pivotal for biological homeostasis. They play a key role in the metabolism of many endogenous substrates and exogenous carcinogens, including aromatic and heterocyclic amines. The resulting metabolites can bind covalently to DNA to form DNA adducts, which in turn can cause cancer [9, 10]. The cytochrome P450 family 19 subfamily A polypeptide 1 (CYP19A1) gene encodes aromatase, a member of the CYP450 superfamily and a key enzyme in estradiol biosynthesis. Mutations in the CYP19A1 gene can result in either increased or decreased aromatase activity [11], and aromatase plays an important role in lung cancer [12]. This suggests that CYP19A1 genetic variations may indirectly affect the occurrence of lung cancer, but the exact mechanism is unclear. Several studies have also reported a close relationship between CYP19A1 genetic variants and lung-related diseases, including lung cancer [13]. CYP19A1 rs3764221 has been reported to be significantly associated with the multicentric development of lung adenocarcinomas [13], and CYP19A1 rs727479 with the incidence of lung cancer [14]. However, for a large number of single-nucleotide polymorphisms (SNPs) in CYP19A1, the association with lung cancer risk has not been reported.

Based on the Han Chinese in Beijing (CHB) population in the 1000 Genomes database (http://www.internationalgenome.org/) and the dbSNP database (http://www.bioinfo.org.cn), four SNPs in CYP19A1 with a minor allele frequency greater than 5% were randomly selected: rs28757157 (NG_007982.1:g.90395G>C), rs3751592 (NG_007982.1:g.29218A>G), rs3751591 (NG_007982.1:g.29086T>C), and rs59429575 (NG_007982.1:g.28719G>A). These SNPs are included on the chips used in published genome-wide association studies (GWAS) of testicular germ cell tumor and breast cancer [15, 16], but not of lung cancer. This study therefore aimed to investigate the association between these four SNPs in the CYP19A1 gene and lung cancer susceptibility through a case–control study.

Methods

Participants

To ensure the accuracy and credibility of the results, we used G*Power 3.1.9.7 software (https://stats.idre.ucla.edu/other/gpower/) to estimate the sample size before conducting this study. The parameters were set as follows: effect size d = 0.2, α error probability = 0.05, and power (1 − β error probability) = 80%. This calculation yielded a minimum sample of 412 cases and 412 controls. We recruited 489 cases and 467 controls, exceeding the sample size recommended by G*Power. The 489 patients with pathologically confirmed lung cancer were recruited from Xuanwei City, Yunnan. All cases were diagnosed as lung cancer by histological examination according to the World Health Organization tumor classification system and confirmed by two independent pathologists. The exclusion criteria for patients were as follows: (1) history of other tumors; (2) family history of lung cancer; (3) chemotherapy or radiotherapy treatment; (4) hypertension, diabetes mellitus, or any endocrine metabolic diseases; and (5) other lung diseases. The control group comprised 467 healthy subjects who were volunteer blood donors from the same city as the cases. Controls with a history of any cancer, other endocrine metabolic diseases, or other lung diseases were excluded. Eligible study participants were screened with a specialized questionnaire covering demographic characteristics, disease history, lung status, and family history of other types of tumors. All participants were of Chinese Han ancestry from northwest China. The research protocol was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the First People’s Hospital of Yunnan Province, and written informed consent was obtained from all subjects.
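The sample-size parameters above can be sanity-checked with a short calculation. The sketch below is not the G*Power procedure: it assumes a two-sided, two-sample comparison and uses the textbook normal approximation n = 2(z_(1−α/2) + z_(power))² / d² per group. The function name `n_per_group` is hypothetical. With d = 0.2, α = 0.05, and power = 0.80 this approximation gives a somewhat smaller figure than the 412 per group reported above; the difference presumably reflects the specific test family selected in G*Power, which the text does not state.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sided two-sample comparison.

    Assumption: normal approximation to the test statistic, equal group sizes.
    This is an illustration, not a re-implementation of G*Power.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = std_normal.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


if __name__ == "__main__":
    print(n_per_group(0.2))
```

Because the required size scales with 1/d², halving the assumed effect size roughly quadruples the number of participants needed, which is why the choice of d dominates this kind of calculation.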

SNP selection

Four SNPs in CYP19A1 (rs28757157 (NG_007982.1:g.90395G>C), rs3751592 (NG_007982.1:g.29218A>G), rs3751591 (NG_007982.1:g.29086T>C), and rs59429575 (NG_007982.1:g.28719G>A)) were randomly selected based on the following criteria: (1) variations of CYP19A1 listed in the Ensembl GRCh37 database (http://asia.ensembl.org/Homo_sapiens/Info/Index) for the CHB and CHS populations; (2) a Hardy–Weinberg equilibrium (HWE) p value > 0.01, a minor allele frequency (MAF) > 0.05, and a minimum genotype percentage > 75% in Haploview software; (3) successful assay design with the MassARRAY primer design software, together with an HWE p value > 0.05, MAF > 0.05, and call rate > 95% in our study population; and (4) a MAF > 0.05 in the 1000 Genomes (http://www.internationalgenome.org/) and dbSNP (http://www.bioinfo.org.cn) databases.
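The study-population thresholds in criterion (3) can be sketched as a simple filter on genotype counts. This is a minimal illustration, not the Haploview or MassARRAY implementation. It assumes a biallelic SNP, a 1-degree-of-freedom χ2 goodness-of-fit test for HWE, and hypothetical function names (`maf`, `hwe_p_value`, `passes_qc`).

```python
import math


def maf(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Minor allele frequency from counts of the three genotypes AA, AB, BB."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (2 * n_bb + n_ab) / total_alleles
    return min(freq_b, 1 - freq_b)


def hwe_p_value(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(n_aa: int, n_ab: int, n_bb: int, n_samples: int,
              maf_min: float = 0.05, hwe_min: float = 0.05,
              call_rate_min: float = 0.95) -> bool:
    """Apply the MAF, HWE, and call-rate thresholds quoted in the text."""
    call_rate = (n_aa + n_ab + n_bb) / n_samples
    return (call_rate > call_rate_min
            and maf(n_aa, n_ab, n_bb) > maf_min
            and hwe_p_value(n_aa, n_ab, n_bb) > hwe_min)
```

For example, hypothetical counts of 64, 32, and 4 among 100 genotyped samples sit exactly at HWE proportions and pass all three thresholds, whereas counts of 50, 0, and 50 (no heterozygotes) fail the HWE check.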

SNP genotyping

Genomic DNA was extracted from peripheral blood samples using a DNA purification extraction kit (GoldMag, Xi’an, China). The concentration and purity of DNA were determined by an ultraviolet spectrophotometer (Nanodrop 2000, Thermo, USA). The multiplexed SNP MassEXTEND assay was designed with the Agena Bioscience Assay Design Suite software, version 3.0 (Agena Bioscience, USA). SNP genotyping was conducted on the MassARRAY platform (Agena Bioscience, USA), which is based on matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) mass spectrometry (MS). First, a locus-specific PCR reaction was performed, followed by a locus-specific primer extension reaction (iPLEX assay), in which oligonucleotide primers were annealed directly upstream of the polymorphic site to be genotyped. In the iPLEX assay, primers and amplified target DNA were incubated with a large quantity of modified dideoxynucleotide terminators. Each primer is extended by a single mass-modified base complementary to the sequence at the variant site. The mass of the extended primers was determined by MALDI-TOF MS. The mass of an extended primer indicates its sequence and therefore the allele present at the polymorphic locus of interest, so SNP alleles can be distinguished by the different masses of the extended primers [17, 18]. Data processing was carried out with Agena Bioscience TYPER software, version 4.0 (Agena Bioscience, San Diego, CA, USA) [19]. For quality control, a randomly selected 10% of samples were re-analyzed, with 100% concordance.

Statistical analysis and bioinformatics analysis

SPSS software (SPSS 22.0, USA) and Microsoft Excel were used for statistical analysis. Continuous variables were evaluated for normality using the Kolmogorov–Smirnov test. Continuous variables with a non-normal distribution (age and body mass index (BMI)) were expressed as median with interquartile range (IQR) and compared using the Mann–Whitney U test. Differences in the distribution of gender, smoking, and drinking between the case and control groups were assessed with the χ2 test. The χ2 test was also used to determine whether individual polymorphisms were in HWE and to detect differences in allele and genotype frequencies between cases and controls. The SNPStats software (https://www.snpstats.net/start.htm?q=snpstats/start.htm) was used to evaluate the relationship between polymorphisms and the risk of lung cancer in the Chinese Han population under different genetic models (genotype, dominant, recessive, and additive). Logistic regression analysis was used to calculate odds ratios (ORs) and 95% confidence intervals (CIs) for the relationship of the four selected SNPs with lung cancer risk [20,21,22]. Binary logistic regression was used to analyze the two-SNP interactions associated with lung cancer susceptibility. A p value < 0.05 was considered statistically significant in all tests. The functionality of candidate SNPs was annotated using the HaploReg v4.1 (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php), RegulomeDB (https://regulome.stanford.edu/regulome-search/), and QTLbase (http://www.mulinlab.org/qtlbase/index.html) databases.
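As an illustration of how an OR and its 95% CI arise from a 2×2 table, the sketch below uses the standard log-odds (Woolf) interval. It is an unadjusted calculation on a single table, whereas the study used logistic regression in SPSS and SNPStats, so it should not be expected to reproduce the reported values. The function name `odds_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a: int, b: int, c: int, d: int, level: float = 0.95):
    """Odds ratio with a Woolf (log-scale) confidence interval.

    a: cases carrying the risk allele/genotype
    b: cases not carrying it
    c: controls carrying it
    d: controls not carrying it
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Purely illustrative counts, not data from this study.
    print(odds_ratio_ci(30, 70, 20, 80))
```

An interval that excludes 1 corresponds to a significant association at the matching α level, which is how the ORs and CIs in the Results relate to their p values.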

In multifactor dimensionality reduction (MDR) analysis, multilocus genotypes are classified into high- and low-risk groups, which transforms multidimensional genotype variables into a single-dimensional one [23]. To explore the association of high-order SNP–SNP interactions with susceptibility to lung cancer, we used the MDR method with cross-validation and permutation-test procedures. Cross-validation minimizes the possibility of false-positive results by dividing the data into a training set and a testing set and repeating the analysis for each partition of the data. Balanced accuracy was used to assess model quality, and the overall best model was the one with the greatest accuracy in the testing data. The cross-validation consistency (CVC) is the number of cross-validation intervals in which a particular model was found; higher numbers indicate more robust results. A permutation test was used to assess the significance of the best model [24]. Permutation testing indicated that the cross-validation consistency and the prediction error were statistically significant at the 0.001 level, meaning that among 1000 permuted datasets, no best model had a cross-validation consistency or a prediction error of the same magnitude as observed for the original dataset. The optimal CYP19A1 SNP–SNP interaction model for lung cancer susceptibility was identified using MDR 3.0.2 software.
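The core reduction step of MDR described above can be sketched in a few lines. This is a simplified illustration, not the MDR 3.0.2 implementation: it omits cross-validation and permutation testing, labels a multilocus genotype cell as high-risk when its case:control ratio exceeds the overall ratio (how ties are handled is an assumption here), and uses hypothetical function names.

```python
from collections import Counter
from typing import Hashable, Sequence, Set


def mdr_high_risk_cells(genotypes: Sequence[Hashable], labels: Sequence[int]) -> Set[Hashable]:
    """Return the multilocus genotype cells labelled high-risk.

    genotypes: one hashable multilocus genotype per subject, e.g. (0, 1)
    labels:    1 for case, 0 for control
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    # Cross-multiplied comparison avoids dividing by zero for control-free cells.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] > threshold * controls[cell]}


def balanced_accuracy(genotypes: Sequence[Hashable], labels: Sequence[int],
                      high_risk: Set[Hashable]) -> float:
    """Mean of sensitivity and specificity of the high/low-risk classification."""
    tp = fn = tn = fp = 0
    for g, y in zip(genotypes, labels):
        predicted_case = g in high_risk
        if y == 1:
            tp += predicted_case
            fn += not predicted_case
        else:
            fp += predicted_case
            tn += not predicted_case
    return (tp / (tp + fn) + tn / (tn + fp)) / 2
```

In the full procedure, the high-risk cells would be learned on each training partition and the balanced accuracy evaluated on the held-out partition; the sketch evaluates on the same data only to keep the example short.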

Results

Study population

This study included 489 lung cancer patients (337 males and 152 females) and 467 healthy controls (326 males and 141 females). The median (IQR) ages of cases and controls were 61.00 (56.00–65.00) years and 61.00 (55.00–65.00) years, respectively (Table 1). In addition, the characteristics of the study population were collected for subsequent analyses, including BMI, smoking and drinking history, pathological type, pathological stage, and lymph node metastasis (LNM). There was no significant difference in age, gender, BMI, smoking, or drinking between the case group and the control group (p > 0.05).

Table 1 Characteristics of the study population

Genetic analyses of the selected SNPs with the risk of lung cancer

Four SNPs in CYP19A1 were genotyped among the subjects. A representative spectrum of each SNP is displayed in Supplemental Fig. 1. The basic information about all candidate SNPs is listed in Table 2. All SNPs are located on chromosome 15, at different positions in the CYP19A1 gene. Deviation from Hardy–Weinberg equilibrium was evaluated in the control group; all candidate SNPs were in HWE (p > 0.05) and were therefore retained for further analysis. Under the allele model, the allele frequency of rs28757157 differed significantly between lung cancer cases (0.215) and healthy controls (0.174), and the rs28757157 T allele might contribute to an increased risk of lung cancer (p = 0.025, OR = 1.30, 95% CI 1.03–1.64). Functional prediction of the SNPs was conducted in the HaploReg v4.1 and RegulomeDB databases to explore their regulatory effects. The results showed that all four SNPs had potential biological functions in gene regulation. According to the QTLbase database, the genotypes of CYP19A1 rs28757157 (p = 6.610e-5) were related to the mRNA expression of CYP19A1 in the lungs (Fig. 1).

Fig. 1 Overview of eQTL for rs28757157 (a) and trait-wise plot of eQTL for rs28757157 in the lung (b)

Table 2 Basic information of candidate SNPs in CYP19A1

The relationship between CYP19A1 polymorphisms and the risk of lung cancer under four genetic models is listed in Table 3. Our results revealed an association between rs28757157 and increased risk of lung cancer in the genotype (p = 0.034, OR = 1.43, 95% CI 1.09–1.88), dominant (p = 0.011, OR = 1.41, 95% CI 1.08–1.85), and additive (p = 0.021, OR = 1.34, 95% CI 1.04–1.71) models.
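The four genetic models differ only in how a genotype is coded as a predictor before regression. The sketch below shows one common coding, assuming each genotype is summarized as the number of minor alleles carried (0, 1, or 2); the function name `encode_genotype` is hypothetical, and the exact coding used by SNPStats is not described in the text.

```python
from typing import Tuple


def encode_genotype(minor_allele_count: int, model: str) -> Tuple[int, ...]:
    """Code a genotype (0, 1, or 2 minor alleles) under a genetic model.

    Returns the predictor value(s) that would enter a logistic regression.
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, or 2")
    if model == "dominant":    # carriers (1 or 2 copies) vs non-carriers
        return (int(minor_allele_count >= 1),)
    if model == "recessive":   # two copies vs zero or one copy
        return (int(minor_allele_count == 2),)
    if model == "additive":    # per-allele effect on the log-odds scale
        return (minor_allele_count,)
    if model == "genotype":    # separate indicators for heterozygote and minor homozygote
        return (int(minor_allele_count == 1), int(minor_allele_count == 2))
    raise ValueError(f"unknown model: {model}")
```

Under this coding, a heterozygote contributes 1 under the dominant and additive models and 0 under the recessive model, which is why a variant can reach significance in some models and not in others.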

Table 3 Analysis of the association between CYP19A1 polymorphisms and risk of lung cancer

Stratification analyses by demographic characteristics

In addition, we conducted a stratified analysis by demographic characteristics (age, gender, BMI, smoking, and drinking) to explore the risk effects of these SNPs in specific groups, as shown in Table 4. Age stratification indicated that rs28757157 (genotype: p = 0.018, OR = 1.83; dominant: p = 0.005, OR = 1.82; and additive: p = 0.006, OR = 1.78), rs3751592 (genotype: p = 0.032, OR = 1.87; dominant: p = 0.010, OR = 1.93; and additive: p = 0.009, OR = 1.81), and rs59429575 (genotype: p = 0.047, OR = 1.71; dominant: p = 0.014, OR = 1.75; and additive: p = 0.016, OR = 1.57) were associated with increased susceptibility to lung cancer in people aged under 60 years. Moreover, rs28757157 was a risk factor for lung cancer among females in the dominant (p = 0.033, OR = 1.76) and additive (p = 0.036, OR = 1.70) models. In smokers, rs28757157 (dominant: p = 0.031, OR = 1.55; and additive: p = 0.042, OR = 1.46) might confer a higher risk of lung cancer. In addition, rs28757157 (genotype: p = 0.033, OR = 2.03; dominant: p = 0.009, OR = 2.04; and additive: p = 0.010, OR = 1.99) and rs59429575 (dominant: p = 0.044, OR = 1.75; and additive: p = 0.044, OR = 1.63) were related to an increased risk of lung cancer in drinkers, whereas rs3751592 (p = 0.023, OR = 3.31) was identified as a genetic risk factor for lung cancer susceptibility in non-drinkers. However, no significant correlation between CYP19A1 polymorphisms and lung cancer risk was found after stratification by BMI.

Table 4 Stratification analyses by demographic characteristics for the association between CYP19A1 polymorphisms and the risk of lung cancer

Stratification analyses by clinical characteristics

As listed in Table 5, the correlation between CYP19A1 polymorphisms and lung cancer risk was assessed in different clinical subgroups (tumor type, LNM, and stage). Stratified analysis by tumor type demonstrated a relationship between rs28757157 and an increased risk of squamous cell carcinoma (dominant: p = 0.032, OR = 1.59; and additive: p = 0.042, OR = 1.48), while the rs3751592 CC genotype was identified as a risk factor for lung adenocarcinoma (genotype: p = 0.011, OR = 3.57; and recessive: p = 0.013, OR = 3.84). No significant association between CYP19A1 polymorphisms and lung cancer risk was observed in the analyses stratified by LNM and tumor stage.

Table 5 Stratification analyses by clinical characteristics for the association between CYP19A1 polymorphisms and the risk of lung cancer

Two-SNP interactions associated with lung cancer susceptibility

As displayed in Table 6, the SNP pairs rs28757157-rs3751592 (p < 0.001, OR = 2.03), rs28757157-rs3751591 (p < 0.001, OR = 1.75), and rs28757157-rs59429575 (p < 0.001, OR = 1.55), together with one further pair listed in Table 6 (p = 0.011, OR = 1.31), were associated with higher lung cancer susceptibility.

Table 6 The two SNP interactions associated with lung cancer susceptibility

MDR analysis

The association between higher-order SNP–SNP interactions and the predisposition to lung cancer was examined by MDR, as summarized in Fig. 2 and Table 7. Figure 2 shows that the four polymorphisms exhibited strong redundancy effects on the risk of lung cancer, and that rs28757157 had an information gain of 2.22% as an individual attribute for the occurrence of lung cancer. Table 7 shows that the most influential single-locus attribute for lung cancer risk was rs28757157 (testing balanced accuracy of 0.5503 and cross-validation consistency of 10/10).
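The information gain of an individual attribute can be understood as the reduction in entropy of case/control status once the attribute is known. The sketch below computes this quantity in bits. It is a generic entropy-based illustration with hypothetical function names; the exact normalization behind the percentages reported by the MDR software is not stated in the text.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(values).values())


def information_gain(attribute: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Entropy of the labels minus their conditional entropy given the attribute."""
    n = len(labels)
    conditional = 0.0
    for value in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional
```

An attribute that perfectly separates balanced cases and controls has a gain of 1 bit, and one that is independent of status has a gain of 0, so a gain of a few percent indicates a weak main effect.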

Fig. 2 The dendrogram (a) and Fruchterman–Rheingold graph (b) of CYP19A1 SNP–SNP interaction for the risk of lung cancer. Green and blue colors indicate stronger redundant interactions. Values in nodes and between nodes represent the information gains of an individual attribute (main effects) and each pair of attributes (interaction effects), respectively

Table 7 SNP–SNP interaction models of CYP19A1 polymorphisms in lung cancer susceptibility

MDR analysis of gene-environment interaction also suggested that rs28757157 was the most influential single-factor attribute for lung cancer risk. Gender and smoking were found to be the most important environmental factors affecting lung cancer susceptibility. In addition, the gene-environment interaction model composed of rs28757157, rs3751591, gender, BMI, and smoking showed higher testing balanced accuracy (0.601) and cross-validation consistency (9/10), indicating that this interaction model was a candidate gene-environment model in our population. Figure 3 shows a strong synergy effect of gene-environment interaction on lung cancer risk.

Fig. 3 The dendrogram (a) and Fruchterman–Rheingold graph (b) of CYP19A1 gene-environment interaction for the risk of lung cancer. Green and blue colors indicate stronger redundant interactions. Values in nodes and between nodes represent the information gains of individual attributes (main effects) and each pair of attributes (interaction effects), respectively

Discussion

In this study, we assessed the association of four SNPs in the CYP19A1 gene with susceptibility to lung cancer in a Chinese Han cohort. Statistical and bioinformatics results highlighted the roles of rs28757157, rs3751592, and rs59429575 in the onset of lung cancer in the total or stratified population, which improves our understanding of CYP19A1 in this disease.

The CYP19A1 gene, which encodes aromatase, the enzyme responsible for the final step in the biosynthesis of the estrogens estradiol (E2) and estrone (E1), has been intensively investigated [25, 26]. SNPs in the intron region of CYP19A1 have been found to play an important role in the transcriptional regulation and splicing of CYP19A1 and could produce enzymes whose activity differs from that of the normal gene product [27]. The allele frequencies of several CYP19A1 SNPs have been documented in different populations and ethnic groups around the world. SNPs in CYP19A1 have been found to be associated with cancer risk [28]. In particular, CYP19A1 SNPs have been shown to be significantly associated with lung-related diseases.

A previous study showed that SNP rs3764221 is significantly correlated with CYP19A1 expression in non-cancerous lung tissues and affects susceptibility to lung adenocarcinoma. The authors suggested that CYP19A1 polymorphisms may lead to elevated levels of local estrogen surrounding the lungs, and that this excess local estrogen production may be one of the factors associated with the multicentric development of adenocarcinoma [13]. A more recent report suggested that CYP19A1 polymorphism is involved in lung bronchioloalveolar carcinoma and atypical adenomatous hyperplasia by causing differences in estrogen levels [29]. Together, these findings indicate that CYP19A1 polymorphism may alter estrogen levels around the lungs, which in turn can affect susceptibility to lung cancer. Our results revealed, for the first time, an association between rs28757157 and increased risk of lung cancer in the genotype, dominant, and additive models. In the bioinformatic analysis, the HaploReg v4.1 database indicated that rs28757157 may be associated with enhancer histone marks, changed motifs, and selected eQTL hits [30]. According to the QTLbase database, the genotypes of CYP19A1 rs28757157 (p = 6.610e-5) were related to the mRNA expression of CYP19A1 in the lungs [31]. These results suggest that CYP19A1 rs28757157 may be involved in lung carcinogenesis by affecting the expression or function of CYP19A1, which requires further experimental confirmation.

Notably, demographic characteristics (age, gender, BMI, smoking, and drinking) might influence the genetic association with the occurrence of lung cancer [32]. Our research showed that CYP19A1 rs28757157 was associated with increased lung cancer risk in people aged under 60 years, females, smokers, and drinkers. In addition, rs3751592 and rs59429575 were identified as risk biomarkers in people aged under 60 years and in drinkers. These results indicate that the risk association of these polymorphisms might be age-, sex-, smoking-, and drinking-dependent, and that interactions between genes and behavioral habits might operate in the pathogenesis of lung cancer.

These SNPs are located in the intron region of the CYP19A1 gene. Combining previous studies with database predictions, we speculate that CYP19A1 intron SNPs may alter mRNA splicing, thereby changing the activity of CYP19A1 and the levels of related estrogens, and may thus affect disease susceptibility. Because the statistical significance of the correlation between CYP19A1 gene polymorphisms and the risk of lung cancer is modest, further experimental studies are needed to verify the results of this study.

Furthermore, the correlation between CYP19A1 polymorphisms and lung cancer risk was assessed in different clinical subgroups (tumor type, LNM, and stage). Stratified analysis by tumor type demonstrated a relationship between rs28757157 and an increased risk of squamous cell carcinoma, while the rs3751592 CC genotype was identified as a risk factor for lung adenocarcinoma. These findings suggest that lung adenocarcinoma and squamous cell carcinoma may have different genetic and pathological mechanisms, which needs to be confirmed.

Our study has several limitations. First, all subjects were enrolled from the same hospital, and this limitation in sample selection may affect the accuracy of the results. Second, owing to the lack of adequate information on factors such as dietary habits, occupational exposure, and air pollution, this study could not assess the impact of these factors on the association between CYP19A1 variants and lung cancer susceptibility. Additional studies that encompass more geographical regions, additional ethnic groups, and larger sample sizes with complete risk factor information should be performed. To verify the results of this study, subsequent functional studies are needed to clarify the relationship between the CYP19A1 gene and lung cancer.

Conclusions

In summary, our study identified three CYP19A1 SNPs (rs28757157, rs3751592, and rs59429575) that were significantly associated with lung cancer susceptibility. These variants may be considered as markers for lung cancer risk assessment in the Chinese Han population.