Introduction

The concept of the human exposome was first proposed almost two decades ago as a framework to guide research that explores the etiological complexities of health and disease (Wild, 2005). Because the relationship between multifactorial exposure patterns and health outcomes is complex, there is a need for studies that incorporate information from multiple exposures. Approaches that include a single environmental exposure may not fully or accurately describe the risk of disease because mixing factors may alter the effects of a single exposure (Wild, 2005; Zhang et al., 2021). Alpha radiation, consisting of two protons and two neutrons, can be easily stopped by skin or paper, yet is harmful if ingested. Beta radiation, comprising electrons or positrons, can also be readily halted but poses risks to the human body when ingested. Gamma radiation, a high-energy electromagnetic wave, is produced by nuclear reactions and has strong penetrative capabilities; therefore, external exposure can cause significant harm. For example, recent studies have shown that known associations between fine particulate matter (PM2.5) and health are modified by gross \(\beta\) activity (a measure of the counts of \(\beta\) rays per unit time) (Blomberg et al., 2019; Dong et al., 2022). In the present study, we leverage statistical and ML methods to simultaneously consider the effects of background radiation levels (gamma radiation emitted from airborne radioactive particles and indoor radon gas), PM2.5 exposure, and other social and behavioral factors on county-level lung cancer rates.

The effects of radiation on health have long been investigated. Since the discovery of radiation and its subsequent applications in the nuclear fuel cycle, industry, security and consequence management, and nuclear medicine, the study of radiation-induced health effects on humans has been an important field of research. The evaluation of the risk from radiation exposure at high doses/high dose rates has been relatively well established through the Life Span Study of atomic bomb survivors (Seong et al., 2016; Ozasa et al., 2019; United Nations Scientific Committee on the Effects of Atomic Radiation [UNSCEAR], 2008), the Chernobyl workers study (Morton et al., 2021), and nuclear weapon fallout studies (Lyon et al., 2006; Takahashi et al., 2003). However, in determining radiation-induced outcomes, less is known about low-level or background exposures and their interactions with other environmental toxins.

Radon Exposure and Lung Cancer

Radon is produced from the decay of uranium and is emitted from terrestrial sources and building materials. It is known that exposure to radon is affected by various factors such as housing characteristics, surficial uranium concentration, soil permeability, and even groundwater (Mose & Mushrush, 1999; Ponciano-Rodríguez et al., 2021; Przylibski et al., 2022; Smith & Field, 2007). This alpha emitter and its progeny are absorbed into the lungs, which in the process exposes the airways to radiation. Radionuclides inhaled into the lungs and bronchi cause the ionization of biological molecules, which in turn causes DNA damage and potentially cancer (Abergel et al., 2022; National Research Council [NRC], 1999; McDonald et al., 1995). In this process, most of the inhaled radon is exhaled from the lung because radon is an inert gas. However, its progeny have short half-lives and usually decay before they are removed from the lung through exhalation (Tirmarche et al., 2010). Additional complexities arise when considering the other particles and gases that the radon progeny bind to, such as PM2.5. The pollution particles may serve as a vector for the deposition of radon progeny into the lungs.

Epidemiological studies that link radon exposure to lung cancer risk show conflicting results (Cohen & Colditz, 1994; Kreuzer et al., 2010, 2015; Mifune et al., 1992; Yoon et al., 2016). Although there is strong evidence from studies of uranium miners (Richardson et al., 2021), studies of the broader population are less conclusive. A recent ecological study conducted in Mexico examined the relationship between indoor radon exposure and lung cancer mortality. The findings of the study suggested that higher levels of radon concentration may be linked to an increased risk of lung cancer (Ponciano-Rodríguez et al., 2021). However, this study has a limitation in that it did not control for variables such as lifestyle or socioeconomic status in the model. In another study based in the USA, Cohen et al. (1994, 1995) used county-level data to investigate the association between radon exposure and lung cancer. This study showed a negative association between lung cancer risk and radon concentration, even though Cohen controlled for several confounding factors, such as smoking, socioeconomic factors, and geography (Cohen, 1995; Cohen & Colditz, 1994). There are several limiting factors within the aforementioned studies, for example, small sample sizes and challenges associated with decoupling the risk associated with radon exposure from other confounding factors (e.g., lifestyle, socioeconomic factors).

New computational methods for population-level exposomic research

The emerging fields of data science and ML provide new opportunities for characterizing the relationship between the exposome and lung cancer by offering alternative analytical methods for modeling complex relationships between social and environmental determinants of health. Using ML methods for modeling complex relationships in epidemiology research has become increasingly prevalent (Wiemken & Kelley, 2020). This study tested the effects of low-dose radiation in single- and multi-exposure models and compared results from traditional methods and ML methods for a comprehensive look at the relationship between low-dose radiation and lung cancer rates in the USA.

Methodology

This ecological analysis utilized county-level data to describe the relationship between two measures of low-dose radiation exposure and lung cancer rates. The following county-level factors were assembled and are summarized in Table 1: (1) environmental radiation exposures (gross gamma activity and indoor radon), (2) non-radiation environmental variables (air quality), (3) lifestyle (smoking), and (4) sociological data (demographic/socioeconomic). These variables were used to predict county-level lung/bronchus cancer risk and incidence. We tested for multicollinearity, and all of the variables showed a variance inflation factor (VIF) of less than 5. Poisson regression and Poisson random forest (RF) regression were used to model lung cancer incidence rates in 662 counties in the USA. The MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Square Error) from a fivefold cross-validation were compared across regression models to analyze model performance. The code used in this analysis can be found at https://github.com/Heechan-Lee/county_radon_lung.cancer.

Table 1 Radiation, environmental, sociological, and cancer incidence datasets of 662 counties in the USA used in this study

Environmental radiation data

Gamma count rate

The US Environmental Protection Agency’s (EPA’s) RadNet system monitors the gamma counts across the United States (Fraass, 2015). The first monitoring center came online in 2006, and since then the number of monitoring centers has increased to 140 (Fig. 1). Since July 2016, 80 monitoring centers have also recorded the gamma exposure rate; however, these data are not available for the entire timeframe of the cancer incidence data, so the gamma gross count rate is used instead. Gamma gross count rates are measurements of radiation emitted from particulates collected on an air filter; they are not a direct measure of exposure rate.

Fig. 1

Locations of 140 counties (or equivalent) that have RadNet monitoring centers in the USA and Puerto Rico

The three most prominent limitations of this dataset were as follows: (1) a high percentage of the monitoring centers were missing data from one or more months, (2) the data were limited to 140 county data points, and (3) some of the monitoring centers did not have records prior to 2013. To overcome these limitations, data were imputed by using existing alternate datasets. First, the monthly average of the hourly-reported gamma gross count was calculated to capture the seasonality of the data and to minimize the effect of outliers caused by local volatility. Second, the following two imputations were implemented: (1) imputing data for missing months through linear interpolation and the seadec (Seasonally Decomposed Missing Value Imputation) function from the imputeTS package (Moritz & Bartz-Beielstein, 2017) in R (R Core Team, 2021) and (2) 2D linear interpolation by using the ‘griddata’ function of SciPy in Python (Virtanen et al., 2020). Linear interpolation and the seadec function outperformed interpolation with the mean value as well as other methods, such as ARIMA (Autoregressive Integrated Moving Average) with Kalman filters and the seasplit (Seasonally Splitted Missing Value Imputation) function of the imputeTS package (Moritz & Bartz-Beielstein, 2017), for imputing the missing months. For 2D interpolation of counties without nearby or inherent geographical obstacles such as mountains or large forests, the interpolation showed less than 15% error between the averaged real and predicted counts. The map of interpolated gamma count rate data is shown in Fig. 2.
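The first imputation step, filling missing months by linear interpolation, can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `interpolate_missing` and the sample values are hypothetical, and the sketch covers only the straight-line fill, not the seasonal decomposition performed by seadec or the 2D spatial step.

```python
from typing import List, Optional


def interpolate_missing(series: List[Optional[float]]) -> List[float]:
    """Fill None gaps by straight-line interpolation between known neighbours.

    Leading or trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")

    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        # Nearest observed months on each side of the gap (if any).
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


# Hypothetical monthly mean gamma gross counts with two missing months.
monthly_counts = [2400.0, None, None, 2700.0, 2650.0]
print(interpolate_missing(monthly_counts))
```

A purely linear fill ignores seasonality inside the gap, which is presumably why the text pairs it with a seasonal decomposition.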

Fig. 2

Interpolated gamma count rate (RadNet) data for the USA from data of 140 monitoring centers created with seadec function and linear interpolation

We then created a summary measure of gross gamma activity for each observation year by averaging the gross gamma activity for the two years prior to the year of diagnosis (two-year lag).
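The lagged summary can be sketched as follows. This is an illustration only: it assumes that "the two years prior" means the two calendar years immediately before the diagnosis year, excluding the diagnosis year itself, and the function name `lagged_mean` and the values are hypothetical.

```python
from typing import Dict


def lagged_mean(yearly: Dict[int, float], diagnosis_year: int,
                lag_years: int = 2) -> float:
    """Average the values for the `lag_years` calendar years before diagnosis_year."""
    window = [diagnosis_year - k for k in range(1, lag_years + 1)]
    missing = [year for year in window if year not in yearly]
    if missing:
        raise KeyError(f"no data for years {missing}")
    return sum(yearly[year] for year in window) / lag_years


# Hypothetical annual mean gamma gross counts for one county.
county_gamma = {2011: 2500.0, 2012: 2600.0, 2013: 2550.0}
print(lagged_mean(county_gamma, 2013))  # averages 2011 and 2012 only
```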

Radon

Two sources of radon data were utilized in this study: radon zones from the US Environmental Protection Agency (1993) and the median concentration from indoor radon test kits downloaded from the US Centers for Disease Control and Prevention (CDC) database (Centers for Disease Control and Prevention, n.d.). The radon zone classification was developed by the EPA in 1993 and classifies counties into three groups based on the potential for exposure to indoor radon: Zone 1, representing the highest radon concentration group of 4 pCi/L or higher; Zone 2, with a radon concentration of 2–4 pCi/L; and Zone 3, with less than 2 pCi/L (US Environmental Protection Agency, 1993). Although the classification system is almost three decades old, the classifications can reasonably be assumed to be representative of the composition of soil and bedrock, which does not change significantly over this time span. The CDC radon data are the results of indoor radon tests from kits deployed in residential, industrial, and educational locations across the USA from 2008 to 2017. The locations of the 662 counties used in the analysis and the average of yearly median indoor radon concentrations of those counties are shown in Fig. 3.

Fig. 3

Average of yearly median indoor radon concentration for US Counties with SEER data downloaded from the US Centers for Disease Control and Prevention in pCi/L

Non-radiation environmental data

Air quality

Air pollution, notably particulate matter, is a known lung cancer-inducing factor (Couraud et al., 2012; Dela Cruz et al., 2011; Raaschou-Nielsen et al., 2013; Turner et al., 2011a, 2011b). Air quality-based measurements were obtained from the National Environmental Public Health Tracking Network (CDC, n.d.). This database includes various features, including toxic chemicals, ozone, and PM2.5. For toxic chemicals, the measurements were annual average concentrations for 2005 and 2011. Ozone data were based on the days on which the daily 8-h average ozone concentration exceeded 0.07 ppm between 2001 and 2016. The PM2.5 data were the average concentration of PM2.5 for each year between 2001 and 2016. For chemical concentrations, data from 2011 were extracted to best align with the cancer incidence data, and the concentrations were assumed not to have changed by 2017. Among the various chemicals, formaldehyde and benzene were employed. For ozone and PM2.5, the average of the concentrations for the 3 years preceding the cancer incidence record was included in this analysis. The locations of the 662 counties and their average PM2.5 concentrations are shown in Fig. 4.

Fig. 4

Average of PM2.5 concentration for US Counties with SEER cancer incidence data in µg/m3

Lifestyle data

Tobacco smoking

Smoking has been well established as the leading lung cancer-inducing factor (de Groot et al., 2018; Dela Cruz et al., 2011). The smoking data included in this study were adapted from the Robert Wood Johnson Foundation County Health Rankings (CHR) (University of Wisconsin Population Health Institute [UWPHI], 2022; Remington et al., 2015). This dataset provides the percentage of adults who self-identified as smokers in a 2015 state-based random digit dial telephone survey of the Behavioral Risk Factor Surveillance System. The 2015 smoking rates were chosen because smoking rates are relatively constant during the period of observation, and the midpoint of the period of interest was used as the representative year. The map of the smoking rates in 2015 is shown in Fig. 5.

Fig. 5

The percentage of adults in US Counties with SEER cancer incidence data who self-identified as smokers in a 2015 state-based random digit dial telephone survey of the Behavioral Risk Factor Surveillance System

Demographic and socioeconomic data

Demographic and socioeconomic factors from the Surveillance, Epidemiology, and End Results (SEER) database (National Cancer Institute, DCCPS, Surveillance Research Program [NCI-DCCPS-SRP], 2022) were included in the dataset because cancer incidence is affected by various demographic and socioeconomic factors (Siegel et al., 2019). The SEER data include education level, poverty rate, unemployment rate, and rate of residence in urban areas, and are divided into age cohorts separated by 5 years to reflect the age effect. Age groups range from 30–34 years old to 80–84 years old for both sexes. The averages from 2008 to 2012 reported for each county were used for high school education, median family income, and unemployment data and were assumed to remain constant. The urban rate was taken from 2010 data. Total population from the 2010 US Census for the 662 counties used in the analysis is shown in Fig. 6.

Fig. 6

Total population in 2010 for counties with SEER cancer incidence data

Health outcomes

Lung and bronchus cancer incidence rate

According to cancer statistics, lung and bronchus cancer causes the most cancer deaths and has the second-largest incidence across cancer types in the USA (Siegel et al., 2019). In this study, cancer incidence data employed age group and sex classifications from SEER (NCI-DCCPS-SRP, 2022). Lung and bronchus cancer incidences between 2013 and 2017 for five-year age groups spanning from 30 to 84 years old were used. This age range was selected to cover lung cancer incidence across adulthood and to capture variations in risk that may emerge as individuals age. Further details on age classification can be found in the SEER*Stat documentation (NCI-DCCPS-SRP, 2022). Figure 7 shows the cancer incidence rate for the 662 counties.

Fig. 7

Cancer incidence rate of age groups of interest (incidence per 100,000)

Regression models

Regression analysis was used to study the impact of various factors on health outcomes. Poisson regression, which is a count-based regression method, was utilized in this study.

The Poisson regression model used for the analysis is represented by the equation

$$\log\left(\lambda_{i}\right)=\alpha+\beta_{1}\times e_{1}+\beta_{2}\times e_{2}+\cdots+\beta_{\text{ag}1}\times e_{\text{ag}1}+\beta_{\text{ag}2}\times e_{\text{ag}2}+\cdots+\log\left(\text{Pop}_{i}\right)$$

where λi is the expected count of the outcome variable for the ith observation, α is the intercept term, βn denotes the coefficient for the nth predictor variable, en is the value of the nth predictor variable, and log(Popi) is the offset term representing the natural logarithm of the population for the ith observation. For the age group variable, dummy variables were employed. βagn is the coefficient corresponding to the nth age group, and eagn is its associated variable. This dummy variable takes a value of 1 if the observation belongs to the nth age group, and 0 otherwise.
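Moving the offset to the left-hand side makes its role explicit. The following restatement is a sketch in the notation above and introduces no new quantities:

$$\log\left(\frac{\lambda_{i}}{\text{Pop}_{i}}\right)=\alpha+\beta_{1}\times e_{1}+\beta_{2}\times e_{2}+\cdots+\beta_{\text{ag}1}\times e_{\text{ag}1}+\beta_{\text{ag}2}\times e_{\text{ag}2}+\cdots$$

The linear predictor therefore models the log incidence rate per person rather than the raw count, and the offset enters with its coefficient fixed at 1 instead of being estimated.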

An RF approach was also employed by using rfPoisson from the R package fpechon/rfCountData (Liaw & Wiener, 2002; Pechon, 2019). The RF algorithm, which synthesizes the results from several simple trees of sequentially specified questions or criteria to regress the data, can reduce the risk of overfitting on Poisson data (Pechon, 2019). The ML results were compared to Poisson regression through iterative fivefold cross-validation to evaluate the regression models. Comparisons were made with Mean Absolute Percentage Error (MAPE) (Hamner et al., 2018) and Root Mean Square Error (RMSE). In the first five repetitions of fivefold cross-validation, the RMSEs were computed. This was followed by a distinct set of five repetitions of fivefold cross-validation, in which the MAPEs and RMSEs were determined. Additionally, the variable importance measures (VIMs, feature importance), which show the importance of each factor that contributed to the regression results, were derived by using RF. VIMs were calculated using the ‘%IncLossFunction’ metric from the random forest model, which measures the percentage increase in the model's loss function when the values of a feature are randomly permuted and thereby indicates the significance of that feature for the model's predictive performance. Incidence rate ratios (IRRs) are reported only from the Poisson regression to increase the interpretability of the results.
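The repeated fivefold cross-validation described above can be sketched with the Python standard library. This is an illustration of the splitting scheme only, not the authors' R code: the function name `repeated_kfold`, the seed, and the row count of 100 are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_rows: int, k: int = 5, repeats: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_rows))
        rng.shuffle(order)
        # Deal the shuffled rows into k folds of near-equal size.
        folds = [order[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


splits = list(repeated_kfold(n_rows=100))
print(len(splits))  # 5 repeats x 5 folds = 25 train/test pairs
```

Each repetition reshuffles the rows, so five repetitions of five folds give 25 test-set error values per model.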

MAPE

The regression results were evaluated using MAPE. This metric is scale-independent, which makes it possible to compare models across different datasets. A smaller MAPE indicates a better-fitting model, with values closer to 0 preferred. However, it is sensitive to extreme values and positive errors. Additionally, if the actual value is close to 0, the percentage error can be exaggerated even when the absolute error is small.

$${\text{MAPE}}={\text{mean}}\left(\left|\frac{g\left({x}_{t}\right)-{y}_{t}}{{y}_{t}}\right|\right)*100$$

where g is the regression model, and yt is the target variable (de Myttenaere et al., 2016).
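The MAPE formula translates directly into code. The sketch below is illustrative: the function name `mape` and the sample values are hypothetical, and rejecting zero observed values is a choice of this sketch, made because the ratio is undefined there.

```python
from typing import Sequence


def mape(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y == 0 for y in observed):
        raise ValueError("MAPE is undefined when an observed value is 0")
    total = sum(abs((p - y) / y) for p, y in zip(predicted, observed))
    return 100.0 * total / len(observed)


# Hypothetical incidence rates per 100,000.
observed = [100.0, 100.0]
predicted = [110.0, 90.0]
print(mape(predicted, observed))

# The same absolute error of 1 yields a far larger percentage error
# when the observed value is small.
print(mape([101.0], [100.0]), mape([2.0], [1.0]))
```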

RMSE

RMSE was also used for evaluating the regression results. This metric is not scale-independent, but it is a popular statistical metric for measuring the magnitude of error between predicted and observed values.

$${\text{RMSE}} = \sqrt {{\text{mean}}\left( {\left( {g\left( {x_{t} } \right) - y_{t} } \right)^{2} } \right)}$$

where g is the regression model, and yt is the target variable.
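A matching sketch for RMSE follows, again with a hypothetical function name and made-up values. The second call rescales both inputs by 10 to show that RMSE scales with the units of the target.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error, in the units of the target variable."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - y) ** 2 for p, y in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Hypothetical incidence rates per 100,000.
observed = [100.0, 100.0]
predicted = [103.0, 96.0]
print(rmse(predicted, observed))

# Expressing the same data in units ten times smaller multiplies RMSE by 10.
print(rmse([p * 10 for p in predicted], [y * 10 for y in observed]))
```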

MAPE and RMSE were calculated from different validation processes.

Results

By developing a dataset of radiation, environmental, and sociodemographic variables that span the period of 2013–2017 (Table 1), Poisson regression and Poisson RF models were employed to model the relationship between the cancer-related factors and the lung/bronchus cancer incidence.

MAPE showed statistically significant differences between Poisson regression and Poisson RF when a t-test was performed. As the number of samples for each case is 25, the degrees of freedom are 48. The difference was significant for both males (t(48) = 12.86, p < 0.01) and females (t(48) = 6.40, p < 0.01). RMSE also showed significant differences for both males (t(48) = 8.85, p < 0.01) and females (t(48) = 6.57, p < 0.01) (Table 2).
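The degrees of freedom quoted above (25 + 25 − 2 = 48) are those of an unpaired two-sample t-test with pooled variance. The sketch below assumes that this is the test that was used, which the source does not state explicitly; the function name `pooled_t` and the error values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def pooled_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample Student t statistic with pooled variance, and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, df


# Hypothetical test-set MAPEs from 25 folds per model (not the study's values).
glm_mape = [11.0 + 0.1 * (i % 5) for i in range(25)]
rf_mape = [10.0 + 0.1 * (i % 5) for i in range(25)]
t_stat, df = pooled_t(glm_mape, rf_mape)
print(round(t_stat, 2), df)
```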

Table 2 Mean absolute percentage errors (MAPEs), root mean square errors (RMSEs), and their standard deviation from the test set with Poisson regression and Poisson random forest

Tables 3 and 4 summarize the regression results of various datasets through Poisson RF and Poisson regression. Smoking, radiation exposure, and PM2.5, which are thought to be related to radon exposure (Matthaios et al., 2021; Trassierra et al., 2016), and sociodemographic and behavioral factors were combined in various models. The analysis of the relationship between variables and model accuracy revealed an interesting trend in the error from the Poisson RF, as shown in Table 4. The VIM was acquired by averaging the model weights across folds with the entire dataset by using the default function of the fpechon/rfCountData package (Liaw & Wiener, 2002; Pechon, 2019). Tables 5 and 6 show the VIMs of the variables analyzed with the full-model Poisson random forest regression, which includes all variables, among them the socioeconomic variables.

Table 3 Mean absolute percentage errors (MAPEs), root mean square errors (RMSEs), and their standard deviation of each data set with Poisson regression
Table 4 Mean absolute percentage errors (MAPEs), Root mean square errors (RMSEs), and their standard deviation of each data set with Random Forest
Table 5 Variable importance measures (VIM) of variables from male dataset with Random Forest
Table 6 Variable importance measures (VIM) of variables from female dataset with Random Forest

Table 7 summarizes the IRRs analyzed with the full-model Poisson regression. The unit of increase for each IRR is proportional to the range of the corresponding variable to make the comparison more intuitive. In both cases, smoking had the greatest effect on lung cancer incidence rates. In the case of indoor radon, the association was negative [Male: 0.99 (0.98, 0.99), Female: 0.99 (0.98, 0.99)]. Also, background gamma count (RadNet) [Male: 0.97 (0.97, 0.98), Female: 0.98 (0.98, 0.99)] and three-year average PM2.5 for females [0.99 (0.98, 1.00), P-value: 0.09] showed negative associations at higher concentrations, which somewhat contradicts results from previous studies (Ghazipura et al., 2019; Raaschou-Nielsen et al., 2013; Turner et al., 2011a, 2011b).
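An IRR for a chosen unit of increase follows from a Poisson coefficient by exponentiation. The sketch below is illustrative only: it assumes a Wald-type 95% interval, which the source does not specify, and the function name `irr_with_ci`, the coefficient, the standard error, and the increment are hypothetical rather than values from Table 7.

```python
import math
from typing import Tuple


def irr_with_ci(beta: float, std_err: float, increment: float = 1.0,
                z: float = 1.96) -> Tuple[float, float, float]:
    """Incidence rate ratio and Wald interval for an `increment`-unit increase."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    point = math.exp(beta * increment)
    lower = math.exp((beta - z * std_err) * increment)
    upper = math.exp((beta + z * std_err) * increment)
    return point, lower, upper


# Hypothetical coefficient and standard error for one exposure variable,
# reported per 2-unit increase in that variable.
point, lower, upper = irr_with_ci(beta=-0.005, std_err=0.002, increment=2.0)
print(round(point, 3), round(lower, 3), round(upper, 3))
```

An IRR below 1 with an interval that excludes 1 corresponds to a negative association, as reported for indoor radon above.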

Table 7 Incidence rate ratios (IRRs) and 95% confidence intervals of each factor of interest with Poisson regression

To understand the differences broken down by EPA Radon Zone, separate regression models were run for each zone using the full model Poisson regression (Table 8). In the case of Radon Zone 1, an area with high radon concentration, the effect of PM2.5 exposure was the greatest. Conversely, in the case of Radon Zone 3, which is an area with a low radon concentration, higher rates of PM2.5 were associated with lower incidence rates. The effect of smoking was consistent across all radon zones.

Table 8 Incidence rate ratios (IRRs) and 95% confidence intervals of PM2.5 and smoking by radon zone

Discussion

The effects of environmental exposure on health outcomes are complex. In this study, the results (Table 8) suggest that the association between PM2.5 and lung cancer incidence may vary with levels of indoor radon exposure. Despite potential synergistic effects of exposure, many radiation epidemiological studies include a limited number of environmental exposure measures (Haylock et al., 2018; Richardson et al., 2015; Stanley et al., 2019; Tomasek, 2013). Belloni et al. (2020) have noted that few studies (Klebe et al., 2019; Leuraud et al., 2011) have attempted to address multifactorial exposures from environmental stressors. In the study of radiation-related disease, estimating the risk associated with radiation-related lung cancer has been a focal point in resolving the dose-risk response relationship (United Nations Scientific Committee on the Effects of Atomic Radiation [UNSCEAR], 2018). Furthermore, because the baseline cancer risk is high compared to the additional risk from low-dose radiation exposure, the population size required for detecting low-dose radiation risk with statistical significance increases exponentially as the target dose decreases (Ozasa, 2016; Ozasa et al., 2019; UNSCEAR, 2008; Valentin, 2006). To address some of these challenges, studies that use a wider range of data, such as the Million Person Study (Boice et al., 2022), are being conducted (Calabrese, 2015; Ricci & Tharmalingam, 2019; Tubiana et al., 2009; Valentin, 2008; Weber & Zanzonico, 2017). The population-level exposure variables and health outcomes data adopted in this study can serve as a valuable resource for future research. Population-level data offer an advantage in the adoption of multiple variables and the analysis of diverse health outcomes. Furthermore, ML techniques are particularly well suited to model the complex relationships that exist between environmental exposure and health outcomes. By leveraging ML, it is possible to capture the complex interplay between environmental exposures and health, thereby offering a promising avenue for future research in this field.

The results suggest that PM2.5 should be included in future analysis of radon-induced lung cancer incidence, as there may be an interaction with radon exposure. The observed patterns, where changes in radon concentration result in significant differences (p < 0.001 for all cases) in the effects of PM2.5, corroborate findings from other research that explores the combined impacts of PM2.5 and radon exposure (Dlugosz-Lisiecka, 2016). PM2.5 or other particulate matter could be one of the possible transport mechanisms that allow radon gas to permeate lung tissue. This is further supported by two experimental studies that assess the speciation of PM2.5 particles in the presence of radon progeny. The first study shows that the alpha activity of PM2.5 tends to increase as the concentration of radon increases (Matthaios et al., 2021). The second study shows that in a radon chamber, the presence of particulate matter will increase the attached fraction of radon progeny, thereby implying that the radiation exposure from particulate matter will increase (Trassierra et al., 2016). PM2.5 and radon seem to have synergistic effects and are thought to affect various health outcomes, including incidences of lung cancer. Given the possible synergistic effect between PM2.5 and radon, future epidemiological studies should investigate this further.

This study harnessed ML to consider the non-linear effects of radon exposure within the context of other environmental factors. The lower errors of the ML models suggest that ML is effective at analyzing complex relationships in environmental exposure studies and should be considered in future studies that investigate the relationship between radon exposure and cancer outcomes. One limitation of current ML is the lack of variety in ML algorithm packages that can be applied to count data. However, this limitation is expected to ease as ML develops and becomes more widely used in regression analysis.

Large-scale data pose challenges for analyses that depend on individual characteristics; for example, they are limited in their ability to reflect the interaction of environmental and genomic factors, which is important in the exposome approach (Zhang et al., 2021). Furthermore, individual exposure history, which is similarly essential to exposome analysis, is difficult to reflect in the analysis (Zhang et al., 2021). Thus, population-level studies of incidence rates, such as this one, are susceptible to the ecological fallacy. This limits the ability to establish causal relationships between variables and health outcomes. Despite these limitations, population-level studies can still provide valuable reference points for guiding individual-level studies.

The World Health Organization (2009) reported that radon is the second major contributor to lung cancer incidence. Also, a study by Turner et al. (2011a, 2011b), which analyzed county-level radon concentrations and residents' lung cancer risk in a manner similar to this study, showed a positive association between residential radon and lung cancer risk. However, our results showed a negative association between radon and lung cancer incidence rates [IRR of male: 0.99 (0.98, 0.99), IRR of female: 0.99 (0.98, 0.99)]. There are several reasons our findings may differ from occupational cohort studies, which show a strong association where individuals are exposed to high levels of radon (Kreuzer et al., 2015; Leuraud et al., 2011; Richardson et al., 2021, 2022). First, as mentioned above, this study may suffer from ecological fallacy. Second, indoor radon exposure risk is measured at the county level and radon exposure varies widely across counties (Li et al., 2021). Third, the effect sizes at low levels of exposure are likely small, making the signal difficult to detect in an ecological analysis. Our results, from a study in which the association between residential radon exposure and lung cancer is difficult to distinguish, are more aligned with results from a recently published study of residential exposure and lung cancer (Li et al., 2020). The study on residential radon exposure and lung cancer risk in Connecticut and Utah (Sandler et al., 2006) could not provide evidence of an increased risk of lung cancer at the exposure levels observed. Unlike in miner studies, residential radon exposure is so low that statistically significant results are difficult to obtain.

Furthermore, the difference in findings across studies may arise from discrepancies between individual-level and population-level approaches in their methodologies and analysis. Also, regarding the interaction between smoking and radon, the results were different from the previous studies. According to BEIR VI, a comprehensive analysis of the relationship between smoking habits, radon exposure, and lung cancer risk of uranium miners from several studies showed a submultiplicative effect, which means that the risk in the population exposed to both smoking and radon is greater than the sum of the individual risks expected from either smoking or radon exposure and less than the product (NRC, 1999). The results of a case-control study in Spain after BEIR VI indicated that there is a strong synergistic effect between smoking and radon exposure, and the case-control miner study showed evidence of submultiplicative interaction between radon and smoking (Barros-Dios et al., 2012; Leuraud et al., 2011). However, the association between smoking and radon concentration did not appear to be significant in the results presented herein. These inconsistent results again may be attributed to certain limitations in this study, including coarse measurement of radon concentrations. Using the median limits the effect of outliers, but it remains subject to error where the number of tests is insufficient. This problem could skew the results toward non-significant associations or even contradict established knowledge.

Possible confounding factors that were not properly reflected include the level of stress that people experience and the quality of medical care, both of which vary considerably by county or state, despite some socioeconomic factors being included. This may also explain the opposite trend in this analysis vs. the previously known results of PM2.5 and lung cancer incidence. These problems could be mitigated if the research is conducted on specific regions with very high-resolution data, or by improving our measures of radon concentrations. Another limitation of this study is the lack of residential history data, which made it impossible to create a model that adequately considers different exposures across a life span and the associated latency periods. Other lung cancer models have considered a latency period of 5 years (National Research Council [NRC], 2006; UNSCEAR, 2008; Valentin, 2008). Future studies should use residential history to assess the effects of indoor radon exposure across a life span.

If future studies address these limitations, then highly accurate ML techniques could be combined with the advantages and applicability of population-level data in radiation epidemiology to analyze more diverse health outcomes. This may also provide valuable insights into the interplay between variables.

Conclusion

Traditional statistical methods and ML models can be used in parallel to fully understand the complex relationship between environmental exposures and health. To investigate the applicability of multivariable and ML methods in environmental exposure studies, county-level lung/bronchus cancer risk was assessed with various exposures (airborne gamma counts, radon concentration, air quality), lifestyle (smoking), and socioeconomic factors through Poisson regression and Poisson RF regression. The study found that the risk of lung cancer from PM2.5 varied by radon concentration with larger effect sizes in areas with high indoor radon exposure. In summary, the results of this study demonstrate how (1) including multiple environmental exposures has advantages over single exposure studies when the relationship between the environment and lung cancer risk is considered, thereby making an exposomics framework an important consideration, and (2) employing ML models enhances the utility of analysis in identifying complex relationships, as in the case of environmental radiation exposure and lung cancer incidence. Consequently, this study proposes a new paradigm for studying environmental radiation combined with other environmental exposures.