Air quality and cancer risk in the All of Us Research Program

Introduction The NIH All of Us Research Program has enrolled over 544,000 participants across the US with unprecedented racial/ethnic diversity, offering opportunities to investigate myriad exposures and diseases. This paper aims to investigate the association between PM2.5 exposure and cancer risks.
Materials and methods This work was performed on data from 409,876 All of Us Research Program participants using the All of Us Researcher Workbench. Cancer case ascertainment was performed using data from electronic health records (EHR) and the self-reported Personal Medical History questionnaire. PM2.5 exposure was retrieved from NASA’s Earth Observing System Data and Information Center and assigned using participants’ 3-digit zip code prefixes. Multivariable logistic regression was used to estimate the odds ratio (OR) and 95% confidence interval (CI). Generalized additive models (GAMs) were used to investigate non-linear relationships.
Results A total of 46,176 prevalent cancer cases among 33,387 participants were ascertained from participant EHR data, while 20,297 cases were ascertained from self-reported survey data from 18,133 participants; 9,502 cancer cases were captured in both the EHR and survey data. Average PM2.5 level from 2007 to 2016 was 8.90 μg/m3 (min 2.56, max 15.05). In analysis of cancer cases from EHR, increased odds were observed for breast cancer (OR 1.17, 95% CI 1.09–1.25), endometrial cancer (OR 1.33, 95% CI 1.09–1.62), and ovarian cancer (OR 1.20, 95% CI 1.01–1.42) in the 4th quartile of exposure compared to the 1st. In GAM, higher PM2.5 concentration was associated with increased odds for blood cancer, bone cancer, brain cancer, breast cancer, colon and rectum cancer, endocrine system cancer, lung cancer, pancreatic cancer, prostate cancer, and thyroid cancer.
Conclusions We found evidence of an association of PM2.5 with breast, ovarian, and endometrial cancers. There is little to no prior evidence in the literature on the impact of PM2.5 on risk of these cancers, warranting further investigation.
Supplementary Information The online version contains supplementary material available at 10.1007/s10552-023-01823-7.

The All of Us Research Program is enrolling a cohort of over one million participants, offering researchers an unprecedented opportunity to investigate diseases including cancers [37,38]. Notably, All of Us includes participants from racial and ethnic minority groups that have been underrepresented in previous cancer research cohorts [39]. All of Us may therefore confer sufficient statistical power to understand the burden of cancer in these populations and identify opportunities for intervention. In the era of precision prevention and precision medicine, investigating the role of the environment in cancer risk is critical [40–42]. Realizing the potential of precision health will call for holistic measures of individual risk that take the physical environment into account.
We recently conducted a preliminary investigation of cancer in the All of Us Research Program [43] as part of a demonstration project to show the quality, usefulness, validity, and diversity of the All of Us data [44]. We generated descriptive statistics for the most common cancers and considered differences in cancer case ascertainment compared to what would be expected in the broader US population by data source type (self-reported cancer in survey data and/or from the electronic health record). We found that over 13,000 cancer cases were self-reported in the study population of 315,000 people and nearly 24,000 cancer cases were detected in the electronic health records collected for All of Us research participants.
Researchers currently have access to data from 409,876 All of Us participants through the Researcher Workbench, including residential data for linkage to air pollution exposure. Although the program does not target enrollment by health status, the sample includes sufficient participants with a history of cancer, prevalent cancers, and incident cancers to enable initial investigation of the role of the environment in cancer in the All of Us Research Program. Here we investigate, for the first time, the association between ambient air pollution and a health outcome in All of Us, and we present preliminary findings on the association of air quality and cancer in this key precision medicine cohort. We focus on fine particulate matter (PM2.5), but our analysis suggests that this is only a first step toward understanding the full impact of diverse environmental factors on cancer and the extensive health outcomes collected by the All of Us Research Program.

The All of Us Research Program
Data collected from 2017 to 2022 were accessed from the All of Us Research Program, a cohort of over 544,000 adults aged 18 and over living in the United States and its territories. The goals, recruitment methods and sites, and scientific rationale for All of Us have been described previously [37]. All of Us data include participants' responses to a series of questionnaires, physical measurements collected by study staff at time of enrollment, and information from participants' Electronic Health Records (EHR). These data are collected either at an All of Us affiliated health care provider organization (HPO) or through a "direct-volunteer" mechanism and are made available to researchers via the Researcher Workbench in registered, controlled, and restricted access tiers. Because zip code was required for this analysis, the data for this project were accessed at the controlled tier.

All of Us questionnaire data and physical measurements
Participant-provided information for our analysis, including self-reported cancer diagnoses, was derived from the Basics, Lifestyle, and Personal Medical History questionnaires. The full text of these questionnaires is available in the Survey Explorer found on the All of Us Research Hub, a publicly available website designed to support both researchers and the public [45]. The Basics questionnaire elicits demographic information including age, race/ethnicity, education, marital status, household income, and geography. The Lifestyle questionnaire collects data on the use of tobacco, alcohol, and other drugs. The Personal Medical History questionnaire collects self-reported cancer history. Age at cancer diagnosis in the survey is captured as child (0-11); adolescent (12-17); adult (18-64); older adult (65-74); and elderly (75+). The Basics and Lifestyle questionnaires are collected at baseline. Until recently, Personal Medical History was collected during retention efforts 3 months after enrollment; participants now have the option to complete this questionnaire at the time of enrollment. Body Mass Index (BMI) was calculated using participant height and weight collected by All of Us study staff at time of enrollment; height and weight data are housed in the Physical Measurements section of the Researcher Workbench.

EHR-derived cancer diagnoses
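Cancer diagnosis data were also derived from participant electronic health records linked to their All of Us data. EHR-derived diagnoses were determined using Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) codes and mapped to Observational Health and Medicines Outcomes Partnership (OMOP) concept IDs by the All of Us Data and Research Center. EHR data include procedures, medications, laboratory tests, and health care provider visits.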

Air pollution exposure data
Daily PM2.5 concentrations were estimated at a resolution of 1 km × 1 km across the contiguous US using a well-validated ensemble-based prediction model that integrates random forest regression, a gradient boosting machine, and an artificial neural network [46]. Over 100 variables were used for prediction in this approach, including satellite data, land-use information, weather variables, and modeled chemical transport characteristics. We used a 10-year PM2.5 average from 2007 to 2016 for our exposure estimate. Output from this approach has been validated against daily PM2.5 concentrations measured at 2,156 US EPA monitoring sites. The validation yielded an average cross-validated R-squared value of 0.86 for daily PM2.5 predictions, outperforming prior approaches [47,48].
While residential addresses are not available in the All of Us Researcher Workbench, the dataset does contain the 3-digit residential zip code prefix for each participant at enrollment. We therefore used zonal statistics to calculate the daily average PM2.5 concentration based on all 1 km × 1 km grids within each 3-digit zip code area. Specifically, we identified the 1 km × 1 km grids whose centroids fall within a given 3-digit zip code area and then averaged daily PM2.5 concentrations across all these grids. The average concentration was then taken as the PM2.5 exposure level for participants in that 3-digit zip code area. Figure 1 shows the distribution of All of Us participants by 3-digit zip code prefix.
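The zonal averaging step can be illustrated with a minimal Python sketch. It assumes that each grid cell has already been matched to the 3-digit zip code area containing its centroid; the function name `zonal_mean_pm25`, the input layout, and the toy values are hypothetical and are not taken from the study's code.

```python
from collections import defaultdict
from statistics import fmean


def zonal_mean_pm25(cells):
    """Average PM2.5 over the grid cells assigned to each 3-digit zip area.

    `cells` is an iterable of (zip3, pm25) pairs, one per 1 km x 1 km grid
    cell, where zip3 is the area containing the cell centroid (assumed to be
    precomputed). Cells without a prediction (None) are skipped.
    """
    by_zip = defaultdict(list)
    for zip3, pm25 in cells:
        if pm25 is None:
            continue
        by_zip[zip3].append(pm25)
    return {zip3: fmean(values) for zip3, values in by_zip.items()}


# Hypothetical toy grid: two cells in area "100", two in "606", one in "900".
cells = [("100", 9.0), ("100", 11.0), ("606", 10.5), ("606", None), ("900", 12.0)]
exposure = zonal_mean_pm25(cells)
```

Every participant in a given 3-digit zip code area would then receive the same value from `exposure`, which is why within-area variation cannot be captured by this assignment.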

Covariates
Following a review of known risk factors for cancer, we selected appropriate variables from the All of Us Researcher Workbench data for inclusion in all analyses. Baseline measurements of socioeconomic and demographic covariates including age (19-35, 36-50, 51-65, 65-89), sex at birth (female, male, other), race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic/Latino, Asian, other, multiracial, none of the above), current smoking status (yes, no), education (less than high school, high school graduate, some college, college graduate), and BMI (underweight, normal weight, overweight, obese) were included as covariates in the model.
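As an illustration of how such categorical covariates might be derived, the sketch below bins age and BMI in Python. The BMI cut points (18.5, 25, 30 kg/m2) are the conventional ones and are an assumption here, since the text does not state them; placing age 65 in the 51-65 band and using a separate "missing" level are likewise assumptions, and the function names are hypothetical.

```python
def age_group(age):
    """Map age in years to the bands listed in the text (65 assumed to fall in 51-65)."""
    if age is None:
        return "missing"
    if age <= 35:
        return "19-35"
    if age <= 50:
        return "36-50"
    if age <= 65:
        return "51-65"
    return "65-89"


def bmi_category(bmi):
    """Map BMI (kg/m2) to a category using conventional cut points (assumed)."""
    if bmi is None:
        return "missing"
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obese"


# Hypothetical participants: (age, BMI); None marks a missing measurement.
participants = [(40, 22.0), (70, 31.2), (65, None)]
coded = [(age_group(a), bmi_category(b)) for a, b in participants]
```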

Data analysis
Data were analyzed in the All of Us Researcher Workbench. The Researcher Workbench offers a secure environment and tools that enable users to select cohorts, create datasets for analysis, and conduct analyses using the R and Python programming languages in a Jupyter Notebook. We generated descriptive statistics and prevalence for 19 cancers and conducted Chi-square tests to determine the difference in the categorical distribution of data source types (survey data, EHR, and both) across key categories. Descriptive analysis was undertaken on the prevalence of cancer as well as air pollution, and on how these were distributed across the covariate groups. To investigate the association between PM2.5 and cancer, univariable and multivariable logistic regression were performed; under the rare disease assumption, the odds ratio approximates the relative risk, which simplifies interpretation. Analyses were restricted to cases from the EHR to ensure that the diagnosis date did not occur prior to the exposure. We present the exposure distribution for cases obtained from different sources (EHR, survey, combined), but because date of diagnosis is not available in the survey, we did not include the survey data in logistic models. The first model estimated the unadjusted association between PM2.5 exposure and the outcome of interest (cancer overall and by type). The second model was adjusted for age, sex at birth, race/ethnicity, smoking status, education, and BMI. PM2.5 concentration was analyzed both as a continuous variable and as a categorical variable (quartiles) in the regression models. To evaluate the non-linear relationship between PM2.5 exposure and cancer odds, we fitted a generalized additive model (GAM) including a spline term for PM2.5 concentration with 3 degrees of freedom and visualized the exposure-outcome response with adjustment for other covariates. Participants with missing cancer data were excluded, and missing values in covariates were treated as an independent category in the analysis. All analyses were conducted using the statistical software R version 4.2.1.
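To make the quartile comparison concrete, the following Python sketch assigns exposure quartiles and computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2 × 2 table. It is only an illustration of the unadjusted (first) model under stated assumptions: the study itself used logistic regression in R, the adjusted estimates require a multivariable model that is not reproduced here, and all function names and counts below are hypothetical.

```python
import math
from statistics import quantiles


def quartile_cutpoints(values):
    """Return the three cut points that split `values` into quartiles."""
    return quantiles(values, n=4)


def assign_quartile(x, cuts):
    """Return 1-4 according to how many cut points x exceeds."""
    return 1 + sum(x > c for c in cuts)


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval.

    a: cases in the exposed group (e.g. 4th quartile)
    b: non-cases in the exposed group
    c: cases in the reference group (e.g. 1st quartile)
    d: non-cases in the reference group
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical exposure values (ug/m3) and hypothetical case counts.
pm25 = [5.0, 6.5, 7.2, 8.1, 8.9, 9.6, 10.8, 12.4]
cuts = quartile_cutpoints(pm25)
groups = [assign_quartile(x, cuts) for x in pm25]
or_q4_vs_q1, ci_low, ci_high = odds_ratio_wald(a=30, b=70, c=20, d=80)
```

With a single binary exposure and no covariates, this 2 × 2 calculation gives the same odds ratio as a univariable logistic regression, so it serves as a quick check on the unadjusted estimates.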

Results
Table 1 shows the distribution of the mean annual PM2.5 exposure and the baseline characteristics of all participants (n = 409,876), among whom 42,462 participants had at least one self-reported or EHR-derived cancer diagnosis. Differences in age, sex at birth, race, smoking status, education, and BMI were observed between the participants overall, with older, female, Non-Hispanic White, non-smoking, more educated, and obese participants more likely to have data on cancer history. We also note differences in cancer outcomes by data source. A total of 33,387 participants had at least one cancer in their EHR (337,292 participants had EHR data), and 18,133 participants reported at least one cancer in the Personal Medical History questionnaire (146,815 participants completed this questionnaire). A total of 9,508 participants had at least one cancer in their EHR and in their Personal Medical History questionnaire responses. However, mean PM2.5 did not vary across these different populations. Figure 2 shows PM2.5 levels across the 862 3-digit zip code areas included in this analysis.
Table 2 shows that All of Us participants' EHR data indicate a history of breast cancer most frequently (n = 8,433; 18.26% of cases), followed by blood cancers (n = 5,856; 12.68%) and prostate cancer (n = 5,322; 11.53%). More cancers were detected passively in the EHR than were self-reported in the surveys, and the total case number is much lower (n = 9,502) for cancers cross-referenced in both the EHR and survey data. For the analysis of PM2.5 and cancer risk, the case population includes cases detected in the EHR (n = 46,176) with a diagnosis date after 2006. The number of cancer cases per participant is summarized in the supplemental table. Table 3 presents cancer type distribution across the quartile distribution of PM2.5 exposure. More than 25% of blood, brain, breast, cervical, endometrial, and ovarian cancers are observed in the highest exposure quartile (10.67-15.05 µg/m3).
Sex and race stratified results are presented in Supplementary Tables 2 and 3. When stratified by sex, the association with blood cancer is significant in males. The race/ethnicity stratified
Figure 3 presents the non-linear relationship between PM2.5 and cancers with a p-value for spline less than 0.10. A non-linear relationship was observed for blood cancer, bone cancer, brain cancer, breast cancer, colon and rectum cancer, endocrine system cancer, lung cancer, pancreatic cancer, prostate cancer, and thyroid cancer. Notably, although we observed inverse associations for bone cancer, lung cancer, and pancreatic cancer in Table 4, results from GAM suggest that high PM2.5 concentrations increase the odds for these cancers.

Discussion
In this study, the mean PM2.5 concentration was 8.9 µg/m3, in line with the WHO health-based world air-quality guideline [49,50]. The highest concentration of 15.1 µg/m3 was observed in California, while a prior review reported an annual average PM2.5 concentration of 7.0 µg/m3 in the US [33]. The difference can be explained by the spatial distribution of our study population. At present, because urban residents have easier access to All of Us HPOs, most participants are concentrated in large cities such as New York City, Chicago, and Los Angeles, where the level of air pollution is generally higher than in rural areas. However, even the highest PM2.5 concentration in this study indicates a recent reduction in average PM2.5 exposure level across the US. For instance, a US-wide cohort study based on the American Cancer Society (ACS) Cancer Prevention Study II (CPS-II) reported a median PM2.5 concentration of 12.5 µg/m3 between 1999 and 2008, with the highest concentration of 28.0 µg/m3 [19].
Outdoor air pollution has been classified as a Group 1 human carcinogen for lung cancer by the IARC since 2013 [31], a determination based largely on findings from outdoor air pollution exposure analysis in population cohort studies [14,15]. Similarly, a recent meta-analysis reported a 9% increase in risk for lung cancer incidence or mortality per 10 µg/m3 increase in PM2.5 concentration as well as an 8% (95% CI 0-17%) increase in risk per 10 µg/m3 for PM10 [16]. Our study observed an inverse association between PM2.5 and lung cancer. However, this inverse association was only manifest when the exposure level was low, which may reflect measurement error. In our analysis of the non-linear relationship, the odds for lung cancer increased when the PM2.5 level exceeded a certain threshold. Therefore, our observation is still consistent with prior conclusions.
While the IARC has reported adverse associations between outdoor air pollution and bladder cancer [31,51], this association was not observed in our study.
Systemic inflammation, oxidative stress, and epigenetic changes induced by PM exposure [52–55] are thought to play a role in the progression of breast tumors [56–60], and studies from a variety of settings demonstrate an association between PM2.5 levels and breast cancer mortality rates as well as all-cause mortality [56,61]. A recent analysis of 47,433 women in the US Sister Study found adverse associations between PM2.5 (HR per 3.6 µg/m3, 1.05; 95% CI 0.99-1.11) and breast cancer incidence overall (n = 2,848) [22]. An analysis of 57,589 women in the Multiethnic Cohort observed adverse associations of NOx, NO2, PM2.5, and PM10 and breast cancer incidence among those living within 500 m of major roads [26]. The Canadian National Breast Screening Study (n = 89,247) found adverse associations of both PM2.5 (HR per 10 µg/m3, 1.26; 95% CI 0.99-1.61) and NO2 (HRs per 9.7 ppb, range 1.13-1.17) and the risk of incident premenopausal disease [62,63]. However, no other recent studies have reported clear associations with incident breast cancer risk [23,64,65]. In our study, we did observe increased risk for breast cancer associated with PM2.5 exposure. This association was more evident when the PM2.5 level was high. The finding is generally consistent with previous studies that present suggestive associations for breast cancer. The larger number of breast cancer cases in this study yielded larger statistical power and may explain why we could observe associations in this study.
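Because the cited hazard ratios are reported per different PM2.5 increments (3.6 µg/m3 versus 10 µg/m3), they are not directly comparable as printed. The Python sketch below re-expresses a ratio on a common increment under the assumption of a log-linear exposure-response relationship; this assumption and the function name `rescale_ratio` are ours and do not come from the cited studies.

```python
import math


def rescale_ratio(ratio, reported_increment, target_increment):
    """Re-express a hazard or odds ratio per `target_increment` of exposure.

    Assumes the log of the ratio scales linearly with exposure, so that
    log(ratio) per unit of exposure is constant.
    """
    log_per_unit = math.log(ratio) / reported_increment
    return math.exp(log_per_unit * target_increment)


# Sister Study point estimate (HR 1.05 per 3.6 ug/m3) rescaled to 10 ug/m3.
sister_per_10 = rescale_ratio(1.05, reported_increment=3.6, target_increment=10.0)
```

The same function can be applied to the confidence limits, although the rescaled values remain only as reliable as the log-linearity assumption.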
We also observed significantly increased odds for endometrial and ovarian cancers. A recent study conducted in Beijing supports the gynecologic risks associated with air pollution [66]. However, in our study the mean PM2.5 concentration was lower than 10 µg/m3, a level in line with the World Health Organization (WHO) health-based world air-quality guideline [48,49]. Our findings warrant further investigation of these cancers in air pollution studies.
A limitation of this study is that we only examined the association of PM2.5 with cancers, while other pollutants such as SO2, NO2, NOx, and O3 were not included. PM2.5 is the most investigated pollutant and is often used as an indicator of overall air quality. However, investigating PM2.5 alone may lead to an underestimation of the association between air pollution and cancer risks. For instance, a recent review found that a higher risk of breast cancer was associated with NO2 and NOx, but not PM2.5 [60]. Another meta-analysis on leukemia concluded that higher exposure to NO2, but not PM2.5, was associated with higher leukemia risk. Additionally, this study only includes ambient PM2.5 exposure level and relies on historical data. A multi-level approach that accounts for multiple pollutants and sources is warranted in future studies.
To preserve participant privacy, the All of Us Researcher Workbench only offers participant data at the 3-digit zip code prefix level, rather than at the full 5-digit level, which would confer higher spatial resolution for exposure estimates. As the first three digits of a zip code designate a city or a larger rural area, exposure assessment in this study may underestimate geospatial variations in air pollution. Recent epidemiological research has demonstrated the importance of within-city variability in air pollution concentration [67,68]. However, the current resolution in this study is not sufficient to account for this within-city variability and thus may overlook exposure inequalities faced by urban minorities and underestimate the true associations. Another notable limitation is that we relied on the self-report and electronic health record capture of both incident and prevalent cancers and did not distinguish between primary and secondary cancers. We report differences in the effect based on the source of cancer report. The degree of impact of multiple cancers is illustrated in Supplemental Table 1. Likewise, self-report data are not sufficiently detailed to allow for finer-grained analysis including reproductive or menopausal factors for breast cancer. We also found significant disparity by race in the self-reported survey data. For example, while Non-Hispanic Black participants comprised 18.76% of the overall sample population, they accounted for only 6.12% of self-reported cancers. Similarly, participants identifying as Hispanic/Latino comprised 18.03% of our sample, yet they accounted for only 5.82% of self-reported cancers. This disparity is consistent with our previous analysis of All of Us data and highlights the importance of continued engagement with populations historically underrepresented in biomedical research by both incentivizing and removing barriers to follow-up data collection [43]. The difference in association between cancer risk and PM2.5 based on data source is clearly illustrated in our report.
Furthermore, the representativeness of this work is limited given the sampling plan; as illustrated in Figs. 1 and 2, the health provider organizations that account for the greatest share of participant recruitment are generally located in metropolitan areas. In addition, at the current stage, the All of Us data used for this analysis are cross-sectional in nature, as we relied on baseline data and limited longitudinal transfer of EHR. It is therefore difficult to establish temporality between air pollution and cancer outcomes, and it is impossible to investigate cancer progression in relation to air pollution. However, reverse causation, the greatest concern in cross-sectional studies, is not likely in this study, as higher cancer prevalence does not cause higher air pollution. The association between air pollution and cancer prevalence observed in this study still supports the adverse impact of air pollution on cancer outcomes. Likewise, the cross-sectional nature of the current data also presents the limitation of a lack of "latency" or "lag" of exposure. To address this limitation, our analysis used the 10-year PM2.5 average from 2007 to 2016, aiming to cover the cancer progression stages before the study enrollment period. However, we understand that these efforts cannot completely offset the limitation induced by the study design. Some inverse associations observed in this study may be the consequence of this limitation.
The study has several notable strengths. First, while previous studies have been limited by small numbers of cancer cases, the sample size of this study, with more than 400,000 participants, makes it the largest investigation of the association between air pollution and cancer to date. Second, research on the carcinogenicity of air pollution has long focused nearly exclusively on lung cancer; however, outdoor air pollution might cause cancer at sites other than the lung through absorption, metabolism, and distribution of inhaled carcinogens. Other cancer types, including leukemia and breast cancer, have also been investigated in relation to air pollution. However, to our knowledge no study has simultaneously investigated as many cancer types as this one. Third, the study design of All of Us will eventually enable researchers to analyze cancer risk longitudinally (although in this early analysis we are restricted to essentially cross-sectional data), thus providing additional opportunities to consider the role of air pollution in cancer occurrence and development. Many prior studies have only been able to use cancer mortality as the outcome and thus may underestimate the true odds for some cancers.
In summary, the All of Us Research Program presents significant opportunities to further evaluate the role of the environment and air pollution in cancer odds and outcomes. We have observed associations of PM2.5 exposure with several types of cancer, with risks that differ by race/ethnicity. This preliminary investigation suggests that some previous findings on cancer and PM2.5 are also observed in All of Us, for instance in our breast cancer results. Given the large and diverse All of Us study population, it may be possible to further consider the role of the environment in cancer disparities in addition to cancer risk in general. In the coming years, All of Us may confer sufficient study power to research the role of the environment in cancers that have historically been infeasible to investigate due to small sample size. This project should provide some preliminary insight and direction for future investigation.

Fig. 1 All of Us participant population distribution by 3-digit zip code prefix

Fig. 2 Ambient mean PM2.5 estimates in All of Us participant locations

Fig. 3 Non-linear relationship between PM2.5 and cancer by type

Nomenclature of Medicine-Clinical Terms (SNOMED CT) codes and mapped to Observational Health and Medicines Outcomes Partnership (OMOP) concept ID by the All of Us Data and Research Center. EHR data include procedures, medications, laboratory tests, and health care provider visits. Our analysis used the following OMOP parent concept IDs for cancers/cancer sites: bladder: 93689003,

Table 1
Distribution of All of Us Research Program participant characteristics by cancer case ascertainment source

Table 3
Cancer type distribution by mean annual outdoor PM2.5 (μg/m3) quartile exposure categories
a 26 cases were in 3-digit zip code areas with no PM2.5 observations

Table 4
Cancer odds by increasing quartiles of mean annual PM2.5 exposure
a Adjusted for sex at birth, race/ethnicity, age, smoking status, education, and BMI