Introduction

Air pollution represents the biggest environmental risk to health, and outdoor air pollution alone has been associated with an estimated three million premature deaths in a recent year (World Health Organization (WHO), 2016). Harmful substances, such as carbon monoxide, nitrogen dioxide, and heavy metals, originate from natural and anthropogenic sources ranging from volcanoes to industrial activities, vehicle emissions, and the combustion of fossil fuels, and can take the form of solid particles, liquid droplets, or gases (Fiordelisi et al. 2017). Fine particulate matter with a diameter less than or equal to 2.5 μm (PM2.5) is the sum of such solid or liquid pollutants suspended in the air and is particularly important because its small size allows it to stay in the air longer (Lin et al. 2018), increasing the chance of inhalation.

PM2.5 has been shown to have significant adverse effects on human health, including reduced lung function and increased risk of chronic bronchitis, heart disease, lung cancer, and various forms of cardiovascular and cerebrovascular diseases and mortality (Fiordelisi et al. 2017; Lu et al. 2015). The mechanisms linking PM2.5 and human health outcomes are not fully understood, but unlike larger particles, small particles like PM2.5 can penetrate deep into the respiratory system or be absorbed into the bloodstream, and may increase systemic inflammation and alter autonomic nervous system activity (Fiordelisi et al. 2017; Shah et al. 2015).

Although air quality has improved over the years, many cities and regions in European and Asian countries still have annual average PM2.5 concentrations higher than 10 μg/m3, the limit recommended by the WHO (Thunis et al. 2017). An increase in PM2.5 concentration of 10 μg/m3 was associated with 8–27% higher lung cancer risk (Pope et al. 2002; Turner et al. 2011) and 4% higher all-cause mortality risk (Pope et al. 2002), while a reduction of 10 μg/m3 extends life expectancy by 0.35 years (Correia et al. 2013). Emergency hospital admissions for cerebrovascular disease increased by 1.3% for every 10 μg/m3 increase in PM2.5 concentration (Leiva et al. 2013).

The aim of our study was to evaluate the acute effect of PM2.5 pollution on cerebrovascular disease (CVD) mortality using data from Shanghai, China, between 2012 and 2014. We focused on the influence of PM2.5 concentration on deaths occurring on the same day. Because we adjusted for a number of meteorological variables, some of which were collinear, the secondary aim was to compare adjustment techniques that are commonly used in the presence of collinearity: principal component analysis (PCA), shrinkage smoothers, and the least absolute shrinkage and selection operator (LASSO).

Materials and methods

Study setting and data

The data used in this study include daily observations of the number of CVD deaths, PM2.5 concentrations, and meteorological conditions between 2012 and 2014 in Shanghai, China, with a total of 1091 observed days. During the study period, the population of Shanghai was approximately 24.0 million, and the life expectancy was around 83 years (Shanghai Statistics Bureau 2014). A detailed description of the data has been published elsewhere (Fang et al. 2017). Briefly, daily average PM2.5 concentrations between January 1, 2012, and December 31, 2014, were obtained from the air quality monitoring station of the US Consulate General in Shanghai and the Shanghai Meteorological Bureau. Only the measurements from a single air quality monitoring station were available during the study period. The daily mortality data for the corresponding period for all 16 administrative districts in Shanghai were obtained from the Causes of Death Registry of the Shanghai Municipal Centre for Disease Control and Prevention. The causes of death were coded according to the International Classification of Diseases, 10th revision (ICD-10), and the codes for cerebrovascular disease deaths were I60–I69. Thirteen meteorological variables and the day of the week were used as potential confounding factors.

Statistical model

Mean, standard deviation (SD), median, first quartile (Q1), and third quartile (Q3) were calculated for daily CVD deaths, PM2.5 concentrations, and meteorological variables. Univariate associations of CVD mortality with PM2.5 and with the meteorological variables were evaluated using a generalized linear model with a log link function. A generalized additive model (GAM) was used for multivariate analysis. The GAM is a generalization of the generalized linear model and is widely used in time series studies on the health effects of air pollution because it does not assume a particular functional form for a relationship and is flexible for modelling nonlinear associations (Crawley 2013). It relates the expected value of a response variable Y from a given distribution to the independent variables by estimating non-parametric functions of those variables, which are connected to the response through a link function. Because of the nonlinear relationship between death counts and weather conditions, and because daily deaths within a fixed period of time are count data suited to a Poisson distribution, the effect of PM2.5 on daily deaths in the present study was modelled as:

$$ \log\left[E\left({Y}_t\right)\right]=\log\left({\mu}_t\right)={\beta}_0+{\beta}_1{X}_t+\sum_{i=1}^n{f}_i\left({Z}_{it}\right)+\sum_{j=1}^6{\gamma}_j{D}_{jt} $$
$$ \mathrm{and}\ {Y}_t\sim \mathrm{Poisson}\left({\mu}_t\right) $$

where E(Yt) is the expected number of deaths on day t, β0 is the intercept, β1 is the slope for PM2.5, Xt is the PM2.5 concentration on day t, fi are the smoothing functions, Zit are the confounding variables (i.e., time and weather conditions) on day t, and Djt (j = 1, 2, … , 6) are the dummy variables for the day of the week (DOW), with coefficients γj.
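As a rough illustration of the parametric part of this model only, the following Python sketch fits log(μt) = β0 + β1Xt to simulated Poisson counts by Newton-Raphson. It omits the smooth terms and the day-of-week dummies. The simulated concentrations, the seed, and the "true" coefficients are assumptions made for the example, not the study data, and the study itself was fitted in R.

```python
import math
import random


def simulate_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def fit_poisson_loglinear(x, y, iterations=25):
    """Newton-Raphson fit of log(mu) = b0 + b1 * x for Poisson counts."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(2012)
true_b0, true_b1 = math.log(55.0), 0.0016  # hypothetical values for the simulation
pm25 = [rng.uniform(10.0, 150.0) for _ in range(1000)]  # simulated exposure
deaths = [simulate_poisson(math.exp(true_b0 + true_b1 * x), rng) for x in pm25]

b0_hat, b1_hat = fit_poisson_loglinear(pm25, deaths)
print(f"estimated slope: {b1_hat:.5f} (simulated truth {true_b1})")
```

A full GAM would add penalized spline terms for time and weather to the linear predictor; the sketch is meant only to show how the log link ties the PM2.5 coefficient to the expected daily count.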

The smoothing functions fi are composed of non-parametric splines of the confounding variables. Splines are piecewise polynomial curves joined at inner knots. The splines most widely used in GAMs are the natural spline, cubic spline, and B-spline with a pre-specified number of knots. In our study, two types of smoothing splines, the cubic regression spline (cr) and the thin plate spline (tp), were used and compared. To avoid over-fitting, the optimal number of knots (k) for each of the confounding variables of time and weather conditions was selected iteratively based on Akaike’s information criterion (AIC). We compared k = 4, 5, 6, 7, and 8 because most studies show that k = 3 is sufficient, and there was often no notable improvement with more than 8 knots.

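The PM2.5 coefficient multiplied by 10 is interpreted as the relative increase in cerebrovascular deaths per 10 μg/m3 increment in PM2.5 concentration, expressed as a percentage. Under the log link, the exact percentage increase is [exp(10β1) − 1] × 100, which is close to 10β1 × 100 when β1 is small.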
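The conversion from a log-link coefficient to a percentage change can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the only input taken from the paper is the unadjusted coefficient reported in the Results.

```python
import math


def percent_increase(beta, delta=10.0):
    """Percentage change in expected deaths for a `delta`-unit rise in exposure."""
    return (math.exp(beta * delta) - 1.0) * 100.0


beta_pm25 = 1.653e-3  # unadjusted PM2.5 coefficient reported in the Results
exact = percent_increase(beta_pm25)
approx = beta_pm25 * 10.0 * 100.0  # first-order approximation (coefficient x 10)
print(f"exact: {exact:.3f}%  approximation: {approx:.3f}%")
```

For a coefficient this small, the two figures differ only in the second decimal place (roughly 1.67% against 1.65%), so the linear approximation is adequate here.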

Controlling for multicollinearity

In addition to controlling for temporal trends in daily cerebrovascular deaths, we considered meteorological variables as potential confounding factors (Jimenez-Conde et al. 2008). Multicollinearity among the 13 meteorological variables was examined using the Spearman correlation and the variance inflation factor (VIF), and statistically significant multicollinearity was found among the variables. To deal with the multicollinearity and high dimensionality of these confounding variables, we used three different approaches. The first approach used principal component analysis (PCA) to convert the variables into mutually orthogonal principal components (PCs) (Yang et al. 2015). Because reducing the number of variables, or dimensionality, may lead to a loss of useful information about the original data, we also repeated the analysis using shrinkage smoothers as an alternative approach. Shrinkage is the procedure of compressing extreme values towards a central value (Tibshirani 1996). The benefit of using shrinkage smoothers is that less important variables can be shrunk to zero (Marra and Wood 2011). When using the PCA and shrinkage techniques, we applied both cr and tp splines. The third approach used least absolute shrinkage and selection operator (LASSO) regularization, which selected variables using the mixing parameter α = 1, that is, the pure LASSO penalty within the elastic-net family (Friedman et al. 2010).
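To make the VIF diagnostic concrete, the sketch below computes VIFs as the diagonal of the inverse Pearson correlation matrix, which is equivalent to 1 / (1 − R²) from regressing each predictor on the others. It uses only the Python standard library. The three simulated series (temperature, pressure, wind) and their parameters are invented for illustration and are not the study's meteorological data.

```python
import math
import random


def pearson_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    size = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(size)] for i in range(size)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(pearson_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


rng = random.Random(0)
temperature = [rng.gauss(17.0, 9.0) for _ in range(500)]                   # simulated
pressure = [1016.0 - 0.8 * t + rng.gauss(0.0, 1.5) for t in temperature]   # tied to temperature
wind = [rng.gauss(3.0, 1.0) for _ in range(500)]                           # independent

vifs = variance_inflation_factors([temperature, pressure, wind])
for name, vif in zip(["temperature", "pressure", "wind"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

In this toy setting the two linked series receive large VIFs while the independent one stays near 1, which is the pattern that motivates PCA, shrinkage, or variable selection before fitting the model.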

All analyses were conducted in R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria) using the packages psych, mgcv, and glmnet. Two-sided p values less than 0.05 were considered statistically significant.

Results

Descriptive statistics and univariate association

The descriptive statistics of daily mortality caused by cerebrovascular diseases, PM2.5 concentrations, and the meteorological variables are presented in Table 1. The mean daily PM2.5 concentration was 55 μg/m3, and the daily average number of cerebrovascular deaths was 62 over the study period. The estimated coefficient for PM2.5 was 1.653 × 10−3, corresponding to a 1.653% increment in daily CVD deaths per 10 μg/m3 increment in PM2.5 concentration. All meteorological variables other than extreme wind speed showed a statistically significant association, and higher values tended to be associated with lower death risk, except for the daily atmospheric pressure variables.

Table 1 Descriptive statistics and univariate association for the risk of cerebrovascular disease death

There was multicollinearity among PM2.5 and the meteorological variables (Table 2). Most of the correlation coefficients were statistically significant. For example, daily average atmospheric pressure had a strong negative correlation with average temperature (r = − 0.87), average humidity was negatively correlated with sunshine (r = − 0.67), and volume of rain was negatively correlated with sunshine (r = − 0.61). According to the VIF values, daily average pressure, maximum pressure, minimum pressure, daily average temperature, maximum temperature, minimum temperature, daily average humidity, minimum humidity, maximum wind speed, and extreme wind speed were highly correlated with other variables, and daily average rain volume, wind speed, and sunshine were moderately correlated with other variables (Table 3) (O’Brien 2007). For example, the VIF value of daily average pressure indicates that the variance of the estimated coefficient of this variable is inflated by a factor of 273.34 because it is highly correlated with at least one other predictor variable. The correlations between PM2.5 and the meteorological variables were rather weak, and most of them had a Spearman’s r < 0.30 (Table 3).
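For reference, the VIF of a predictor j is defined from the coefficient of determination obtained when that predictor is regressed on all the others. Inverting the definition for the value quoted above gives the following worked illustration, which is derived from the reported VIF and is not a figure taken from the tables:

$$ {\mathrm{VIF}}_j=\frac{1}{1-{R}_j^2}\quad \Rightarrow \quad {R}_j^2=1-\frac{1}{273.34}\approx 0.996 $$

In other words, assuming the VIF was computed among the predictors in the usual way, about 99.6% of the variance of daily average pressure is explained linearly by the remaining predictors.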

Table 2 Spearman correlation coefficients between the meteorological variables
Table 3 Variance inflation factor (VIF) of meteorological conditions

Results of the GAM-PCA approach

The first four principal components derived from PCA explained approximately 91% of the variation in the original data (Table 4). After adjusting for meteorological conditions and day of the week, there was no longer a statistically significant association between daily PM2.5 concentration and CVD mortality in either the cr or the tp spline model. All the smoothing functions of the PCs were statistically significant except for that of the fourth PC.

Table 4 Summary of the GAM-PC analysis for the risk of cerebrovascular disease mortality

Results of the GAM shrinkage smoothers approach

Both shrinkage smoothers (cr and tp) shrank daily average atmospheric pressure, maximum atmospheric pressure, minimum atmospheric pressure, minimum temperature, average humidity, average rain volume, average wind speed, maximum wind speed, extreme wind speed, and sunshine to zero (Table 5). Again, no statistically significant association was found between daily PM2.5 concentration and CVD mortality after adjusting for meteorological conditions and day of the week. The smoothing terms of average and maximum temperature and daily minimum humidity remained statistically significant.

Table 5 Summary of the GAM with shrinkage smoothers analysis for the risk of cerebrovascular disease mortality

Results of the GAM-LASSO approach

Daily average temperature and minimum temperature were selected using the LASSO procedure and included in the model (Table 6). The association between daily PM2.5 concentration and cerebrovascular mortality was no longer statistically significant after adjustment, but the smoothing term of average temperature remained statistically significant.

Table 6 Summary of the GAM analysis with selected variables by LASSO

Discussion

In the current study, we evaluated the effect of daily PM2.5 concentrations on CVD deaths in Shanghai, China, using GAM analysis with three different approaches for controlling for collinear confounding variables while simultaneously taking into account the linear and nonlinear relationships between meteorological confounding variables and the number of deaths. The daily average PM2.5 concentration in our data was 55 μg/m3, much higher than the yearly average limits of 10 μg/m3 and 25 μg/m3 recommended by the World Health Organization and the European Union, respectively (Thunis et al. 2017). There was a 1.7% elevated risk of CVD death per 10 μg/m3 increase in PM2.5 concentration in the unadjusted model, but after adjustment for meteorological variables and temporal trend, exposure to PM2.5 was no longer associated with CVD death. The results were consistent across the three modelling techniques we used.

A body of research has shown that an increase in PM2.5 is linked to an elevated risk of stroke (Franklin et al. 2008; Leiva et al. 2013; Lin et al. 2017; Lisabeth et al. 2008; Shah et al. 2015; Wellenius et al. 2012), ischemic heart disease (Pope et al. 2002), myocardial infarction (Peters et al. 2001), and cerebrovascular mortality (Gutierrez-Avila et al. 2018), and recent reviews showed that PM2.5 was associated with an approximately 1% elevated risk of cerebrovascular mortality for every 10 μg/m3 increase (Shah et al. 2015; Wang et al. 2014). These studies revealed an increased risk of CVD associated with short-term exposure to PM2.5 in Europe, the USA, Asia, Africa, and South America. The excess risk per 10 μg/m3 increase in PM2.5 concentration ranged from 0.7% for mortality to 1.3% for hospital admission. In our data, the unadjusted association was similar but somewhat higher (1.7% excess mortality per 10 μg/m3 increase in PM2.5 concentration). It has proved difficult to quantify premature mortality related to air pollution, notably in regions where air quality is not systematically monitored, and also because the toxic particles from various sources may vary (Tuomisto et al. 2008). The estimated effect of PM2.5 on premature mortality largely depends on the toxicity of the inhaled particle components. In China, emissions from residential energy use such as heating and cooking make the largest contribution to PM2.5, whereas in much of the USA and in a few other countries, emissions from traffic and power generation are important. In the eastern United States, Europe, Russia, and East Asia, agricultural emissions make the largest relative contribution to PM2.5 (Lelieveld et al. 2015). This might partially explain the difference in findings between our study and other studies. On the other hand, the difference might be in part due to profoundly different demographic characteristics, socioeconomic status, or environmental conditions.
For example, Shanghai is a megacity with a dense population and high temperature and humidity year-round, and PM2.5 may affect public health there differently from other studied areas. Like the aforementioned studies, our study also controlled for the temporal trend of deaths and for confounding by meteorological variables. However, we noticed that when we adjusted for quite a few meteorological variables, the association was no longer statistically significant. The results were consistent across different modelling strategies, while the nonlinear association of CVD mortality with the meteorological variables remained. The observed PM2.5 concentrations in Shanghai were substantially higher than those in Europe or the USA, where most of the current studies have been conducted and conclusions derived. The high ambient PM2.5 concentration in itself might mask the triggering effect of exposure, with the meteorological factors only contributing to the exacerbation of a PM2.5 exposure effect. This might be a potential reason that no statistically significant effect was observed for PM2.5, and it deserves further investigation in comparative studies using data from regions with low PM2.5 pollution.

Our study has several strengths. First, adjustment using the rich information in the meteorological variables enabled detailed control for weather conditions. Weather conditions and time-varying risk factors such as day of the week may substantially modify PM2.5 levels and cerebrovascular events (Zhang et al. 2014). Second, multicollinearity among the meteorological variables was handled using the different approaches of PCA, shrinkage smoothers, and LASSO regularization, and the results were consistent across all the methods. Third, the use of two types of smoothing splines for the GAM allowed us to compare results and minimize the bias from spline selection, and the results were, again, consistent regardless of the type of smoothing spline. In the multicollinearity context, shrinkage methods such as ridge regression mitigate collinearity by shrinking coefficient estimates towards zero but not exactly to zero. In LASSO, by contrast, one of the correlated coefficients is usually set to zero and the other is assigned the entire impact. Because of this, ridge regression is expected to work better if there are many large parameters of about the same value. LASSO, on the other hand, is expected to come out on top when only a few factors actually have an impact (James et al. 2013).
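The contrast between ridge and LASSO shrinkage can be seen in the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the least-squares coefficient. The Python sketch below applies them to a handful of hypothetical coefficients. The values, the penalty, and the function names are made up for illustration, and the formulas hold up to the scaling convention chosen for the penalty λ.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge estimate for one coefficient under an orthonormal design."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """LASSO estimate (soft-thresholding) under an orthonormal design."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # coefficient is removed from the model
    return magnitude if beta_ols > 0 else -magnitude


ols = [2.0, -1.2, 0.3, -0.1]  # hypothetical least-squares coefficients
lam = 0.5                     # hypothetical penalty

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

for b, r, l in zip(ols, ridge, lasso):
    print(f"OLS {b:+.2f} -> ridge {r:+.3f}, LASSO {l:+.3f}")
```

Ridge scales every coefficient by the same factor and never reaches zero, whereas soft-thresholding subtracts a fixed amount and sets the two small coefficients exactly to zero, which is why LASSO doubles as a variable selection step.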

However, our study also has limitations. First, the relatively short time period of the analysis limited our ability to fully assess the long-term time trend in both PM2.5 pollution levels and cerebrovascular disease mortality. Second, only city-level PM2.5 concentrations from one air monitoring station were available in our study, but the concentration of PM2.5 may differ within the city and change during the day, and people’s locations are not constant. As a megacity with a population of about 24 million and an area of 6,340 km2, Shanghai has vastly different PM2.5 concentrations and meteorological conditions across the city. Although deaths from 16 administrative districts in Shanghai were recorded, only aggregated deaths for the whole city, and PM2.5 concentrations and meteorological variables from a single monitoring station, were available in the study. This aggregation may mask the spatial and temporal variability of PM2.5 effects at particular exposure hotspots; for example, the heat island effect in some parts of the city may exacerbate the effects of PM2.5 because high temperatures may also affect the outcomes. To overcome the lack of spatial variability in PM2.5 concentrations and/or meteorological data, a land use regression approach could be incorporated in the future (Liu et al. 2016). Air pollution and mortality data from multiple sites would be more helpful for adjusting for confounding from the spatiotemporal variability in PM2.5 concentrations and mortality. Although many studies have relied on air pollution information that assumes a constant location of people at, for example, their residential addresses, quantifying exposure using the time and activity patterns of individuals will also enhance understanding (Reis et al. 2018). Third, cerebrovascular mortality risk may vary by age, sex, and socioeconomic factors, but these characteristics were not controlled for in the current study.
However, these characteristics tend to be stable within the city given our relatively short study period, so confounding by them is unlikely (Pope et al. 2002). Fourth, we focused on the same-day effect of a single pollutant and did not include multiple pollutants (Wang et al. 2014) or lagged effects (Gutierrez-Avila et al. 2018). Although it is possible that air pollution may cause death after a certain period of time, a systematic review indicates that the risk hardly differed when lags were included (Shah et al. 2015).

Conclusion

In this study, we aimed to evaluate the effect of PM2.5 pollution on CVD mortality using GAM with three different approaches for controlling for confounding factors. The initially statistically significant 1.65% elevated risk of CVD death was no longer observed after adjusting for a number of meteorological variables. As a large number of people are exposed to air pollution, further analysis using data with various measurement times, periods, and detailed pollutant and exposure profiles would contribute to enhancing the understanding of the impact of PM2.5 on human health and its mechanism.