Background

Air pollution has long been recognized to have a wide range of effects on human health, and long-term exposure to air pollutants is associated with a range of adverse health outcomes, such as an increased risk of dementia [1], type 2 diabetes [2], and lung cancer [3], among others. These associations can lead to increased medical costs for individuals. One study in South Africa found that failure to meet the U.S. National Ambient Air Quality Standard caused $14 billion in premature death-related losses in 2012, equivalent to 2.2 percent of that year’s gross domestic product (GDP) [4].

China has adopted a number of measures to control air pollution in recent years and has made some progress; however, more improvements are expected in the future [5]. Not only was approximately 42% of China's population still exposed to annual average PM2.5 concentrations above 35 μg/m3 in 2018 [5], but ground surface ozone pollution is also causing increasing concern [6, 7]. Moreover, it is estimated that approximately 1.24 million Chinese people lost their lives to air pollution in 2017 [8]. A previous study using provincial-level data from China estimated that PM2.5 could result in $25.2 billion in health expenditure in 2030, approximately 2% of GDP, in the absence of a PM2.5 pollution control policy [9].

However, studies exploring the causal relationship between long-term air pollution and individual medical costs based on large cohorts are lacking, especially in developing countries such as China. Further, a significant problem faced by the current literature is that the effects of air pollution on health outcomes are often endogenous [10, 11], and it is challenging to control for this endogeneity to verify a causal relationship [12, 13]. Another problem is that most of the existing studies have been conducted at the city level and lack individual-level evidence; this is particularly important as a significant proportion of the population does not spend any medical costs in real life due to their good health. Thus, failure to properly address this special distribution will lead to biased results [10]. Furthermore, people may respond differently to different types of air pollutants, such that some will be better protected than others [14, 15]. Accordingly, it is of interest to examine whether different patterns of pollutants have consistent causal effects on individual medical costs.

To this end, this study utilized balanced panel data from 2014, 2016, and 2018 waves of the Chinese Family Panel Study (CFPS), a large representative long-term nationwide cohort, to develop a Tobit regression model combined with correlated random effects and the control function method to explore the causal effect of long-term exposure to different types of air pollutants on individual medical costs. The findings of this study can inform policy recommendations for improved control of the medical costs associated with air pollution.

Methods

Study population

The participants of this study were drawn from the CFPS, a nationwide, representative longitudinal cohort of Chinese adults. Participants for the CFPS were sampled from 25 provinces/autonomous regions/municipalities and were selected using a stratified multistage probability strategy. The survey has been conducted every 2 years since 2010. A 5-year balanced panel of adults (N = 8928) who were aged 18 years or older, had never moved out of their district/county of residence, and were interviewed in 2014, 2016, and 2018 was selected for this study. The county/district coverage of the sample and the number of participants within each county/district are shown in Additional file 1: Fig. S1-S3.

The characteristics of the participants included in this study were also compared with those of the complete 2014 survey sample, as shown in Fig. 1. Although we were not able to include all respondents because of missing data and loss to follow-up, and the two groups differ in some characteristics, the participants in this study did not deviate substantially from the complete survey sample. A flow chart of the participant selection and analysis process is given in Fig. 2.

Fig. 1 Characteristics of participants in this study compared to the full surveyed population in CFPS 2014

Fig. 2 Flow chart of the participant selection and analysis

Air pollution and instrumental variables assessment

PM2.5 and ground surface ozone were selected as the proxy variables for air pollution in this study to explore the causal effects of different types of air pollution on medical costs. PM2.5 is one of the most concerning pollutants globally, producing one of the largest health burdens [5], while in China, ground surface ozone has received increased attention in recent years [16]. Another reason we chose these two pollutants is that the former represents those pollutants for which people may take conscious steps to avoid high levels of pollution, while the latter is the opposite. We will further elaborate on this point in the discussion. Monthly PM2.5 and ground surface ozone concentrations in China over the study period, at a 1 km × 1 km resolution, were obtained from the Tracking Air Pollution in China Database [17]. This database has been shown to have good long-term accuracy within the time frame of this study [17].

Planetary boundary layer height (PBLH) and wind speed were selected as instrumental variables in this study. PBLH reflects the depth of air at the Earth’s surface. Above the planetary boundary layer is the free atmosphere, and the transport of pollutants from the boundary layer to the free atmosphere is slow [18, 19]. Wind speed reflects the speed of air moving at a height of 10 m above the Earth’s surface. Wind speed may affect pollutant mobility and thus pollutant concentrations, while PBLH changes the concentration by changing the volume of local pollutants [20, 21]. It is generally accepted that air pollution provides the only satisfactory pathway through which wind speed and PBLH can affect health outcomes such as medical costs for residents of a specific area [20, 21]. The inclusion of two or more instrumental variables provides the opportunity to test the necessary hypotheses that make causal inferences valid [10, 11].

The ERA5 database, a global atmospheric reanalysis product with high spatial and temporal resolution, developed by the European Centre for Medium-Range Weather Forecasts (ECMWF), was accessed to obtain monthly PBLH data in China at a 2.5 km × 2.5 km resolution [22]. PBLH was measured in meters (m). The ERA5-Land was accessed to obtain monthly wind speed data in China at a resolution of 1 km × 1 km [23]. Wind speed was measured in meters per second (m/s).

In this study, participants were interviewed for three waves in 2014, 2016, and 2018, and we calculated the average PM2.5, ground surface ozone, PBLH, and wind speed exposure for each respondent for the twelve months prior to his/her specific interview time (to the exact month) for each wave. This allowed for a more accurate assessment of each respondent's instrumental variable data and long-term air pollution exposure.
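The exposure window described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy monthly series, and the choice to exclude the interview month itself from the twelve-month window are all assumptions made for the example.

```python
def months_before(year, month, n=12):
    """Return the n (year, month) pairs immediately preceding the given month."""
    out = []
    y, m = year, month
    for _ in range(n):
        m -= 1
        if m == 0:
            y, m = y - 1, 12
        out.append((y, m))
    return out


def trailing_mean(monthly, year, month, n=12):
    """Average a {(year, month): value} series over the n months before the interview."""
    window = months_before(year, month, n)
    values = [monthly[key] for key in window]
    return sum(values) / len(values)


# Hypothetical monthly PM2.5 series (ug/m3) for one county:
# every month of 2013 is 50, every month of 2014 is 40.
pm25 = {(2013, m): 50.0 for m in range(1, 13)}
pm25.update({(2014, m): 40.0 for m in range(1, 13)})

# Interview in July 2014: the window runs from July 2013 to June 2014.
exposure = trailing_mean(pm25, 2014, 7)
```

The same helper would apply unchanged to ozone, PBLH, and wind speed, since all four series are monthly and are averaged over the same window.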

Medical costs assessment

The CFPS recorded the total hospitalization costs (hospitalization was defined as admission to a hospital room for at least one night due to illness or accidental injury) in the year prior to the interview; this included all medical-related and non-medical-related costs, such as medical examinations, medical treatments, accommodation, nursing care, and transportation fees due to hospitalization, regardless of whether the individual was reimbursed by health insurance. Non-hospitalization costs due to injury or illness, i.e., costs not related to the act of hospitalization in the year prior to the interview, were also recorded, including expenses such as buying medications. Thus, the total medical costs variable was recorded as the sum of the hospitalization-related and non-hospitalization costs, and this served as the dependent variable in this study.

Covariates

The CFPS employs rigorous instruments to ensure data quality, including the use of uniformly trained surveyors, field verification and audio confirmation. The reliability of the questionnaire has been previously confirmed.

Considering that individual medical costs may be associated with a wide range of factors, a range of demographic, health-related, socioeconomic, and behavioral confounders were included as variables in this study. The demographic confounders included the age at interview and the gender of the participant. Health-related confounders included the number of chronic diseases suffered, depressive symptoms, self-rated health status, subjective memory impairment, and whether the participant was hospitalized in the last year. Socioeconomic confounders included the participant’s medical insurance, marriage status, type of household fuel, type of house, whether they obtained a subsidy, work status, and household income. In addition, the number of hospital beds per capita and income per capita in the participant’s city were recorded. Habits and behavioral confounders included drinking, smoking, exercising habits, and whether the participant surfed the internet. A set of year dummy variables was also included to control for annual unobservable fixed effects. Detailed information about the covariates is presented in Additional file 1: Text. S1.

Tobit regression model combined with correlated random effects and control function method (Tobit-CRE-CF)

The dependent variable was the total medical costs of the participants. This was a continuous variable with values greater than or equal to zero. If the distribution of the dependent variable is normal or approximately normal, the linear regression model (or ordinary least squares; OLS) can be employed to explore the effects of long-term exposure to air pollution on individual medical costs. However, many people in China do not incur any medical costs over a substantial period of time. For example, in this study, more than 30% of participants did not incur any medical costs during the entire year prior to their interview. Thus, linear regression was not applicable in this study. As such, a two-step process was utilized to explore the effect of long-term exposure to air pollution on individual medical costs: (1) estimate whether the individual would incur any medical costs and (2) estimate the amount spent by the individuals who would incur medical costs. This type of response variable is also called a corner solution response or corner solution outcome [24].

The Tobit regression model is appropriate when the observed range of the dependent variable is censored in some way, as in the case of the dependent variable in the current study [11, 25]. The Tobit model modifies the likelihood function so that it reflects the unequal sampling probability of each observation, depending on whether the latent dependent variable falls above or below a defined threshold [26].

In this study, because the dependent variable, the total medical costs of the respondent in the previous year, was greater than or equal to 0, it can be assumed to be censored on the left side, with a censored value of 0, and unlimited on the right side. The Tobit model in this study was designed as follows:

$$y_{it}^\ast={\boldsymbol{X}}_{it}^{T}\boldsymbol{\beta}+\mu_{i}+\epsilon_{it},\qquad y_{it}=\left\{\begin{array}{cc}y_{it}^\ast&\text{if }y_{it}^\ast>0\\0&\text{if }y_{it}^\ast\leqslant0\end{array}\right.$$
(1)

where i denotes the individual; t is the time (year); y* is the latent measure of the individual total medical costs; X is the group of independent variables, including the proxies of air pollutants (PM2.5 and ground surface ozone), which are the key independent variables, and a set of covariates (also called confounders); \(\beta\) is the coefficient vector to be estimated; \(\mu\) is a random effect term; and \(\epsilon\) is the idiosyncratic error term.

Maximum likelihood estimation (MLE) was used for the Tobit regression model. The estimation process can be summarized in three steps: (1) obtain the probability density function for y in Eq. (1); (2) obtain the likelihood function based on the probability density function; (3) obtain the parameters using the Newton–Raphson method to maximize the value of the likelihood function [27, 28]. A detailed description of the Tobit regression estimation method is provided in Additional file 1: Text. S2.
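Steps (1) and (2) of the estimation can be illustrated with a short sketch of the log-likelihood for a Tobit model left-censored at 0. This is a simplified, pooled version under stated assumptions: it omits the random effect term, assumes normally distributed errors with standard deviation `sigma`, and does not implement the Newton–Raphson maximization. The function names are illustrative only.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(beta, sigma, X, y):
    """Pooled Tobit log-likelihood with left-censoring at 0.

    Uncensored observations (y > 0) contribute the normal density of the
    residual; censored observations (y == 0) contribute P(y* <= 0).
    """
    total = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, xi))
        if yi > 0:
            total += math.log(norm_pdf((yi - xb) / sigma) / sigma)
        else:
            total += math.log(norm_cdf(-xb / sigma))
    return total


# Toy data: an intercept-only model with one censored and one positive outcome.
X = [[1.0], [1.0]]
y = [0.0, 1.0]
ll = tobit_loglik([0.0], 1.0, X, y)
```

A numerical optimizer would then search over `beta` and `sigma` for the values that maximize this function, which corresponds to step (3).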

The Tobit regression model is a useful benchmark but can be biased by unobserved time-invariant effects, individual-specific effects, and endogeneity [29]. Based on the relevant literature [10, 11, 24, 30], the Tobit regression model combined with correlated random effects (CRE) was employed to avoid the estimated bias caused by unobserved time-invariant and individual-specific effects, and the control function (CF) was employed to avoid endogeneity bias.

The key assumption of the CRE regression model is that the unobserved time-invariant and individual-specific effects can be denoted by the linear combination of independent variables [29]. Unobserved time-invariant and individual-specific effects can be avoided by controlling the linear combination of independent variables in the regression model. The control function (CF) can correct endogeneity problems by modeling the endogeneity in the residual. The Tobit regression model combined with CRE and CF is, thus, robust to the bias caused by unobserved time-invariant and individual-specific effects in panel data and to potential endogeneity [10, 31]. The model in this study can be described as follows:

$$\begin{array}{c}{\text{Air pollutants}}_{it}=g\left(Z,X\right)+{\eta }_{1}{\overline{\text{Air pollutants}}}_{i}+V\\ \text{where: }g\left(Z,X\right)={\beta }_{0}+{\boldsymbol{Z}}_{it}^{T}\boldsymbol{\alpha }+{\boldsymbol{X}}_{it}^{T}\boldsymbol{\theta }+{\overline{\boldsymbol{Z}}}_{i}^{T}\boldsymbol{\tau }+{\overline{\boldsymbol{X}}}_{i}^{T}\boldsymbol{\omega }\end{array}$$
(2)
$$\begin{array}{c}{y}_{it}^{*}={\gamma }_{0}+{\gamma }_{1}{\text{Air pollutants}}_{it}+\xi V+{\boldsymbol{X}}_{it}^{T}\boldsymbol{\theta }+{\mu }_{i}+{\epsilon }_{it}\\ \text{where: }{\mu }_{i}=\alpha +{\varphi }_{1}{\overline{\text{Air pollutants}}}_{i}+{\overline{\boldsymbol{Z}}}_{i}^{T}\boldsymbol{\delta }+{\overline{\boldsymbol{X}}}_{i}^{T}\boldsymbol{\rho }+{\vartheta }_{i}\end{array}$$
(3)

where Air pollutants denotes the PM2.5 concentration or the ground surface ozone concentration; Z represents the instrumental variables; X is a set of covariates; V is the residual of the endogenous variable fitted by formula (2); y* is the latent measure of the respondent’s total medical costs, as described in formula (1); \({\mu }_{i}\) is the unobserved time-invariant and individual-specific fixed effects, which can be explained by a linear combination of the individual-level means of the endogenous variable (\({\overline{\text{Air pollutants}}}_{i}\)), the instrumental variables (\({\overline{\boldsymbol{Z}}}_{i}^{T}\)), and a set of covariates (\({\overline{\boldsymbol{X}}}_{i}^{T}\)), with coefficients \({\varphi }_{1}\), \(\boldsymbol{\delta }\), and \(\boldsymbol{\rho }\), respectively; and \({\epsilon }_{it}\) and \({\vartheta }_{i}\) are the random disturbance items. A detailed description of CRE and CF is presented in Additional file 1: Text. S2.
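The two building blocks of the CRE-CF approach, individual-level means and first-stage residuals, can be sketched on toy data. This is a deliberately reduced illustration and not the authors' specification: it uses a single instrument (wind speed), no covariates, and a one-regressor least-squares first stage, and all values and function names are hypothetical.

```python
def individual_means(ids, values):
    """CRE term: each person's mean over waves, repeated on every row of that person."""
    sums, counts = {}, {}
    for i, v in zip(ids, values):
        sums[i] = sums.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [sums[i] / counts[i] for i in ids]


def simple_ols(x, y):
    """Intercept and slope of a one-regressor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy balanced panel: two people observed in three waves each.
ids = [1, 1, 1, 2, 2, 2]
wind = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]          # instrument Z (hypothetical)
pm25 = [50.0, 46.0, 45.0, 47.0, 44.0, 40.0]    # endogenous pollutant (hypothetical)

# CRE step: person-level mean of the instrument.
wind_bar = individual_means(ids, wind)

# CF step: first-stage fit of the pollutant on the instrument, then residuals.
intercept, slope = simple_ols(wind, pm25)
v_hat = [p - (intercept + slope * w) for p, w in zip(pm25, wind)]
```

In the second stage, `v_hat` and the person-level means would enter the Tobit model as additional regressors; a statistically significant coefficient on `v_hat` is what signals endogeneity.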

CRE-CF estimations are robust to endogeneity and unbiased only if the instrumental variables (Z) can adequately explain variations in air pollution exposure (relevant prerequisite) and lack the ability to independently explain variations in medical costs (valid prerequisite) [32]. The Cragg-Donald Wald F test and Sargan test were used to test the relevant prerequisite [33, 34] and the valid prerequisite [35], respectively. Explanations of the Cragg-Donald Wald F test and Sargan test are provided in Additional file 1: Text. S2.

To test whether there were unobserved time-invariant effects, individual-specific effects, and potential endogeneity, the Hausman specification test was employed in this study with the null hypothesis that the differences in the estimates are not systematic [36]. A detailed description of the Hausman specification tests used in this study is provided in Additional file 1: Text. S2.

All data management was performed using R 4.0.2. The statistical analyses were performed with Stata (Version 15.0 SE, StataCorp, College Station, TX, USA). P < 0.05 was used to determine statistical significance. The standard errors of all estimations in the CRE-CF regression model were computed by the bootstrap method, resampling the dataset 100 times to obtain more robust estimates.
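The bootstrap procedure can be sketched as follows. The text does not state the resampling unit, so this example assumes that whole individuals (all of their waves together) are resampled with replacement, which is a common choice for panel data. The estimator here is a simple mean standing in for the full two-step CRE-CF estimate, and all names and data are hypothetical.

```python
import random
import statistics


def cluster_bootstrap_se(data_by_person, estimator, reps=100, seed=0):
    """Bootstrap standard error, resampling individuals with replacement.

    data_by_person maps a person id to that person's list of observations.
    estimator takes a flat list of observations and returns one number.
    """
    rng = random.Random(seed)
    ids = list(data_by_person)
    estimates = []
    for _ in range(reps):
        sample_ids = [rng.choice(ids) for _ in ids]
        rows = [row for i in sample_ids for row in data_by_person[i]]
        estimates.append(estimator(rows))
    return statistics.stdev(estimates)


def mean_cost(rows):
    """Stand-in estimator: the average cost in the resampled data."""
    return sum(rows) / len(rows)


# Hypothetical three-wave medical costs (RMB) for four people.
panel = {
    1: [0.0, 120.0, 0.0],
    2: [300.0, 0.0, 50.0],
    3: [0.0, 0.0, 0.0],
    4: [900.0, 400.0, 650.0],
}
se = cluster_bootstrap_se(panel, mean_cost, reps=100, seed=1)
```

For a control function estimator, both stages would normally be re-estimated inside each replication so that the standard errors reflect the uncertainty in the first-stage residuals.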

Results

Descriptive results

Descriptive statistics for the categorical and continuous variables are reported in Table 1. Columns (1)–(4) represent 2014, 2016, 2018, and the full sample, respectively. Due to space limitations, the text mainly presents the descriptive statistics of variables in the full sample; the details for 2014, 2016, and 2018 are shown in Table 1.

Table 1 Descriptive statistics for the key variables in this study

As shown in Table 1, the full sample included 8928 participants with a total of 26,784 observations, and the average age of the participants was approximately 50 years. The average total medical costs were 3117.08 RMB. The average PM2.5 concentration, ground surface ozone, PBLH, and wind speed were 48.49 µg/m3, 112.67 µg/m3, 488.16 m, and 1.03 m/s, respectively. There were slightly more females (51.83%) than males (48.17%). Nearly 80% of the participants across all waves of the study (78.73%) did not suffer from any chronic disease, and only approximately 9% suffered from two or more chronic diseases. The average CES-D score was 24.81, and approximately 55% of participants across all waves had no change in self-rated health condition compared to the previous year. Across all waves, nearly 87% of the participants were not hospitalized in the year prior to the interview.

Across all waves, more than 90% were married. More than 60% (63.98%) of participants across all waves used clean cooking fuel in their daily life. Overall, 44.37% of participants lived in a one-story house during the study period, while 47.22% lived in a multi-story house. More than 50% did not receive any government subsidies. Across all waves, a total of 76.89% of participants were employed and nearly 70% did not report surfing the internet. Nearly 85% did not drink alcohol more than three times a week, and more than 80% did not smoke in daily life. Nearly 60% rarely or never exercised in daily life.

Regression results

Tables 2 and 3 display the impact of exposure to PM2.5 and ground surface ozone on individual total medical costs, respectively. Columns (1)–(5) of the tables report the regression results for the baseline linear (OLS), fixed-effects combined with two-stage least squares (FE-2SLS), Pool-Tobit, Tobit-CRE, and Tobit-CRE-CF models, respectively. For each model, the variable marginal effects are reported. Moreover, the results of the Hausman test, the Cragg-Donald Wald F statistic, and the Sargan test are displayed in the last three rows of Tables 2 and 3. Due to space limitations, only the estimated coefficients of the main explanatory variables of interest and the results of some tests are displayed in Tables 2 and 3. The estimations for the other covariates are shown in Additional file 1: Table S1-S2.

Table 2 Estimation results of the effects of PM2.5 on the total medical costs
Table 3 Estimation results of the effects of ground surface ozone on the total medical costs

The P values of the Hausman specification tests between the estimates of the Pool-Tobit and Tobit-CRE regression models were smaller than 0.05, suggesting the presence of unobserved time-invariant and individual-specific effects that could cause bias in the estimation of the Tobit regression model. In addition, since the estimation of the residuals of PM2.5 was statistically significant, it is likely that the endogeneity of PM2.5 would cause bias, and the resulting estimation results of the Tobit-CRE model would be inconsistent. The FE-2SLS model does not consider the distribution of the total medical costs, in which more than 30% of observations were equal to 0; thus, this would lead to biased estimates. Considering the inconsistent estimations provided by the OLS, FE-2SLS, Pool-Tobit, and Tobit-CRE regression models, the following discussions are mainly based on the results estimated by the Tobit-CRE-CF regression model.

The coefficient of PM2.5 estimated by the Tobit-CRE-CF regression model was 526.396, as shown in column (5) of Table 2, and it was statistically significant at the 5% level, indicating that PM2.5 exposure concentration was positively associated with total medical costs. This suggests that long-term exposure to PM2.5 increases CFPS participants’ total medical costs. However, the PM2.5 coefficients estimated by the OLS, FE-2SLS, Pool-Tobit, and Tobit-CRE models were −0.008, 0.052, −16.527, and −47.180, as shown in columns (1)–(4) of Table 2, respectively. Some of these values contrast with the estimate provided by the Tobit-CRE-CF regression model, suggesting that neglecting the dependent variable distribution and the presence of unobserved time-invariant and individual-specific fixed effects, as well as potential endogeneity, causes substantial bias and can even lead to the wrong conclusion.

The coefficient of ground surface ozone estimated by the Tobit-CRE-CF regression model was 198.626, as shown in column (5) of Table 3, and it was statistically significant at the 5% level. This indicates that ground surface ozone exposure is also positively associated with total medical costs. However, the coefficients for ground surface ozone estimated by the OLS, FE-2SLS, Pool-Tobit, and Tobit-CRE regression models were −0.009, −0.023, −4.859, and −46.420, respectively, as shown in columns (1)–(4) of Table 3. Some of these values are inconsistent with the estimate provided by the Tobit-CRE-CF regression model. Again, this indicates that neglecting the dependent variable distribution and the presence of unobserved time-invariant and individual-specific fixed effects, as well as potential endogeneity, causes significant bias. This highlights the importance of the empirical strategies employed in the current study.

For PM2.5 and ground surface ozone, the Cragg-Donald Wald F statistics were both greater than the Stock-Yogo weak identification test critical value at the 10% level, suggesting that the relevant prerequisite was met; namely, the instrumental variables are strongly related to the endogenous variable. The P values of the Sargan statistics were both greater than 0.05, so the null hypothesis that the overidentifying restrictions are valid could not be rejected. This indirectly suggests that the valid prerequisite of the instrumental variables was met.

In this study, the Tobit-CRE-CF regression model, a kind of limited dependent variable model, was used to obtain more accurate estimates. The trade-off is that its coefficients are hard to interpret directly. To aid interpretation, we calculated three marginal effects from the model: (1) the marginal effects for the latent measure of total medical costs (y*); (2) the marginal effects for the latent measure of total medical costs conditional on the observed total medical costs being greater than 0, namely, for the individuals who spent medical costs last year (y*|y > 0); and (3) the marginal effects for the observed measure of total medical costs conditional on the observed total medical costs being greater than 0, namely, for the individuals who spent medical costs last year (y|y > 0). Specifically, for PM2.5 concentration and ground surface ozone, the marginal effects for the observed measure of total medical costs for the individuals who spent any medical fees last year (y|y > 0) were 199.144 and 75.145, respectively, both significant at the 5% level. This means that a one-unit increase in PM2.5 concentration or ground surface ozone increases the total medical costs of the individuals who spent medical fees last year by 199.144 or 75.145 RMB, respectively. The results are shown in Additional file 1: Table S3.
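To make the link between a Tobit coefficient and its marginal effects concrete, the sketch below computes the three standard textbook quantities for a Tobit model censored at 0: the effect on the latent outcome, the effect on the expected outcome among those with positive costs, and the effect on the unconditional expected outcome. Whether these map one-to-one onto the quantities in Table S3 is an assumption, and the input values are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_marginal_effects(beta_k, xb, sigma):
    """Marginal effects of regressor k at linear index xb, for censoring at 0.

    Returns (latent, conditional, unconditional):
      latent        = d E[y*] / d x_k        = beta_k
      conditional   = d E[y | y > 0] / d x_k = beta_k * (1 - lam * (c + lam))
      unconditional = d E[y] / d x_k         = beta_k * Phi(c)
    where c = xb / sigma and lam is the inverse Mills ratio phi(c) / Phi(c).
    """
    c = xb / sigma
    lam = norm_pdf(c) / norm_cdf(c)
    latent = beta_k
    conditional = beta_k * (1.0 - lam * (c + lam))
    unconditional = beta_k * norm_cdf(c)
    return latent, conditional, unconditional


# Hypothetical coefficient of 100 evaluated where the linear index is 0.
latent, conditional, unconditional = tobit_marginal_effects(100.0, 0.0, 1.0)
```

Both scaled effects are smaller in magnitude than the raw coefficient, which is why a reported marginal effect can be well below the corresponding Tobit coefficient.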

Modification analysis

We preliminarily explored whether the effects of long-term air pollution exposure on individual medical costs are modified by gender and age, and the estimates are shown in Additional file 1: Table S4.

Columns (1) and (2) of Table S4 show the results of the modification analysis of gender for PM2.5 concentration and ground surface ozone, and columns (3) and (4) show the results of the modification analysis of age for PM2.5 and ground surface ozone. The estimated coefficient of the interaction term between female and PM2.5 is greater than 0 and significant at the 5% level, which means that the harmful effects of long-term PM2.5 exposure on individual medical costs are greater for females than for males, while the modification effects of gender are not significant for ground surface ozone exposure. The estimated coefficient of the interaction term between age and ground surface ozone is greater than 0 and significant at the 5% level, which means that the older the individual, the greater the harmful effects of long-term ground surface ozone exposure on individual medical costs, while the modification effects of age are not significant for PM2.5 exposure.

Robustness test

Two strategies were adopted to exclude the effects of extreme total medical cost values. First, the total medical cost data were trimmed by deleting the observations in the top 1% of costs, and the same regression model and methods were then applied to the trimmed data. Second, the upper limit of the total medical costs was set to 100,000 RMB in the Tobit-CRE-CF model. In this case, if the total medical costs were more than 100,000 RMB, they were recorded as 100,000 RMB.
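The two treatments of extreme values can be sketched in a few lines. This is an illustrative example on made-up data; the function names are hypothetical, and the handling of ties at the trimming threshold is an assumption, since the text does not specify it.

```python
def trim_top_share(costs, share=0.01):
    """Drop the largest `share` of observations, keeping the original order."""
    n_drop = int(len(costs) * share)
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    keep = set(order[:len(costs) - n_drop])
    return [c for i, c in enumerate(costs) if i in keep]


def cap_costs(costs, upper=100_000.0):
    """Record any cost above the upper limit as the upper limit itself."""
    return [min(c, upper) for c in costs]


# Hypothetical costs (RMB): 200 observations with one extreme value.
costs = [float(v) for v in range(0, 2000, 10)]
costs[-1] = 250_000.0

trimmed = trim_top_share(costs)   # strategy 1: delete the top 1% (2 of 200)
capped = cap_costs(costs)         # strategy 2: upper limit of 100,000 RMB
```

Trimming changes the sample size, whereas capping keeps every observation and only limits its value, so the two checks address outliers in different ways.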

Columns (1) and (2) of Table 4 report the results of the robustness test of the trimmed 1% dataset. The PM2.5 and ground surface ozone coefficients were 198.836 and 72.748, respectively, and both were statistically significant at the 5% level. Columns (3) and (4) of Table 4 report the results of the robustness test based on the Tobit-CRE-CF model with an upper limit of 100,000 RMB. The PM2.5 and ground surface ozone coefficients were 484.622 and 175.289, respectively, and both were statistically significant at the 5% level. These results indicate that the PM2.5 concentration and ground surface ozone exposure concentration are positively associated with total medical costs, which is consistent with the main regression analyses. This suggests that the results are robust.

Table 4 Estimated results of the robustness tests

Non-hospitalization medical costs were also used as a dependent variable to test whether the causal effect of air pollution on medical costs persisted after excluding hospitalization costs. Columns (5) and (6) of Table 4 report the results of this robustness test. The PM2.5 and ground surface ozone coefficients were 238.334 and 92.122, respectively, and both were statistically significant at the 5% level. These results indicate that the PM2.5 concentration and ground surface ozone exposure are positively associated with non-hospitalization medical costs, suggesting that the causal effect of air pollution on medical costs persists after excluding hospitalization costs. Only the estimated coefficients of the independent variables of interest and the results of some tests are displayed in Table 4.

Discussion

This study successfully utilized a large representative national longitudinal cohort study to construct causal models using the Tobit-CRE-CF method and found that long-term exposure to air pollutants may lead to higher medical costs for individuals, regardless of whether the air pollution is severe enough to cause an individual to be hospitalized and whether the individual may take different measures to cope with the air pollutants.

This study has the following strengths. First, to our knowledge, it is the first to assess the effects of different types of air pollutants on individual healthcare costs over time using a large, long-term, national cohort, and corresponding causal models. Currently, the literature mainly focuses on the health impact of air pollution and lacks evidence from developing countries [37], individual-level evidence [9, 38], and comparisons of different types of pollutants [39]. Moreover, most published studies do not consider causal effects [40] or the reality that a significant portion of the population may not spend healthcare costs for a long period of time. This is particularly important when exploring economic burdens, as an accurate assessment of the impacts of air pollution can be decisive in how policymakers adopt relevant measures [13].

Importantly, as mentioned above, this study provides a useful response to the problem of a significant proportion of respondents with zero medical costs incurred over the study period. China began its healthcare reform in 2009 in response to public dissatisfaction with the limited accessibility and high cost of healthcare, and while considerable results have been achieved, more progress is expected [41]. Thus, among the respondents who did not spend any medical expenses in the year prior to their interview, while some of them may have been in good health and did not need to spend any medical costs, others may have needed to seek medical help but did not due to economic conditions, the accessibility of medical services, and other reasons [41, 42]. It is not reasonable to generalize these respondents to the group with zero medical costs and directly perform OLS or IV + 2SLS [24]. Therefore, in this paper, a Tobit regression model was used to determine whether the respondents were expected to spend medical costs based on their comprehensive characteristics. This effectively prevented the bias caused by the mishandling of too many 0 values.

In addition, this study developed, applied, and validated the Tobit-CRE-CF method, which is able to deal simultaneously with time-invariant effects, individual-specific effects, endogeneity of air pollutants, and the presence of some participants who did not incur any medical costs during the study period [10, 11, 24, 43]. In previous explorations of the relationship between air pollutant exposure and adverse health outcomes, the biggest obstacle often encountered is the endogeneity of air pollution [12, 30, 44]. One commonly cited explanation is reverse causality, i.e., endogeneity due to the possibility that people with high medical costs may engage in air pollution mitigation behaviors such as “chasing clean air” [45]. The results of the OLS, Pool-Tobit, and Tobit-CRE models in this study suggested that long-term exposure to air pollutants may contribute to lower healthcare costs even after comprehensively controlling for confounding factors. This result largely corroborates the presence of endogeneity of air pollution. Based on this consideration, the endogeneity of air pollution was controlled using the CF method. We were fortunate to identify two reasonable instrumental variables, PBLH and wind speed, and their rationality and validity were demonstrated by a series of rigorous tests. These results support the validity of the observed causal effects.

Long-term exposure to air pollutants may have multiple adverse effects on human health. PM2.5 exposure may cause asthma [46] and atherosclerosis [47], among other conditions, while ground surface ozone may cause cardiovascular disease [48], emphysema [49], and damage to the central nervous system [50], all of which may lead to higher individual medical costs. This study explored the causal effects of different types of pollutants on medical costs using PM2.5 and ground surface ozone as two representatives. The rationale is that people may respond differently to different pollutants. Most people are familiar with PM2.5 pollution, which is widely publicized and is one of the pollutants that causes the greatest adverse health and economic burden, and are therefore more likely to take appropriate protective measures, such as reducing exposure by going out less or wearing a mask [15, 51]. In contrast, people are less familiar with ground surface ozone pollution; although it has received increasing attention from the Chinese government in recent years, its severity is often difficult for individuals to judge, and people are more likely to go out in brighter weather, when ozone pollution is more severe, which increases exposure [7, 14, 52]. The importance of controlling air pollutants is further supported by the finding in this study that long-term exposure to different types of air pollutants causes an increase in individual medical costs.

The medical costs in this study were measured as the sum of total hospitalization medical costs and total non-hospitalization medical costs incurred by respondents due to illness or accidental injury. It is worth noting that the costs evaluated include not only direct medical costs, such as medicine, treatment, and ward fees, but also indirect medical costs, such as lodging, meals, and caregiver fees. Together with the rigorous instruments adopted by the CFPS to ensure data quality, this gives us the opportunity to comprehensively evaluate the economic impact of air pollution exposure. In addition, this study further validates that air pollution exposure has a consistent effect on the costs associated with medical care for conditions that are not severe enough to warrant hospitalization.

Despite its strengths, this study has several limitations that should be noted. First, the pollutant exposure estimates in this study were made at the district/county level, which requires the assumption that respondents always stay within their district/county. However, given that only those who did not move out of the district/county during the study period were included, this is not likely to have had a substantial impact on the study conclusions. Second, despite the detailed justification of the research process in this paper, caution should be exercised when inferring the causal impact of long-term air pollution on individual medical costs. Future studies should combine multiple causal approaches to comprehensively assess the causal effects of air pollutants on medical costs while conducting a full range of heterogeneity and mechanism analyses. Third, it is possible that some minor diseases or injuries caused by long-term exposure to air pollutants would not prompt individuals to seek help and incur costs, whether hospitalization-related or non-hospitalization-related. In addition, data limitations make it difficult to assess the costs associated with death, which should ideally have been included in medical cost considerations. Nonetheless, in both cases, the current estimates would underestimate the effects of long-term exposure to air pollutants on individual medical costs. Finally, the rigorous methodology adopted in this paper to more accurately assess the causal effect of air pollution on health care costs also required us to make some concessions in the ability to fully assess health care costs in China. Further research should be conducted to assess the full economic burden associated with air pollution.

Conclusions

Using a large, long-term, representative nationwide cohort, this study developed and validated a Tobit-CRE-CF model to control for unobserved time-invariant effects, individual-specific effects, and endogeneity and to address the issue of the large number of participants who did not incur medical costs during the study period. The results revealed that long-term exposure to air pollutants contributes to higher individual medical costs, regardless of whether the air pollution causes individuals to get sick enough to require hospitalization or whether individuals adopt various measures to minimize their pollutant exposure.