Introduction

Nitrogen dioxide (NO2) plays an important role in free radical chemistry and in photochemical processes in the troposphere and stratosphere (Crutzen 1979) and can generate ozone and fine particulate matter through complex physicochemical processes (Bidleman 1988; Odum et al. 1996; Pankow 1987). The products of these complex processes, as well as NO2 itself, can have a profound impact on the global environment (Altshuller and Bufalini 1971; Atkinson 2000). In addition to its environmental impact, NO2 can also enter the human body and diffuse through the alveoli and pulmonary capillaries to all organs of the respiratory system. The health effects of NO2 have been studied by many researchers, including the classification of NO2 toxicity (Anyanwu 1999), the complex association with various diseases (Achakulwisut et al. 2019; Hu et al. 2020; Li et al. 2020; Niu et al. 2021; Zhang et al. 2021), and premature death caused by NO2 (Chen et al. 2018; Crouse et al. 2015; He et al. 2020; Hu et al. 2021; Jerrett et al. 2013; Liu et al. 2017; Nie et al. 2021). These studies demonstrate that NO2 has important effects on human health. Therefore, it is essential to monitor the NO2 concentration and NO2 concentration trend.

Currently, the main approaches to monitoring NO2 concentrations are ground stations and satellite remote sensing. Station monitoring is highly accurate but covers a small range, so assessments of pollution levels over large areas are highly uncertain, especially far from ground stations (Boersma et al. 2008). In contrast, remote sensing offers near real-time, continuous, large-scale coverage that largely compensates for the shortcomings of station monitoring (Fishman et al. 2008; Martin 2008), providing a reliable way to measure NO2 atmospheric concentrations (Bechle et al. 2013; Cheng et al. 2019; Ialongo et al. 2020; Krotkov et al. 2016; Penn and Holloway 2020).

Satellite monitoring can obtain NO2 atmospheric column concentrations, but ground-level NO2 concentrations are more relevant to the environment and human health. Therefore, many researchers have tried to establish a mathematical model between the NO2 atmospheric column and ground-level NO2 concentrations and then use the NO2 atmospheric column concentration to retrieve the ground-level NO2 concentration (Araki et al. 2018; Gu et al. 2017; Larkin et al. 2017; Liu 2021; Xu et al. 2019; Zhan et al. 2018; Wong et al. 2021). Larkin et al. (2017) attempted to develop a land use regression model for estimating global NO2 concentrations, but the accuracy of the model differed significantly among regions, which limited its applicability. Retrieval models for a single country or region are more accurate than the global model (Araki et al. 2018; He et al. 2019; Silibello et al. 2021). Many researchers have constructed retrieval models of ground-level NO2 concentrations in China at different spatial scales, such as regional and national scales, and at different temporal scales, such as daily, monthly, and annual concentrations (Chi et al. 2021, 2022; Liu 2021; Qin et al. 2020, 2017; Wu et al. 2021; Xu et al. 2019). However, the majority of these studies retrieved ground-level NO2 concentrations for a single year or a few years and did not examine ground-level NO2 concentrations over the long term. Additionally, they mostly focused on changes in NO2 concentrations, and few assessed NO2 population exposure.

Traditional pollutant exposure assessments generally interpolate air quality station data spatially to represent regional pollutant concentrations (Fridell et al. 2014), and the study areas are mostly at urban or small regional scales (Fenech and Aquilina 2021; Ramacher and Karl 2020). The estimation of ground-level NO2 concentrations using satellite remote sensing data provides important support for large-scale NO2 population exposure assessments. Silibello et al. (2021) used chemical transport models and machine learning to analyze NO2 population exposure in Italy from 2013 to 2015, and a combination of both models reduced NO2 concentration underestimation, which provides data support for environmental epidemiological studies. Zhan et al. (2018) conducted a study on NO2 population exposure in China from 2013 to 2016. The study found that approximately a quarter of the population was exposed to NO2 pollution and that urbanization exacerbated NO2 pollution. Qin et al. (2017) studied NO2 exposure levels in China from 2013 to 2014 and found that NO2 exposure levels were significantly higher in densely populated areas than in other areas. Previous studies often used residential addresses to replace population distribution at small regional scales (Fridell et al. 2014; Im et al. 2018), while census data for large regions are often discontinuous in time, which becomes a major limitation for large-scale population exposure assessments (Jerrett et al. 2005). However, with the release of multiple population time series products, this limitation is well remedied, and the products can support NO2 population exposure assessments over long periods of time (Tatem 2017).

In recent years, China has undergone rapid industrialization and urbanization, but the air pollution problems associated with this development are also very serious (Li and Zhang 2014). NO2 is one of the major air pollutants in China and has a significant impact on human health and the environment, so it is essential to study the variation in ground-level NO2 concentration and population exposure at the national scale (Xu et al. 2019; Yang et al. 2017). Currently, most studies related to NO2 population exposure cover small areas or short periods, and studies of NO2 population exposure in China over long periods are lacking. Therefore, this study aimed to conduct a long time series study of NO2 population exposure in China. First, we used ground-based monitoring data, tropospheric NO2 column concentration data from the Ozone Monitoring Instrument (OMI), and meteorological data to build a random forest model for estimating ground-level NO2 concentrations. We then assessed NO2 population exposure in China from 2005 to 2020 and analyzed the trend and persistence of population exposure. This work helps fill the gap in long time series NO2 population exposure assessments in China.

The remainder of this paper is organized as follows. The “Study area and data sources” section describes the study area and dataset. The “Method” section introduces the random forest model and statistical methods. The “Result” section introduces the results of ground-level NO2 concentrations from the random forest, analyzes the changes in ground-level NO2 concentrations and population exposure over multiple years, and discusses the trends in NO2 population exposure and the persistence of changes. The “Discussion” section discusses the causes of variation in NO2 concentrations and population exposure, comparing multiple models, and the “Conclusion” section summarizes the main findings.

Study area and data sources

Study area

In this study, the land area of China was taken as the study area, with a latitude range of 7 ~ 53°N and a longitude range of 72 ~ 136°E. The terrain of this region is high in the west and low in the east. China is rich in land surface types, including basins, mountains, hills, plains, and plateaus. Additionally, China contains five climatic zones: cold temperate, middle temperate, warm temperate, subtropical, and tropical, with diverse climate types and geographical environments. In recent decades, China’s industrialization has accelerated, and its economy has developed rapidly, which has led to more serious environmental problems. Figure 1a shows the distribution of major cities in China, and Fig. 1b shows the distribution of ground-level air quality monitoring stations in China in 2020.

Fig. 1 Overview of the study area. a The distribution of major cities in China. b The annual OMI NO2 column concentration and the distribution of NO2 monitoring stations in 2020

Ground NO2 data

China has gradually established a national air quality monitoring network, and by 2015, the number and coverage of monitoring stations had increased significantly. In this study, hourly ground-level NO2 concentration data from January 1, 2015, to December 31, 2020, were obtained from the China National Environmental Monitoring Center (CNEMC, http://106.37.208.233:20035/). We retained only stations with at least 80% valid values throughout the year as input to the model; approximately 1450 stations passed this filter. Since OMI overpasses at approximately 13:45 local time, the average value from 13:00 to 14:00 at each station was used as the daily measurement.

OMI NO2 data

The NO2 tropospheric column concentration data used in this study were obtained from OMI, which is carried on the Aura satellite of the Earth Observing System (EOS) and observes the radiation backscattered from the Earth’s atmosphere and surface. OMI covers wavelengths in the range of 270–500 nm, with an orbital swath width of 2600 km and a spatial resolution of 13 km × 24 km. The product used in this study is the OMI OMNO2d cloud-screened tropospheric NO2 column concentration level 3 product, which is derived from the level 2 product after quality control and area-weighted onto a grid with a spatial resolution of 0.25° × 0.25°. The screening criteria for the cloud-screened column concentration product are zenith angle < 85°, surface reflectivity < 30%, cloud cover < 30%, and 10 < cross-orbit position < 50.

Meteorological data

The meteorological data used in the study were obtained from ERA5, the fifth-generation atmospheric reanalysis product of the European Centre for Medium-Range Weather Forecasts (Hersbach 2016). We chose eight meteorological variables at moderate resolution (0.125° or 0.25°): the atmospheric boundary layer height (BLH), relative humidity (RH), 2 m atmospheric temperature (TEM), the u- and v-components of the 10 m wind, surface pressure (SP), total precipitation (TP), and evaporation (ET). Wind speed (WS) and wind direction (WD) were calculated from the u- and v-components of the 10 m wind.
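The derivation of wind speed and direction from the two wind components can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the function name is hypothetical, and the direction convention (the direction the wind blows from, in degrees clockwise from north) is an assumption, since the text does not state which convention was applied.

```python
import math

def wind_speed_direction(u10, v10):
    """Return (speed, direction) from the 10 m wind components.

    u10: eastward component (m/s); v10: northward component (m/s).
    Direction follows the meteorological convention (an assumption here):
    the direction the wind blows FROM, clockwise from north, in [0, 360).
    """
    speed = math.hypot(u10, v10)
    # atan2(-u, -v) points back along the wind vector, i.e. toward its origin.
    direction = math.degrees(math.atan2(-u10, -v10)) % 360.0
    return speed, direction

# A wind blowing toward the east (u > 0, v = 0) comes from the west.
print(wind_speed_direction(5.0, 0.0))
```

Under this convention, a westerly wind has a direction of 270° and a northerly wind a direction of 0°.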

Population data

The WorldPop dataset was developed by the WorldPop project (https://www.worldpop.org), which provides annual gridded population data for the period 2000 to 2020, and this study used the global population dataset for 2005 to 2020 with a spatial resolution of 1 km. The WorldPop dataset uses a random forest model to reallocate population numbers to the grid. The input variables for this model are the most recent official census data and a spatial auxiliary dataset. The spatial auxiliary dataset includes settlement locations and ranges, satellite nighttime lighting data, land cover data, and road and building maps. The estimated grid population is then finally adjusted to form the final dataset based on the UN Population Division’s total national estimates for the target year (Tatem 2017).

Method

Data integration

The spatial resolution of the datasets used in the study differed, so all data were brought to the spatial resolution of the ERA5 reanalysis product. The OMI data were resampled to this resolution using bilinear interpolation, and all ground station measurements falling within a single grid cell were averaged to give the ground NO2 concentration of that cell. ERA5 data were sampled at the daily satellite overpass time. WorldPop data were aggregated by zonal statistics as the sum of the population in each 0.125° × 0.125° grid cell. Finally, the monthly average of all data was calculated as the input to the model.
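The station-to-grid averaging step can be illustrated with a short sketch. The function name, the tuple layout of the station records, and the alignment of the grid to 0° longitude and latitude are all assumptions made for illustration; the study's actual grid origin is not given in the text.

```python
import math
from collections import defaultdict

def grid_mean(stations, res=0.125):
    """Average station values that fall in the same res x res degree cell.

    stations: iterable of (lon, lat, value) tuples (hypothetical layout).
    Returns {(col, row): mean value}, where col and row are cell indices
    on a grid assumed to be aligned to 0 degrees longitude and latitude.
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, count]
    for lon, lat, value in stations:
        cell = (math.floor(lon / res), math.floor(lat / res))
        sums[cell][0] += value
        sums[cell][1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}

# Two nearby stations share a cell; the third lies in a different cell.
stations = [(116.30, 39.90, 40.0), (116.31, 39.91, 50.0), (121.47, 31.23, 30.0)]
print(grid_mean(stations))
```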

Random forest model

The random forest (RF) model is a machine learning method proposed by Breiman (2001). The basic idea of the algorithm is to construct a number of decision trees and combine them into a forest. Because randomness enters at several stages, the random forest can generate hundreds or even thousands of decision trees that differ from one another, which allows it to capture multiple nonlinear relationships and form complex models.

Random forest regression first draws K random training sets from the sample data by sampling with replacement (bootstrap sampling), and the samples not drawn for a given training set form its test sample set. For each training set, a regression tree is built in which a fixed number n (n < p) of the p variables is randomly selected as candidates for the branching nodes, so each training set generates a corresponding regression tree. The model obtains the final prediction by taking the mean of the predictions of the regression trees.
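The three sources of structure described above (sampling with replacement, a random subset of variables, and averaging over trees) can be sketched in a few lines. This is only an illustration of the sampling logic under stated assumptions, not a full random forest: the function names are hypothetical, no tree is actually fitted, and standard implementations typically redraw the variable subset at every split rather than once per tree.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample of indices (with replacement).

    Returns (in_bag, out_of_bag): the drawn indices, and the indices that
    were never drawn, which can serve as a test set for that tree.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def random_feature_subset(p, n, rng):
    """Pick n of the p variable indices without replacement (n < p)."""
    return sorted(rng.sample(range(p), n))

def forest_predict(tree_predictions):
    """Combine per-tree predictions for one sample by taking their mean."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(100, rng)
print(len(set(in_bag)), len(out_of_bag))   # distinct in-bag vs. out-of-bag counts
print(random_feature_subset(10, 4, rng))   # variables considered by one tree
print(forest_predict([21.0, 25.0, 23.0]))  # mean of three tree outputs
```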

Model accuracy evaluation

We evaluated the performance of the random forest model using the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2). MAE ranges from 0 to positive infinity; the smaller the value is, the smaller the error. RMSE is interpreted in the same way: the smaller the value is, the higher the accuracy of the model prediction. R2 typically ranges between 0 and 1; the closer the value is to 1, the better the model fit is. MAE, RMSE, and R2 are calculated by Formulas (1)–(3), where \({\widehat{y}}_{i}\) is the estimated value of the i-th sample of the model, \({y}_{i}\) is the true value of the i-th sample, \(\overline{y }\) is the mean of the samples, and n is the total number of samples.

$$MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {\hat{y}_{i} - y_{i} } \right|}$$
(1)
$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } }$$
(2)
$$R^{2} = 1 - \frac{{\sum\limits_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\limits_{i = 1}^{n} {(y_{i} - \overline{y})^{2} } }}$$
(3)
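As a worked illustration of the three metrics, the following sketch computes them for a small made-up set of values. The function name and the numbers are hypothetical and are not taken from the study's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) for paired true and estimated values."""
    n = len(y_true)
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    rmse = math.sqrt(sse / n)
    mean_true = sum(y_true) / n
    sst = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - sse / sst
    return mae, rmse, r2

# Made-up concentrations in ug/m3, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
print(regression_metrics(measured, estimated))
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE and reacts more strongly to a few large errors.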

Population exposure assessment

NO2 population exposure was obtained by weighting the ground-level NO2 concentration and the population. Since the WorldPop dataset calculates the annual population of China, we estimated the annual NO2 population exposure in China at the prefecture-level city scale. The annual NO2 population exposure level can be calculated by Formula (4).

$$E_{j} = \frac{{\sum\limits_{i = 1}^{n} {\left( {Pop_{ij} \times NO_{{2_{ij} }} } \right)} }}{{\sum\limits_{i = 1}^{n} {Pop_{ij} } }}$$
(4)

where \({E}_{j}\) is the NO2 exposure of a city in year j, \({Pop}_{ij}\) and \({{NO}_{2}}_{ij}\) are the population and NO2 concentration of the i-th grid in a given year j, respectively, and n is the number of all grids in the city.
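Formula (4) is a population-weighted mean of the grid concentrations within a city. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def population_weighted_exposure(pop, no2):
    """Population-weighted mean NO2 concentration over the grids of one city.

    pop: population of each grid cell; no2: NO2 concentration of each cell,
    in the same order. Implements E = sum(pop * no2) / sum(pop).
    """
    total_pop = sum(pop)
    if total_pop == 0:
        raise ValueError("total population is zero; exposure is undefined")
    return sum(p * c for p, c in zip(pop, no2)) / total_pop

# Made-up city with two cells: the more populous cell dominates the result.
print(population_weighted_exposure([1000, 3000], [20.0, 40.0]))  # 35.0
```

In this example the unweighted mean concentration would be 30 µg/m3, whereas the exposure is 35 µg/m3 because three quarters of the population live in the more polluted cell.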

Theil-Sen trend analysis

Theil-Sen median trend analysis is able to capture the temporal trend of each grid. Therefore, the results are able to reflect the multiyear trend of NO2 population exposure. In addition, the method does not require the sample to obey a certain distribution, which makes it highly resistant to data errors (Sen 1968). The trend is calculated by Formula (5).

$$\begin{array}{*{20}c} {S_{R} = Median\left( {\frac{{E_{j} - E_{i} }}{j - i}} \right)} & {2005 \le i < j \le 2020} \\ \end{array}$$
(5)

where \({S}_{R}\) is the slope of the fit, \({E}_{i}\) is the NO2 population exposure in year \(i\), and \({E}_{j}\) is the NO2 population exposure in year \(j\). When \({S}_{R}\)>0, it indicates an increasing trend of the NO2 population exposure level, and vice versa, a decreasing trend.
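The Theil-Sen slope is the median of the slopes between all pairs of distinct years. A minimal sketch, with a hypothetical function name and a made-up four-year series:

```python
from itertools import combinations
from statistics import median

def theil_sen_slope(years, values):
    """Median of pairwise slopes (E_j - E_i) / (j - i) over all pairs i < j."""
    slopes = [
        (values[b] - values[a]) / (years[b] - years[a])
        for a, b in combinations(range(len(years)), 2)
    ]
    return median(slopes)

# Made-up exposure series; the single dip in 2007 barely affects the median.
years = [2005, 2006, 2007, 2008]
exposure = [10.0, 12.0, 11.0, 15.0]
print(theil_sen_slope(years, exposure))
```

Because the median ignores the most extreme pairwise slopes, a single anomalous year has limited influence on the estimated trend.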

Mann–Kendall test method

The Mann–Kendall test is a nonparametric test to determine the significance of changes in a given variable (Kendall 1955) and is calculated as follows: for the time series {\({E}_{i}\)}, \(i=2005, 2006, \dots , 2020\), the Z statistic is defined as Formula (6).

$$Z = \left\{ {\begin{array}{*{20}c} {\frac{S - 1}{{\sqrt {s(S)} }}} & {S > 0} \\ 0 & {S = 0} \\ {\frac{S + 1}{{\sqrt {s(S)} }}} & {S < 0} \\ \end{array} } \right.$$
(6)
$$S = \sum\limits_{i = 1}^{n - 1} {\sum\limits_{j = i + 1}^{n} {{\text{sgn}} (E_{j} - E_{i} )} }$$
(7)
$${\text{sgn}} (E_{j} - E_{i} ) = \left\{ {\begin{array}{*{20}c} 1 & {E_{j} - E_{i} > 0} \\ 0 & {E_{j} - E_{i} = 0} \\ { - 1} & {E_{j} - E_{i} < 0} \\ \end{array} } \right.$$
(8)
$$\;s\left( S \right) = \frac{n(n - 1)(2n + 5)}{{18}}$$
(9)

In Formulas (7)–(9), \({E}_{i}\) and \({E}_{j}\) are the NO2 population exposure levels in year \(i\) and year \(j\), n represents the length of the time series, and sgn is the sign function. In this paper, we determine the significance of the trend of NO2 population exposure change at the 95% confidence level and then grade the Z values into highly significant change (\(\left|Z\right|>2.58\)), significant change (\(2.58\ge \left|Z\right|>1.96\)), weakly significant change (\(1.96\ge \left|Z\right|>1.65\)), and no significant change (\(\left|Z\right|\le 1.65\)).
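The test statistic and the grading thresholds can be sketched as follows. The function names are hypothetical. The sketch takes each difference as the later value minus the earlier value, so that a positive Z corresponds to an increasing trend, and it uses the variance of Formula (9) without a correction for tied values.

```python
import math

def mann_kendall_z(series):
    """Return (S, Z) of the Mann-Kendall test for a time-ordered series."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]      # later value minus earlier value
            s += (diff > 0) - (diff < 0)      # sign function: 1, 0, or -1
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S, no tie correction
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z

def significance_grade(z):
    """Map |Z| onto the four grades used in the text."""
    a = abs(z)
    if a > 2.58:
        return "highly significant"
    if a > 1.96:
        return "significant"
    if a > 1.65:
        return "weakly significant"
    return "not significant"

# A strictly increasing eight-year series (made-up values).
s, z = mann_kendall_z([1, 2, 3, 4, 5, 6, 7, 8])
print(s, round(z, 3), significance_grade(z))
```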

Hurst index analysis

The Hurst index can quantitatively describe the persistence of variables over a time series (HURST 2013); here, we used the Hurst index to analyze the persistence characteristics of NO2 population exposure. The Hurst index is calculated by Formulas (10)–(13): for the time series {\({E}_{i}\)}, \(i=1,2,\dots ,n\), and for any positive integer \(\tau \ge 1\), the mean sequence, cumulative deviation, range, and standard deviation are defined as follows:

$$\begin{array}{*{20}c} {\overline{E}_{\tau } = \frac{1}{\tau }\sum\limits_{i = 1}^{\tau } {E_{i} } } & {\tau = 1,2, \ldots ,n} \\ \end{array}$$
(10)
$$\begin{array}{*{20}c} {X_{(t,\tau )} = \sum\limits_{i = 1}^{t} {(E_{i} - \overline{E}_{\tau } )} } & {1 \le t \le \tau } \\ \end{array}$$
(11)
$$\begin{array}{*{20}c} {R_{\tau } = \mathop {\max }\limits_{1 \le t \le \tau } X_{(t,\tau )} - \mathop {\min }\limits_{1 \le t \le \tau } X_{(t,\tau )} } & {\tau = 1,2, \ldots ,n} \\ \end{array}$$
(12)
$$\begin{array}{*{20}c} {S_{\tau } = \sqrt {\frac{1}{\tau }\sum\limits_{i = 1}^{\tau } {(E_{i} - \overline{E}_{\tau } )^{2} } \;} } & {\tau = 1,2, \ldots ,n} \\ \end{array}$$
(13)

For the standard deviation \(S_{\tau }\) and range \(R_{\tau }\), if \({R}_{\tau }/{S}_{\tau }\propto {\tau }^{H}\), then the time series is said to have the Hurst phenomenon, and H is the Hurst index. When 0.5 < H < 1, the NO2 population exposure is persistent; otherwise, it is nonpersistent.
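One common way to estimate H from these quantities is to regress the logarithm of the range-to-standard-deviation ratio on the logarithm of \(\tau\) and take the slope. The sketch below follows that approach under stated assumptions: the function names and the minimum window length of 3 are hypothetical choices, the text does not say how H was fitted in the study, and estimates from series as short as 8 or 16 years are necessarily rough.

```python
import math

def rescaled_range(series):
    """Range of cumulative deviations divided by the standard deviation.

    Returns None when the standard deviation is zero (constant series).
    """
    tau = len(series)
    mean = sum(series) / tau
    cumulative, running = [], 0.0
    for value in series:
        running += value - mean
        cumulative.append(running)
    value_range = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / tau)
    return value_range / std if std > 0 else None

def hurst_exponent(series, min_tau=3):
    """Least-squares slope of log(R/S) against log(tau) over growing windows."""
    xs, ys = [], []
    for tau in range(min_tau, len(series) + 1):
        rs = rescaled_range(series[:tau])
        if rs is not None and rs > 0:
            xs.append(math.log(tau))
            ys.append(math.log(rs))
    if len(xs) < 2:
        raise ValueError("not enough valid windows to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# A steadily rising made-up series should come out as persistent (H > 0.5).
print(hurst_exponent([float(v) for v in range(1, 17)]))
```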

Result

Model accuracy

We constructed a random forest model to predict ground-level NO2 concentrations by combining satellite, meteorological, population, and ground station data. The model parameters were n_estimators = 100, max_depth = 30, max_features = 4, min_samples_split = 15, and min_samples_leaf = 15. During model construction, 70% of the data were randomly selected as the training set, and the remaining 30% were used as the test set to evaluate the accuracy of the model. Figure 2 shows the correlation between the model-simulated and measured NO2 concentrations in the training and test datasets. The model-simulated concentration was significantly correlated with the measured concentration. The MAE, RMSE, and R2 of the model on the test dataset were 4.16 µg/m3, 5.79 µg/m3, and 0.79, respectively, which differed little from the accuracy on the training dataset. Additionally, the R2 of the model was greater than 0.75 in both the training and test datasets, indicating that the model performs well in simulating ground-level NO2 concentrations. We also evaluated the model using fivefold cross-validation, for which the MAE, RMSE, and R2 were 4.3 µg/m3, 5.82 µg/m3, and 0.77, respectively. Compared with the results on the test dataset, the cross-validation R2 decreased by 0.02, RMSE increased by 0.03 µg/m3, and MAE increased by 0.14 µg/m3. The cross-validation results were therefore basically consistent with the test results, which indicates that the random forest model shows no evident overfitting and is stable and reliable. Therefore, the simulated ground-level NO2 concentration from the random forest model can be used to analyze the spatial and temporal variations in NO2 concentration and population exposure in China.
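The reported configuration can be written down as a short sketch. The parameter names in the text match those of scikit-learn's `RandomForestRegressor`, but the text does not name the library, so the use of scikit-learn here is an assumption, as are the function names and the `random_state` argument. The imports are placed inside the functions so that the settings can be inspected without the library installed.

```python
# Hyperparameters and split settings as reported in the text.
RF_PARAMS = {
    "n_estimators": 100,
    "max_depth": 30,
    "max_features": 4,
    "min_samples_split": 15,
    "min_samples_leaf": 15,
}
TEST_FRACTION = 0.3  # 70% training, 30% test
CV_FOLDS = 5         # fivefold cross-validation

def build_model(random_state=0):
    """Create a regressor with the reported settings (scikit-learn assumed)."""
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(random_state=random_state, **RF_PARAMS)

def evaluate(X, y, random_state=0):
    """Return (hold-out R2, mean cross-validated R2) for features X, target y.

    X needs at least 4 feature columns because max_features is 4.
    """
    from sklearn.model_selection import cross_val_score, train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_FRACTION, random_state=random_state
    )
    model = build_model(random_state).fit(X_train, y_train)
    holdout_r2 = model.score(X_test, y_test)
    cv_r2 = cross_val_score(
        build_model(random_state), X, y, cv=CV_FOLDS, scoring="r2"
    ).mean()
    return holdout_r2, cv_r2
```

Comparing the hold-out score with the cross-validated score, as the text does, is a simple check that the accuracy does not depend on one particular random split.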

Fig. 2 Correlation between the model-simulated NO2 and measured NO2 concentrations. a Training dataset. b Test dataset

Temporal and spatial changes in ground-level NO2 concentrations

We studied the spatial and temporal variations in annual ground-level NO2 concentrations in China. The annual NO2 concentration was calculated from the monthly NO2 concentration predicted by the model. Figure 3 shows the distribution of the national annual NO2 concentration in 2005, 2010, 2015, and 2020; NO2 was spatially clustered. NO2 pollution was most serious in northern China, not only because NO2 concentrations were high but also because the area of high NO2 concentration was large, covering seven provinces and municipalities directly under the central government, including Henan, Hebei, Shandong, Beijing, and Tianjin. Within this region, NO2 pollution was concentrated in the south-central part of Hebei Province and the northern part of Henan Province. High NO2 concentrations in other regions were mostly found in developed cities and their surrounding areas. For example, high NO2 concentrations in the southwest were located in the western part of Chongqing and in Chengdu, and those in the northwest were located in Lanzhou and Xi’an. These are all provincial capitals or the main urban areas of municipalities directly under the central government. The situation in southern China was very similar to that in the west: high NO2 concentrations were concentrated in large cities such as Shenzhen, Guangzhou, and other cities in Guangdong Province. In contrast, the areas of high NO2 concentrations in eastern China were more dispersed. These areas included Shanghai, the southern part of Jiangsu Province, and the northern part of Zhejiang Province. This may be because the urbanization and industrialization levels of cities in eastern China differed less, causing NO2 pollution levels to be relatively similar.

Fig. 3 The annual NO2 concentration in China over multiple years. a 2005; b 2010; c 2015; d 2020

Temporally, NO2 concentrations in China first increased and then decreased. From 2005 to 2010, the annual NO2 concentration increased. Compared with 2005, the increase by 2010 was obvious in eastern and northern China, for example in Henan, Hebei, Shandong, Jiangsu, and Zhejiang Provinces. Furthermore, NO2 pollution increased in some cities in the west, such as Chengdu, Chongqing, and Lanzhou. From 2010 to 2015, the change in NO2 concentration was slight: NO2 concentrations decreased somewhat in western China, and the change in other regions was not obvious. From 2015 to 2020, the NO2 concentration in China declined significantly. In 2020, the national NO2 concentrations decreased to low levels, and the extent of NO2 pollution also decreased significantly, especially in the northern provinces of China, such as Henan and Shandong. Guangzhou and its surrounding cities formed the concentrated area of NO2 pollution in southern China, but the change in this region differed from the overall trend: NO2 concentrations there decreased throughout 2005 to 2020.

Because the NO2 concentration differs significantly among seasons, we used the maximum and mean NO2 concentrations in each season to analyze the NO2 concentration variations in China. The NO2 concentration was low in spring and summer and higher in autumn and winter, as shown in Fig. 4. The lowest NO2 concentration of the year occurred in summer because higher temperatures were conducive to the decomposition of NO2. Central heating in winter produces a large amount of air pollutants, including NO2, making winter the season with the most serious NO2 pollution. The mean concentration first increased and then decreased: the average NO2 concentration in each season gradually increased from 2005 to 2012 and declined between 2013 and 2020. The maximum NO2 concentration reflects the areas where NO2 pollution was serious. The maximum NO2 concentrations in spring and summer gradually declined over 2005–2020, while those in autumn showed a fluctuating upwards trend. The maximum NO2 concentration in winter increased somewhat from 2005 to 2019 and decreased significantly in 2020. This change was likely related to the COVID-19 epidemic. In addition, the maximum concentrations in autumn and winter were both higher than 40 µg/m3, indicating that NO2 pollution was still serious in some regions in autumn and winter. Therefore, corresponding environmental protection policies need to be formulated for provinces with serious pollution in China, such as Henan and Hebei. Additionally, although NO2 pollution was very serious in some regions, the seasonal mean NO2 concentrations show that NO2 concentrations in most regions of China were at a low level and that NO2 pollution was concentrated in a small number of regions.

Fig. 4 Seasonal variations in NO2 concentration. a Seasonal mean. b Seasonal maximum

Ground-level NO2 exposure assessment

We assessed NO2 population exposure at the prefecture-level city scale. Figure 5 shows the NO2 population exposure in 2005, 2010, 2015, and 2020. The spatial distribution of NO2 population exposure was similar to the distribution of NO2 concentration, with both having obvious spatial aggregation. The areas of high NO2 population exposure in the northwest were centred on Lanzhou, Xi’an, and Urumqi. The areas of high NO2 exposure in the southwest were centred on Chengdu and Chongqing, and the NO2 population exposure of Chengdu was significantly higher than that of Chongqing. The central region had a relatively low NO2 population exposure, and only Wuhan had a significantly higher NO2 population exposure than the other cities. In the eastern region, there were several cities with high NO2 population exposure, such as Hangzhou, Nanjing, and Shanghai. Additionally, the surrounding cities also had high NO2 population exposure, indicating that the NO2 concentration in this region was high and that the population distribution was also concentrated. Northern China is the region with the highest NO2 population exposure, with more cities at high NO2 population exposure, including Beijing, Tianjin, Shijiazhuang, Zhengzhou, and Jinan. The distribution of cities with high NO2 population exposure in Henan and Hebei was generally similar to the areas of high NO2 concentrations. Shandong Province had a relatively high NO2 population exposure due to its dense population, but the NO2 concentration in Shandong was lower than that in Henan and Hebei.

Fig. 5 NO2 population exposure in China for multiple years. a 2005; b 2010; c 2015; d 2020

In terms of time, the NO2 population exposure in China showed a significant increasing trend from 2005 to 2010. The rising trend was most obvious in the eastern and northern cities, such as Shanghai and Hangzhou in the east and Shijiazhuang, Zhengzhou, Beijing, and Jinan in the north. These cities were mostly located in large urban agglomerations, and the surrounding cities were also very densely populated. Therefore, in these two regions, cities with high NO2 population exposure tend to be distributed in clusters. In the northwest region, the population distribution was relatively concentrated due to the smaller population, so high NO2 population exposure was usually in large cities, such as Lanzhou and Urumqi, while the NO2 population exposure in the surrounding cities of these cities was relatively low. The two major cities in the southwestern region, Chongqing and Chengdu, also increased to some degree, but the NO2 population exposure in the surrounding cities did not change significantly. The NO2 population exposure in the southern region did not increase noticeably, and some cities even decreased to a certain extent. The NO2 population exposure of some cities in central and southwest China decreased to a certain extent from 2010 to 2015, particularly in two cities, Wuhan and Chongqing. The NO2 population exposure increased in Urumqi. The remaining cities in the country showed no significant change. The NO2 population exposure decreased significantly in almost all cities from 2015 to 2020, and NO2 pollution improved substantially in 2020. However, the NO2 population exposure in Chengdu was still higher than 30 µg/m3, and the NO2 population exposure in Urumqi did not change significantly, indicating that NO2 pollution was still serious in some cities in western China.

There were 33 cities with NO2 population exposure greater than 30 µg/m3 in 2012, which was the largest number of cities in the period 2005–2020. Therefore, we chose 2012 as the dividing year to study the NO2 population exposure trends in both periods. We calculated the NO2 population exposure trends based on Theil-Sen trend analysis and then analyzed the significance of the trends based on the Mann–Kendall test. Figure 6a shows the result of the M–K trend test for NO2 population exposure in each city from 2005 to 2012. During this period, NO2 population exposure significantly increased in the majority of Chinese cities. Some areas, such as Qinghai, Tibet, northern Gansu, western Sichuan, and western Yunnan, did not change significantly. These areas are mainly sparsely populated areas and have comparatively lower NO2 concentrations. In addition, a few densely populated cities had no significant changes in NO2 population exposure, such as Beijing, Shanghai, and Suzhou. Some cities in Guangdong Province, such as Guangzhou, Dongguan, and Foshan, showed a decrease or even a significant decrease.

Fig. 6 Trends in NO2 population exposure. a The trend from 2005 to 2012. b The trend from 2013 to 2020

Figure 6b shows the trend of NO2 population exposure from 2013 to 2020. This period was dominated by a significant decline in NO2 population exposure, but the number of cities that experienced a decline was clearly smaller than the number of cities with increases in the previous period, and the decline was mostly concentrated in the central, eastern, and northern parts of the country. Among the regions showing a downwards trend, Wuhan, Nanchang, and Changsha were the centres in the central region, but Hefei did not change significantly in this period. In the eastern region, Hangzhou and Shanghai were the centres. The north had the greatest number of cities with a significant downwards trend, including Beijing and Tianjin, the majority of cities in Henan and Shandong, and Harbin and Shenyang in the northeast. The west was clearly different: the majority of its cities showed no significant change or only weak downwards trends. Among the major cities in this region, Lanzhou and Urumqi in the northwest had a significant downwards trend, whereas Chongqing and Chengdu, the two central cities in the southwest, showed a weak downwards trend or no significant change. Overall, cities with downwards trends were mostly concentrated in the north, while fewer southern cities showed downwards trends.

The Hurst index measures the persistence of changes in NO2 population exposure. Figure 7a shows the Hurst index of NO2 population exposure from 2005 to 2012. The increase in NO2 population exposure was mostly persistent between 2005 and 2012, especially in the southern cities. Because NO2 concentrations were relatively low in southern China, the persistent increase in NO2 population exposure indicated that the region has a strong population attraction and increasing population density that led to increasing NO2 population exposure. The north was also generally dominated by a continuous increase, but the growth of most cities in Henan was noncontinuous, and the NO2 concentrations of Henan showed an increasing trend during this period. Therefore, it may be that the population density in Henan Province decreased, resulting in a noncontinuous increase in NO2 population exposure. Figure 7b shows the Hurst index of NO2 population exposure from 2013 to 2020. The declining trend in the north from 2013 to 2020 was mostly a continuous decline, and the western cities and a small number of southern cities exhibited noncontinuous changes. The decline in NO2 population exposure in the north during this period was mostly due to policy factors, such as the enactment of strict environmental protection laws in 2015, which reduced nitrogen dioxide emissions. The western region was mostly nonpersistent in this period, which is consistent with the results of the lack of a significant trend above. In addition, there was also an impact of the COVID-19 epidemic during this period, which may explain the discontinuous changes in a small portion of the southern region.

Fig. 7
Hurst index of NO2 population exposure. a The index of 2005–2012. b The index of 2013–2020
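
The text does not state which estimator was used for the Hurst index, so the following is only a minimal sketch under the assumption of a classical rescaled-range (R/S) analysis applied to a short annual series. The function name `hurst_rs`, the `min_window` parameter, and the sample series are hypothetical. Values above 0.5 are conventionally read as persistent (the current trend tends to continue), and values below 0.5 as nonpersistent.

```python
import math

def hurst_rs(series, min_window=3):
    """Estimate the Hurst exponent of a short series by rescaled-range analysis.

    Assumption: for every prefix length tau, R(tau)/S(tau) scales as tau**H,
    so H is the least-squares slope of log(R/S) against log(tau).
    """
    n = len(series)
    if n < min_window + 1:
        raise ValueError("series is too short for an R/S fit")
    log_tau, log_rs = [], []
    for tau in range(min_window, n + 1):
        window = series[:tau]
        mean = sum(window) / tau
        # Cumulative deviation from the window mean.
        running, cumulative = 0.0, []
        for value in window:
            running += value - mean
            cumulative.append(running)
        value_range = max(cumulative) - min(cumulative)
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / tau)
        if value_range > 0 and std > 0:
            log_tau.append(math.log(tau))
            log_rs.append(math.log(value_range / std))
    if len(log_tau) < 2:
        raise ValueError("not enough valid windows for a fit")
    # Ordinary least-squares slope.
    mean_x = sum(log_tau) / len(log_tau)
    mean_y = sum(log_rs) / len(log_rs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_tau, log_rs))
    sxx = sum((x - mean_x) ** 2 for x in log_tau)
    return sxy / sxx

# Made-up eight-year series (e.g. one city, 2005-2012), not data from the study.
steady_rise = [1, 2, 3, 4, 5, 6, 7, 8]
oscillating = [1, -1, 1, -1, 1, -1, 1, -1]
h_rise = hurst_rs(steady_rise)
h_osc = hurst_rs(oscillating)
print("persistent" if h_rise > 0.5 else "nonpersistent")
print("persistent" if h_osc > 0.5 else "nonpersistent")
```

With only eight annual values per period the fit rests on very few points, so an estimate of this kind is best read as a qualitative indicator of persistence rather than a precise exponent.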

Discussion

In this study, we used a random forest model to retrieve ground-level NO2 concentrations in China from 2005 to 2020 and analyzed the changes in NO2 population exposure over the years. The accuracy of the model result was high; the MAE, RMSE, and R2 of the model were 4.16 µg/m3, 5.79 µg/m3, and 0.79, respectively, and the MAE, RMSE, and R2 of the model cross-validation were 4.3 µg/m3, 5.82 µg/m3, and 0.77, respectively. Apart from the random forest model, we compared other regression methods: the commonly used linear regression, backpropagation neural network (BPNN), and support vector machine (SVM) models. Table 1 shows the results of the multiple model comparison. The random forest model had the smallest error of all models, while the traditional linear model had the worst fit. The SVM model performed worst among the machine learning models and had the longest computation time. All three machine learning methods performed better than the linear regression model, which indicates that machine learning regression usually outperforms the traditional statistical model when the parameters are complex and the amount of data is large. Previous research has used a variety of models to estimate ground-level NO2 concentrations, such as the extra tree model, geographically and temporally weighted regression model, community multiscale air quality model, and land use regression model (Gu et al. 2017; Larkin et al. 2017; Qin et al. 2020, 2017). The R2 values for these models were between 0.51 and 0.7, and the RMSE values were all greater than 9 µg/m3. Compared to previous research, the random forest model used in this study significantly improved the accuracy of the simulated ground-level NO2 concentrations, and the estimation results were more reliable. In addition, the random forest model is relatively simple to implement, with low computational overhead and strong interpretability.
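
For readers who want to reproduce this kind of comparison, the three scores quoted above can be computed from paired station observations and model estimates as in the sketch below. It assumes that R2 denotes the coefficient of determination (the paper may instead use the squared correlation coefficient); the function name `regression_metrics` and the sample values are hypothetical and are not taken from the study.

```python
import math

def regression_metrics(observed, predicted):
    """Return (MAE, RMSE, R2) for paired observed and predicted values.

    Assumption: R2 is the coefficient of determination, 1 - SS_res / SS_tot.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up ground-level NO2 values in µg/m3, for illustration only.
station_values = [10.0, 20.0, 30.0, 40.0]
model_values = [12.0, 18.0, 33.0, 39.0]
mae, rmse, r2 = regression_metrics(station_values, model_values)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Applying the same function to each candidate model on the same held-out stations keeps the comparison consistent across models.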

Table 1 Comparison of multiple models

Figure 8 shows the change in NO2 concentration and NO2 population exposure for 2005–2012 and 2013–2020, with the 2005 and 2013 values, respectively, taken as the reference. In the first period, the changes in NO2 concentration and NO2 population exposure were generally similar in central and eastern China. Northern Shaanxi and south-central Inner Mongolia were the two regions with the highest increase in NO2 concentration, and the increase in NO2 population exposure in these two regions was also the highest in the country. Some differences between NO2 concentration and NO2 population exposure were found in the western region. For example, the rise in NO2 population exposure in Urumqi and its surrounding cities was significantly lower than the rise in NO2 concentration. Additionally, the NO2 concentrations in some regions of Tibet, Qinghai, and Yunnan increased, but the NO2 population exposure decreased, indicating that the population density in these regions was low and that the increase in NO2 concentration did not directly cause an increase in NO2 population exposure. The changes in NO2 concentration and NO2 population exposure were generally consistent in the second period.

Fig. 8
Quantified changes in NO2 concentration and NO2 population exposure. a NO2 concentration changes from 2005 to 2012; b NO2 population exposure changes from 2005 to 2012; c NO2 concentration changes from 2013 to 2020; d NO2 population exposure changes from 2013 to 2020
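
The divergence between concentration change and exposure change can be illustrated with a toy calculation. This sketch assumes, purely for illustration, that exposure in a grid cell is the product of NO2 concentration and population density; that formula, the function names, and all numbers are hypothetical and are not taken from this section. Changes are measured against a reference year, as in Fig. 8.

```python
def population_exposure(concentration, population_density):
    """Hypothetical exposure metric: concentration weighted by population density."""
    return concentration * population_density

def change_since_reference(reference_value, later_value):
    """Change of a quantity relative to its reference-year value."""
    return later_value - reference_value

# Made-up values for one sparsely populated grid cell.
conc_ref, conc_later = 10.0, 12.0  # NO2 in µg/m3: the concentration rises
pop_ref, pop_later = 5.0, 4.0      # people per km2: the population density falls

conc_change = change_since_reference(conc_ref, conc_later)
exposure_change = change_since_reference(
    population_exposure(conc_ref, pop_ref),
    population_exposure(conc_later, pop_later),
)
print(conc_change, exposure_change)
```

Under this assumption, a modest fall in an already low population density is enough to offset a rise in concentration, which is one way a cell can show rising NO2 but falling exposure.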

The NO2 concentrations and population exposure showed a significant increasing trend from 2005 to 2012, and the NO2 population exposure persistently increased in most cities. During this period, industrial development in China was rapid. Industrial production increased from 7795.83 billion yuan in 2005 to 20,890.14 billion yuan in 2012, but the industries at this time were mostly extensive (low-efficiency) industries, which seriously polluted the environment. Furthermore, the number of motor vehicles rapidly increased from 43.29 million to 120 million in this period, and vehicle exhaust emissions are also a major source of NO2. Thus, the rapid growth of industry and vehicle ownership was responsible for the significant increase in NO2 concentration and NO2 population exposure during this period. Northern provinces such as Henan and Hebei have a large concentration of heavy industry and population, so their NO2 population exposure is the highest in the country.
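
To put the figures quoted above in perspective, the short calculation below derives the overall growth factor and an implied compound annual growth rate. It assumes seven annual steps between 2005 and 2012 and that the quoted values are comparable year-end figures; the function name `growth_summary` is hypothetical.

```python
def growth_summary(start_value, end_value, years):
    """Return (overall growth factor, compound annual growth rate)."""
    ratio = end_value / start_value
    cagr = ratio ** (1.0 / years) - 1.0
    return ratio, cagr

# Figures quoted in the text; the seven-year span (2005 to 2012) is an assumption.
industry_ratio, industry_cagr = growth_summary(7795.83, 20890.14, 7)  # billion yuan
vehicle_ratio, vehicle_cagr = growth_summary(43.29, 120.0, 7)         # million vehicles

print(f"industry: x{industry_ratio:.2f}, {industry_cagr:.1%} per year")
print(f"vehicles: x{vehicle_ratio:.2f}, {vehicle_cagr:.1%} per year")
```

Both quantities grew roughly 2.7-fold, which corresponds to about 15 to 16% per year under these assumptions.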

The NO2 concentration and NO2 population exposure showed a decreasing trend nationwide from 2013 to 2020. The decline over this period was mainly due to Chinese environmental protection policies, including the mandatory cleanup of coal and the extremely stringent environmental protection law enacted in 2015. These measures have significantly limited pollutant emissions and increased penalties for companies that violate the law on emissions. Meanwhile, the new regulations imposed strict requirements on government management and incorporated the effectiveness of pollution control into the government’s performance evaluation. Chinese ecological and environmental departments issued a total of 224.8 billion yuan in ecological and environmental funds from 2016 to 2020 and initially established evaluation systems for air, water, and soil environmental protection. The NO2 concentrations in 2020 were significantly reduced compared to those in 2015. Although the COVID-19 epidemic also had some impact on the decrease in NO2 concentrations, life in China largely normalized in the second half of 2020, so the epidemic had a limited impact on the NO2 concentrations over the year as a whole.

Conclusion

In this study, a random forest model based on OMI data and the ERA5 reanalysis product was constructed to retrieve ground-level NO2 concentrations in China at a resolution of 0.125° × 0.125° from 2005 to 2020. The MAE, RMSE, and R2 of the model were 4.16 µg/m3, 5.79 µg/m3, and 0.79, respectively. The model results showed clear spatial aggregation of the NO2 concentration, which was consistent with NO2 population exposure. The average NO2 concentration in each season tended to increase and then decrease, which was consistent with the trend of the annual NO2 concentration. However, the maximum concentrations in autumn and winter still rose and exceeded China's environmental pollution standard, indicating that NO2 pollution did not improve significantly in some areas during autumn and winter.

The cities with high NO2 population exposure and the areas with high NO2 concentrations largely overlap. Although its NO2 concentration was lower than those of Henan and Hebei Provinces, Shandong Province had a higher NO2 population exposure due to its dense population. NO2 population exposure increased significantly in most cities from 2005 to 2012, and most of the increase during this period was persistent. Because the NO2 concentration in southern China was relatively low, this result suggests that the increasing population density there led to increased NO2 population exposure. In contrast, the nonpersistent increase in NO2 population exposure in the northern cities was likely due to population outflow. Most cities experienced a significant decrease in NO2 population exposure from 2013 to 2020, but fewer cities declined in this period than had risen in the previous one. The main reason for the significant upwards trend in both NO2 concentrations and NO2 population exposure from 2005 to 2012 was the rapid growth of industry and car ownership. The decline from 2013 to 2020 was mainly due to Chinese environmental protection policies.

By 2020, the southern cities still maintained low NO2 population exposure, and NO2 population exposure in the eastern and northern cities had improved significantly. However, the reduction in NO2 population exposure in the western region was not significant. Urumqi, Lanzhou, and Chengdu still maintained high NO2 population exposure, which indicates that the major cities in the western region require more attention. In these cities with high NO2 population exposure, the environmental protection agency (EPA) could install more NO2 concentration monitoring instruments and broadcast real-time NO2 concentrations so that people can avoid areas with high NO2 concentrations. For NO2 emission sources, the EPA could require factories to clean their emissions and encourage people to replace their fuel-powered vehicles with new energy electric vehicles.

This study also has some shortcomings. Some studies suggest that OMI data are somewhat underestimated in urban areas (Qin et al. 2020), which may lead to underestimation of ground-level NO2 concentrations in some regions. In addition, the uneven distribution of ground monitoring stations may introduce errors into the model. In a follow-up study, we intend to use multiple satellite datasets or introduce more geographic auxiliary elements, such as road data, to improve the accuracy of the model and then study the human health effects of prolonged exposure to high NO2 concentrations.