Introduction

China, the world’s second largest economy, has experienced severe atmospheric deterioration for decades, which has slowed economic growth and caused 1.6 million premature deaths in 2017 [1]. Air pollution, which has elevated daily hospital admissions in 218 cities of China [2], increases the incidence of cardiopulmonary and cardiovascular diseases, respiratory infection and hypermethylation [3]. Among air pollutants, atmospheric particulate matter (PM) draws the most public concern because of its toxicity and carcinogenicity [4]. PM can also host bacterial and fungal pathogens [5,6]. A 1% rise in PM2.5 has been found to increase healthcare expenditure in China by 2.9% [7]. Moreover, atmospheric aerosols are possible contributors to weather and climate change [8,9,10,11].

This work studies atmospheric particulate matter with an aerodynamic equivalent diameter of less than 10 μm (PM10), whose main emission sources are coal-based thermal power plants, coal-based domestic heating, automobiles, and fugitive dust from roads, construction sites and unpaved soil [12]. We select PM10 because it strongly affects human health and imposes a substantial disease burden [13,14,15]. In addition, part of PM10 originates from long-range transported sandstorms [16], so assessing the relative significance of long-range transport and indigenous emission is of great importance. Several previous works investigated PM in the megacities of China via outdoor observation [17,18,19]. Machine learning, which derives algorithms automatically from large datasets, removes the need for an air pollution emission inventory, the linchpin of conventional atmospheric models, and is therefore a more flexible approach [20,21,22,23]. Compared to inventory-based air quality models, machine learning offers an alternative and more accurate method to interpret air pollutant concentrations, and it is now a popular topic in atmospheric research. Feng et al. [23] proposed a method to forecast air pollutants in Hangzhou using machine learning. Chen et al. [24] used a deep neural network to estimate PM2.5 concentrations across China. Yan et al. [25] developed a deep learning model to improve the interpretability and predictive accuracy of satellite-based PM2.5. Han et al. [26] estimated air quality in Beijing during 2008–2012 with a Bayesian Multi-task Long Short-Term Memory model. In this work, we select Recurrent Neural Network (RNN) and Random Forest (RF) to conduct a nationwide survey of PM10. Because the concentration of PM10 is much higher in winter than in the other seasons, we focus on wintertime (December, January and February) PM10 over five winters (December 2014 to February 2019).

The aims of this work are as follows: (1) identifying the different regional PM10 patterns and their determinants; (2) exploring the contributors to severe wintertime haze from a novel perspective and demonstrating the insignificance of long-range transport. The Methods section introduces the study areas, the data sources, and the parameters of the two machine learning models. The Results and discussion section explains the causes of the severity of wintertime haze and shows why long-range transported PM10 is insignificant except during sandstorms.

Methods

Investigated areas

The Beijing-Tianjin-Hebei plain (BTH) (37°–41° N, 114°–118° E), the Yangtze River Delta (YRD) (30°–33° N, 118°–122° E), the Pearl River Delta (PRD) (21.5°–24° N, 112°–115.5° E) and the Sichuan Basin (SCB) (28.5°–31.5° N, 103.5°–107° E) are the four most prosperous but polluted regions in China, representing the centers of North, East, South and West China, respectively. BTH, home to the capital city of Beijing and the centrally administered municipality of Tianjin, had a population of 110 million and produced over 10% of China’s gross domestic product (GDP) in 2017. YRD, home to the megacity of Shanghai, is the economic center of China; it accounted for 19% of China’s GDP and had a population of 150 million. The PRD urban agglomeration surrounding Hong Kong and Macao generated nearly 13% of China’s GDP with a population of 83 million. SCB, the economic and political center of West China, had a population of 114 million and contributed 7% of national GDP. Together, these four regions comprised 33% of the Chinese population, 8% of China’s land, and 49% of China’s GDP in 2017. However, all of these regions have suffered from severe PM10 pollution for decades because of rapid industrialization. To develop better control measures, the question arises whether the regional patterns of PM10 are the same. Because the natural and anthropogenic sources of PM10 are regionally heterogeneous, a reasonable assumption is that the determinants of PM10 vary among regions but remain consistent within a region. Nine regionally representative core cities, namely Beijing and Tianjin in BTH, Shanghai, Nanjing and Hangzhou in YRD, Guangzhou and Shenzhen in PRD, and Chengdu and Chongqing in SCB, are selected to investigate the regional PM10 patterns in wintertime. These nine cities, each with more than nine million residents, are among the most flourishing areas of China, with ever-growing urbanization. According to census data, 154 million permanent residents lived in these nine cities in 2017.

Data of wintertime air pollutants and meteorology

All the data used in this work are publicly accessible online. The study period covers wintertime (December, January and February) from 1 December 2014 to 28 February 2019. Hourly concentrations of air pollutants, including sulfur dioxide (SO2), nitrogen dioxide (NO2), surface-level tropospheric ozone (O3), carbon monoxide (CO), PM2.5 and PM10, were extracted from the official website of the China National Environmental Monitoring Centre (http://beijingair.sinaapp.com/), which records air pollution data from 1563 environmental monitoring sites across China. We chose the environmental monitoring sites in the nine investigated cities for training and testing. Data from all of the environmental monitoring sites in a city were used to calculate Feature Importance, and the average over those sites was used to predict hourly PM10 in Scenarios one and two. The meteorological data, comprising hourly temperature, relative humidity, atmospheric pressure, wind speed and wind direction, were obtained from the NASA Global Modeling and Assimilation Office (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2) and the University of Wyoming (http://www.weather.uwyo.edu/surface/meteorogram/seasia.shtml).

Parameters of random forest and RNN

A Recurrent Neural Network (RNN) can capture temporal context because it passes information from one time step to the next, which makes it suitable for simulating the accumulation and deposition of air pollutants. Random Forest (RF), a tree-structured model, can quantitatively rate the significance of each input in shaping the output by calculating the Feature Importance (FI). There are two types of Feature Importance, Variable Importance and Gini Importance. In this work, we chose Gini Importance.
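For orientation, impurity-based (Gini) importance is commonly defined as the mean decrease in impurity. The text does not state the formula used, so the expression below is an assumption based on that standard definition. The symbols are introduced here for illustration only: \(T\) is the number of trees, \(N\) the number of training samples in a tree, \(N_n\) the number of samples reaching node \(n\), \(i(n)\) the node impurity (for a continuous target such as PM10 this is typically the variance), \(n_L\) and \(n_R\) the child nodes, and \(v(n)\) the input on which node \(n\) splits.

```latex
% Impurity decrease achieved by the split at node n
\[
\Delta i(n) = i(n) - \frac{N_{n_L}}{N_n}\, i(n_L) - \frac{N_{n_R}}{N_n}\, i(n_R)
\]
% Importance of input j: weighted impurity decrease over all nodes
% that split on j, averaged over the T trees of the forest
\[
\mathrm{FI}_j = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{n \in t,\; v(n) = j} \frac{N_n}{N}\, \Delta i(n)
\]
```

Under this definition the scores are usually normalized so that they sum to one across inputs, which is what makes the FI of different inputs directly comparable.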

Several configurations of RF and RNN were tested and fine-tuned before we selected the final parameter settings. For RF, n-estimator is the number of trees built. A higher n-estimator makes the predictions stronger and more stable but also makes the code slower. Increasing max-features generally improves the performance of Random Forest but decreases the diversity of individual trees and slows down training. To strike a balance, we set max-features to auto, which takes all features into consideration and places no restriction on individual trees. Setting max depth to none means that nodes are expanded until all leaves are pure or all leaf nodes contain fewer samples than min samples split, which is set to two in this work. Min sample leaf is the minimum number of samples on a leaf node. Max leaf nodes caps the number of leaf nodes, with the best nodes defined by relative reduction in impurity and grown in best-first fashion; setting it to none places no restriction on the number of leaf nodes. For RNN, the activation function chosen was the most popular non-linear function, the rectified linear unit (ReLU), expressed as \(f\left( x \right) = \max \left( {x, 0} \right)\). As the number of hidden units grows, the prediction accuracy of RNN increases slightly but the running speed decreases. We set the number of hidden units to 300. Candidate learning rates are typically log-spaced, and changing the learning rate commonly does not yield significant improvement; we set it to \(10^{-3}\). The number of layers is set to 2, because in our tests a two-layer RNN predicted PM10 more accurately than a single-layer one.
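The settings described above can be gathered in one place, as in the minimal sketch below. It records only the values stated in the text. The dictionary names and key spellings are hypothetical (chosen for readability, not taken from the authors' code), and the number of trees and the minimum samples per leaf are left out because the text does not give them. The helper that lists log-spaced learning rates is likewise an illustration of what "log-spaced" means, not the authors' search procedure.

```python
# Values stated in the text; names and key spellings are hypothetical.
RF_SETTINGS = {
    "max_features": "auto",   # consider all features at each split
    "max_depth": None,        # grow until leaves are pure or too small to split
    "min_samples_split": 2,   # minimum samples required to split a node
    "max_leaf_nodes": None,   # no cap on the number of leaf nodes
}

RNN_SETTINGS = {
    "activation": "relu",
    "hidden_units": 300,
    "learning_rate": 1e-3,
    "num_layers": 2,
}


def relu(x):
    """Rectified linear unit: f(x) = max(x, 0)."""
    return max(x, 0.0)


def log_spaced_learning_rates(low_exponent, high_exponent):
    """Return one candidate learning rate per decade, 10**low .. 10**high."""
    return [10.0 ** e for e in range(low_exponent, high_exponent + 1)]


# Example: five candidates from 1e-5 to 1e-1; the chosen 1e-3 is the middle one.
candidates = log_spaced_learning_rates(-5, -1)
negative_output = relu(-2.5)
positive_output = relu(3.0)
```

Keeping the settings in plain dictionaries makes it easy to pass them to whichever library is used and to report them alongside the results.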

Results and discussion

Feature importance of PM10

Feature Importance (FI), calculated by Random Forest, quantifies how strongly each input affects the output: the higher the score of an input, the more significant that input is to the output. The hourly meteorological conditions and air pollutants in the five winters from December 2014 to February 2019 were used as inputs to calculate the long-term FI of PM10, shown in Fig. 1. First and foremost, Fig. 1 quantitatively demonstrates that gaseous air pollutants (SO2, NO2, O3 and CO) were more significant than meteorological conditions in shaping PM10, as the FI of the gaseous air pollutants exceeded the combined FI of the meteorological conditions. SO2 and NO2 were positively correlated with PM10 because they are the precursors of sulfate and nitrate, the main components of PM10 [27]. Surface-level tropospheric O3 and PM10 were negatively associated because PM10 speeds up the aerosol sink of hydroperoxy radicals [28]. The strongly positive association between CO and PM10 arose because they were emitted from the same sources, such as coal-based domestic heating and traffic. Possible chemical links between CO and PM10 need further investigation. In Beijing and Tianjin (BTH), the influence of CO on PM10 was far greater than that of the other gaseous air pollutants, and NO2 contributed more than SO2. In Shanghai, Nanjing and Hangzhou (YRD), SO2 played a more crucial role than NO2 in reproducing PM10; the influence of CO was also predominant in YRD but less critical than in BTH. In Guangzhou and Shenzhen (PRD), NO2 and SO2 had higher FI than CO, revealing a PM10 pattern in stark contrast with BTH and YRD. In SCB, CO and NO2 had the highest FI in Chengdu and Chongqing, respectively. These results corroborate the spatial heterogeneity of regional PM10 in China.

We then calculated the annual FI for PM10 in the aforementioned nine cities, shown in Table 1. Despite year-to-year fluctuations in FI, the results are consistent for wintertime PM10 within a city. CO is associated with incomplete combustion in coal-based domestic heating, while NO2 is mainly emitted by motor vehicles. Curbing coal-based domestic heating in BTH and YRD and controlling vehicle emissions in PRD and SCB are therefore the best ways to lower PM10.

Figure 1. Feature Importance for PM10 in wintertime from December 2014 to February 2019.

Table 1. FI of wintertime PM10 in nine regional core cities in Scenario one.

Prediction of PM10 using SO2, NO2 and CO as inputs

Because gaseous air pollutants (SO2, NO2 and CO) play leading roles in shaping PM10, they are used to predict hourly PM10 without meteorological inputs. In Scenario one, the training period is December and February and the testing period is January, with training and testing data from the same city. The Pearson correlation coefficient (R) and the Root Mean Square Error (RMSE) are used as the two statistical indicators of RF and RNN performance; the results are shown in Table 2 and Fig. 2. As Table 2 indicates, both RF and RNN simulate hourly PM10 with good accuracy using only three gaseous air pollutants as inputs. In most cases, R between the hourly observed and RF/RNN-simulated data is larger than 0.8. RNN is designed for time series, as it recursively links the data along the direction of the sequence. In this case, however, the fact that RNN does not outperform Random Forest in all nine cities signals that PM10 was not strongly linked to the time series at a one-hour interval. In other words, the concentration of PM10 at a given time is more closely related to the gaseous air pollutants at the same time than to their levels one hour earlier. Consistent with this, when the gaseous air pollutants at timestamp (T-1) are used as inputs, RF and RNN predict PM10 at timestamp T slightly worse than when the gaseous air pollutants at timestamp T are used. Moreover, the Pearson correlation coefficient between PM10 and the concomitant gaseous pollutants at timestamp T is greater than that between PM10 at timestamp T and the gaseous pollutants at timestamp (T-1). This finding suggests that PM10 and the gaseous air pollutants were in thermodynamic equilibrium, and implies that the formation and deposition of PM10 tended to occur in less than one hour. Furthermore, when training and testing data are taken from different cities, the prediction accuracy decreases, implying that every city had its own unique PM10 pattern.
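The two evaluation indicators and the same-hour versus one-hour-lag comparison can be sketched in plain Python as follows. The function names and the sample numbers are illustrative assumptions and are not taken from the study's data or code. `lagged_pairs` simply aligns the input at timestamp (T-lag) with PM10 at timestamp T.

```python
import math


def pearson_r(observed, simulated):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(simulated) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    return cov / math.sqrt(var_o * var_s)


def rmse(observed, simulated):
    """Root mean square error between observation and simulation."""
    n = len(observed)
    if n != len(simulated) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / n)


def lagged_pairs(gas, pm10, lag):
    """Pair the gas value at hour T-lag with the PM10 value at hour T."""
    if lag == 0:
        return list(gas), list(pm10)
    return list(gas[:len(gas) - lag]), list(pm10[lag:])


# Made-up hourly values, for illustration only.
observed_pm10 = [1.0, 2.0, 3.0, 4.0]
simulated_pm10 = [2.0, 4.0, 6.0, 8.0]
r_value = pearson_r(observed_pm10, simulated_pm10)
rmse_value = rmse(observed_pm10, simulated_pm10)
gas_lagged, pm10_aligned = lagged_pairs([1, 2, 3, 4], [10, 20, 30, 40], 1)
```

Computing `pearson_r` once on the pairs with `lag=0` and once with `lag=1` reproduces the kind of comparison described above. Note that a perfectly proportional simulation gives R = 1 even when RMSE is large, which is why the two indicators are reported together.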

Table 2. Performance of machine learning in predicting hourly PM10.

Figure 2. Observed and simulated PM10 in January 2019: Scenario one.

Thermodynamic equilibrium between gaseous air pollutants and PM10

As Fig. 2 and Table 3 show, both RF and RNN underestimate PM10 in all nine cities in Scenario one. In Scenario two, by contrast, the testing period is the hourly PM10 of one day in January 2019 and the training period is the hourly PM10 of the remaining days in January 2019. Training and testing data are again from the same city, and the inputs are again SO2, NO2 and CO. The results are given in Fig. 3, which shows that the underestimation does not occur in Scenario two. In addition, we trained RF and RNN separately on the gaseous pollutants in January 2018 and in December 2017/February 2018. The results are similar: the PM10 predicted for January 2019 is higher when January 2018 data are used for training than when February 2018 data are used. Both RF and RNN underestimate the PM10 level in all nine cities when trained on data from December 2017 and February 2018, as in Scenario one, indicating that this is a ubiquitous phenomenon.
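The two train/test designs can be expressed as simple filters on timestamped records. The sketch below assumes each hourly record is a dictionary with a `time` key holding a `datetime`. That layout, the function names and the sample values are hypothetical and serve only to make the splits concrete.

```python
from datetime import datetime


def split_scenario_one(records):
    """Scenario one: train on December and February, test on January."""
    train = [r for r in records if r["time"].month in (12, 2)]
    test = [r for r in records if r["time"].month == 1]
    return train, test


def split_scenario_two(records, year, test_day):
    """Scenario two: within one January, test on one day, train on the rest."""
    january = [r for r in records
               if r["time"].year == year and r["time"].month == 1]
    test = [r for r in january if r["time"].day == test_day]
    train = [r for r in january if r["time"].day != test_day]
    return train, test


# Made-up records, for illustration only.
sample_records = [
    {"time": datetime(2018, 12, 15, 8), "pm10": 80.0},
    {"time": datetime(2019, 1, 10, 8), "pm10": 120.0},
    {"time": datetime(2019, 1, 11, 8), "pm10": 95.0},
    {"time": datetime(2019, 2, 1, 8), "pm10": 70.0},
]
train_one, test_one = split_scenario_one(sample_records)
train_two, test_two = split_scenario_two(sample_records, 2019, 10)
```

The key difference is that Scenario two trains and tests within the same month, so both sets share the same temperature regime, whereas Scenario one tests on a colder month than it trains on.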

Table 3. Monthly average observed and predicted PM10 in January 2019: Scenario one.

Figure 3. Observed and simulated PM10 in January 2019: Scenario two.

Two underlying causes account for this. The major reason is that the chemical processes by which sulfur dioxide forms sulfate and nitrogen dioxide forms nitrate are exothermic. Since the temperature in January is lower than that in December and February, the thermodynamic equilibrium shifts in favor of PM10 formation in January. Moreover, indigenous flora plays an important role in the removal of PM10 [29,30,31,32]. As the leaf area index dwindles and the metabolism of trees slows down with decreasing temperature, the change in the phenology of indigenous plants is the minor reason for the severity of PM10 in wintertime.
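The temperature argument can be made explicit with the van 't Hoff relation, which is standard thermodynamics and is not an equation given in the text. Here \(K\) is the equilibrium constant of the gas-to-particle reaction, \(\Delta H^{\circ}\) its standard reaction enthalpy, \(T\) the absolute temperature, and \(R_g\) the gas constant (written \(R_g\) to avoid confusion with the Pearson coefficient R).

```latex
\[
\frac{d \ln K}{dT} = \frac{\Delta H^{\circ}}{R_g T^{2}}
\]
% For an exothermic reaction, \Delta H^{\circ} < 0, so d ln K / dT < 0:
% K increases as T decreases, and the equilibrium shifts toward the
% particulate products (sulfate, nitrate) in the colder month.
```

A model trained on the warmer months (December and February) therefore learns a gas-to-PM10 relation with a smaller effective \(K\) than applies in January, which is consistent with the underestimation seen in Scenario one.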

Insignificance of long-range transport

This work is partly motivated by the heated debate in several previous studies [33,34,35,36,37]. Guo et al. [33] inferred that primary emissions and regional transport of PM in Beijing were insignificant in producing haze. Li et al. [34] disagreed with Guo et al. [33], insisting that long-range transport was the major cause of severe haze in Beijing. Zhang et al. [35] contended that the back trajectory analysis by Li et al. [34] was unsuitable for urban-scale investigations and that polluted periods in Beijing were typically linked to stagnant conditions with weak and variable winds. Cao and Zhang [36] criticized Guo et al. [33] for ignoring non-fossil emission sources, such as biomass burning, cooking, and biogenic emissions. Zhang et al. [37] opposed Cao and Zhang [36], stating that there was little evidence that biogenic sources are a dominant contributor to severe urban PM pollution worldwide. According to Ni et al. [38], horizontal transport of air pollutants over more than 300 km is considered long-distance transport.

Machine learning can help assess this debate. The development of haze can be ascribed to a build-up of gaseous precursors, an increase in primary emissions, or long-range transport. The lifespans of SO2 and NOx are short [33]. Gaseous air pollutants and solid PM10 have different physical characteristics, making them unlikely to be transported together over a long distance. Hence, our criterion for judging the causes of the rises and falls in PM10 level is as follows: when gaseous air pollutants (SO2, NO2 and CO) are used as inputs, if RF and RNN capture the maximum, the high episodes were induced by an increase in secondary inorganic aerosols or a change in primary sources; otherwise, they were caused by long-range transport.
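The attribution criterion can be written as a small decision rule. The text does not define numerically what it means for a model to "catch" the maximum, so the relative `tolerance` below is an explicit assumption, as are the function names and the sample values. The rule compares the simulated value at the hour of the observed peak with the observed peak itself.

```python
def captures_peak(observed, simulated, tolerance):
    """True if the simulation is within `tolerance` (relative) of the
    observed maximum at the hour when that maximum occurs.
    The tolerance is an assumed threshold, not a value from the study."""
    peak_index = observed.index(max(observed))
    peak = observed[peak_index]
    return abs(simulated[peak_index] - peak) <= tolerance * peak


def attribute_episode(observed, simulated, tolerance):
    """Apply the criterion: a captured peak points to local causes,
    a missed peak points to long-range transport."""
    if captures_peak(observed, simulated, tolerance):
        return "secondary inorganic aerosols or primary sources"
    return "long-range transport"


# Made-up hourly PM10 values, for illustration only.
observed_episode = [50.0, 120.0, 300.0, 90.0]
simulated_close = [55.0, 110.0, 280.0, 95.0]   # follows the peak
simulated_flat = [55.0, 110.0, 120.0, 95.0]    # misses the peak
verdict_close = attribute_episode(observed_episode, simulated_close, 0.2)
verdict_flat = attribute_episode(observed_episode, simulated_flat, 0.2)
```

The verdict depends on the chosen tolerance, so any application of this rule should report the threshold alongside the conclusion.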

On average, the monthly mean discrepancy between simulation and observation in Scenario two is less than 15% of the observation. Hence, RF and RNN capture the fluctuations of PM10 using only gaseous air pollutants as inputs, indicating the insignificance of long-range transport. In urban areas of China, fugitive dust from roads, construction sites, and unpaved soil normally accounts for 30%–50% of PM10 and is referred to as primary PM10 [39]. CO is a plausible indicator of primary PM10. Sporadic sandstorms may induce the long-range transport of PM10 from the distant deserts of northwestern China [40]. Because RF and RNN capture all the fluctuations of PM10 using gaseous air pollutants as inputs, long-range transport induced by sporadic sandstorms did not occur in January 2019. Thus, we support the viewpoint of Guo et al. [33].

Conclusion

Air pollution has become a pressing issue in China in recent years. In this work, we take a closer look at PM10 and draw the following conclusions. PM10 was more strongly statistically correlated with the gaseous air pollutants (SO2, NO2 and CO) than with meteorological conditions. The spatial heterogeneity and temporal homogeneity of PM10 in China are quantitatively documented, signifying that each city had its own unique PM10 pattern. RNN and RF can accurately predict hourly PM10 using only SO2, NO2 and CO as inputs. Long-range transported PM10 was insignificant for haze. The severity of PM10 was affected by the shift of thermodynamic equilibrium and the phenology of local flora.