Introduction

Air quality is increasingly important because it affects both the natural environment and human health. Links between air pollution and cardiovascular disease, lung cancer and respiratory disease have been confirmed in several studies1,2,3. Real-time monitoring of the concentrations of the major atmospheric pollutants (the "two dusts and four gases": PM2.5, PM10, CO, NO2, SO2 and O3) is therefore becoming increasingly necessary for the relevant national departments. National monitoring and control stations (national control points, ncp) measure pollutant concentrations, and their "two dusts and four gases" data are considered accurate. However, because of their cost, national control points are few in number, which makes it difficult to meet the requirements for real-time air quality monitoring. Miniature air quality detectors (self-built points, sbp) are therefore deployed on a grid in some areas. They provide real-time air quality monitoring and also record meteorological parameters in the area (temperature, humidity, wind speed, pressure and precipitation). Because the electrochemical sensors used in self-built points are subject to interference from external factors, their measurements contain errors4, so the self-built point data need to be calibrated against the national control point data.

Mechanism models based on atmospheric chemical analysis and statistical models based on machine learning are often used to predict pollutant concentrations. The former use meteorological principles and mathematical methods to simulate the chemical and physical processes that pollutants undergo5,6. The latter apply statistical methods to collected pollutant data and use mathematical algorithms to model the relationships between variables. In research based on machine learning, the main algorithms are artificial neural networks7,8,9, multiple linear regression10,11,12, hidden Markov models13,14, random forests15,16,17 and support vector machines18,19,20.

An artificial neural network (ANN) is an information processing system that simulates the thinking and reasoning of the human brain. It has been a research hotspot in artificial intelligence since the 1980s and has made progress in many research fields. Its main advantage is its strong nonlinear fitting ability: it can map arbitrarily complex nonlinear relationships. ANNs also offer strong associative storage, robustness and autonomous learning. However, an ANN converts all the characteristics of a problem into numbers and all reasoning into numerical calculation21,22,23, so it cannot explain its reasoning process. Multiple linear regression (MLR), a mature method for solving linear problems, has been widely used in various fields. Its advantage is that it is convenient and simple for analyzing multi-factor models. Given the same data and the same model, the calculated result is unique, and each regression coefficient in the model is readily interpretable11,24,25. However, MLR places strict requirements on the selection of independent variables and on the error terms, and it is greatly restricted in solving nonlinear problems.

Artificial neural networks and multiple linear regression models are widely used in air quality prediction. Elangasinghe et al. used a two-step calibration method combining multiple linear regression and machine learning to correct the NO2 concentration measured by a sensor. They compared different machine learning methods using five evaluation indicators and identified the best model7. Reich et al. used artificial neural networks to identify air pollution sources. They chose a three-layer feedforward ANN trained by the backpropagation algorithm and successfully repaired some of the data in the model9. Spinelle et al. compared linear/multilinear regression with supervised learning techniques and carried out on-site calibration of NO, CO and CO2 sensors10. However, both linear regression and artificial neural networks have shortcomings in air quality prediction26. In this paper, the two methods are combined to give a calibration model for the main air pollutants, with the aim of improving both the interpretability and the accuracy of air quality calibration.

Material and methods

Data source and preprocessing

This article uses the Problem D data of the 2019 Chinese college students' mathematical modeling contest. The data set provides hourly data from a national control point from November 14, 2018 to June 11, 2019, together with data from a self-built point co-located with it (the self-built point records are aligned with the national control point time, at intervals of no more than 5 min). The data are pre-processed before exploratory analysis. First, records for which the self-built point and the national control point cannot be matched, and records that are obviously abnormal, are deleted. Second, the self-built point records within each hour are grouped and averaged so that they correspond to the hourly data of the national control point. After preprocessing, a total of 4135 sets of data were obtained as research objects27. Table 1 shows the range, mean, and standard deviation of each variable.
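The hourly aggregation step can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that each self-built point reading is assigned to its clock hour and that hours missing from either series are dropped. The function names and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_means(readings):
    """Average sub-hourly (timestamp, value) readings into one value per clock hour."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Assumption: a reading belongs to the clock hour in which it was taken.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in buckets.items()}


def match_to_reference(sbp_hourly, ncp_hourly):
    """Keep only hours present in both series, as (hour, sbp, ncp) rows."""
    common = sorted(set(sbp_hourly) & set(ncp_hourly))
    return [(hour, sbp_hourly[hour], ncp_hourly[hour]) for hour in common]


# Hypothetical sample: three self-built point readings, one national control point hour.
sbp = [
    (datetime(2018, 11, 14, 10, 2), 50.0),
    (datetime(2018, 11, 14, 10, 7), 54.0),
    (datetime(2018, 11, 14, 11, 3), 60.0),
]
ncp = {datetime(2018, 11, 14, 10, 0): 48.0}
rows = match_to_reference(hourly_means(sbp), ncp)
```

In this sample only the 10:00 hour survives, because the national control point has no record for 11:00.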

Table 1 Descriptive statistics of air quality variables from data from national control points and self-built points.

Data exploratory analysis

Building a statistical model usually starts with exploratory analysis of the data11,28,29. In this paper, the "two dusts and four gases" concentrations measured at the self-built point are corrected using the national control point data as the reference. To show the difference between the two data sources more intuitively, we calculated daily averages of the 4135 preprocessed sets of data and compared the resulting pollutant concentrations.

In Fig. 1, the blue curve shows the national control point measurements and the red curve shows the self-built point measurements. The "two dusts and four gases" concentrations measured at the national control point and at the self-built point are generally consistent, but there is a certain deviation between them. The deviation is noticeably larger in the earlier part of the period, which may be caused by the season or by zero drift of the measuring instrument. Because the PM2.5, PM10 and O3 concentrations change markedly over time, we draw box plots10 of the monthly variation of the "two dusts and four gases" concentrations at the national control point, as shown in Fig. 2.

Figure 1

Comparison of daily average data of six types of pollutants at national control points and self-built points.

Figure 2

Comparison of monthly average data of six types of pollutants at national control points and self-built points. Figures are generated using Matlab (Version R2016a, https://www.mat-hworks.com/) (software).

Figure 2 shows that the average PM2.5, PM10, CO and SO2 concentrations are highest in November, the average NO2 concentration is highest in January, and the average O3 concentration is highest in June. The average PM2.5, CO and SO2 concentrations are lowest in May, the average PM10 concentration is lowest in June, the average NO2 concentration is lowest in November, and the average O3 concentration is lowest in December. The "two dusts and four gases" concentrations vary significantly from month to month, so time is an important factor affecting them.

Correlation analysis

Air quality is judged from the concentrations of pollutants in the air1. Many factors affect air quality, and they also affect each other. To determine the correlation between the "two dusts and four gases" concentrations and the five climate factors30, we use Eq. (1) to compute the Pearson correlation coefficients between them, as shown in Table 2. Except for NO2 concentration and temperature, all pairs of variables are significantly correlated, indicating that the factors affecting the concentration of each pollutant are very complex. The correlation coefficient between PM2.5 and PM10 concentrations is as high as 0.89, indicating a strong positive correlation, and the correlation coefficient between temperature and air pressure is -0.85, indicating that higher temperatures are associated with lower pressures. Figure 3 is a matrix color block diagram of the "two dusts and four gases" concentrations and the five climatic factors, which shows the correlation coefficients between the variables visually. The size of each color block represents the absolute value of the correlation coefficient. As the color becomes lighter, the value of the correlation coefficient gradually increases.

Table 2 Pearson linear correlation coefficients between six types of air pollutant concentrations and climate factors (* indicates a significant correlation at the 0.05 significance level).
Figure 3

Correlation coefficient matrix color block diagram between six types of air pollutant concentrations and climate.

$$r=\frac{\sum_{i=1}^{n}({x}_{i}-\overline{x})({y}_{i}-\overline{y})}{\sqrt{\sum_{i=1}^{n}{({x}_{i}-\overline{x})}^{2}}\cdot \sqrt{\sum_{i=1}^{n}{({y}_{i}-\overline{y})}^{2}}}$$
(1)
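As an illustration of Eq. (1), the Pearson correlation coefficient of two equally long series can be computed as in the following sketch. The function name is our own, and the significance test reported in Table 2 is not reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences, as in Eq. (1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return cov / (sqrt(sxx) * sqrt(syy))
```

A perfectly linear increasing pair such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]` gives a coefficient of 1 up to rounding, and reversing one of the series gives -1.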

Establishment of sensor calibration model

Introduction to basic principles

The artificial neural network is one of the most commonly used methods for predicting atmospheric pollutant concentrations. Through learning it can approximate any non-linear mapping, which gives it broad application prospects in the prediction of non-linear systems. Prediction with an artificial neural network proceeds in two steps: first, the training samples are used to design and train the network and obtain the prediction rules; then the test samples are predicted according to these rules, and the accuracy of the test results is used to verify reliability. The main advantage of artificial neural network algorithms is their strong adaptability to the training samples. They handle uncertain information well and still work normally with noisy or non-linear data. They are robust, have memory and non-linear mapping ability, learn by themselves during training, and quickly produce results for complex prediction problems. According to the relevant literature, the model most commonly used in the research and application of neural networks is the multilayer perceptron neural network31,32,33.

The multilayer perceptron (MLP) neural network is a multilayer feedforward network with unidirectional propagation, trained with the error backpropagation algorithm. As shown in Fig. 4, its structure can be divided into three layers: the input layer, the hidden layer and the output layer. Each layer consists of multiple nodes, and each layer passes its output to the next layer until the output layer is reached. Except for the input nodes, each node is a neuron with a non-linear activation function. Equation (2) gives its output, where \({\upomega }_{nj}\) is the node weight and \({b}_{jk}\) is the bias.

Figure 4

Multilayer perceptron neural network structure.

$${o}_{k}=\sum_{j}{\upomega }_{nj}{x}_{n}+{b}_{jk}$$
(2)
$$J(\upomega ,b;x,y)=\frac{1}{2}{\Vert {o}_{\upomega ,b}(x)-y\Vert }^{2}$$
(3)

MLP is a typical supervised learning algorithm, and its loss function is defined in Eq. (3), where \({o}_{\upomega ,b}(x)\) is the output value of the MLP and y is the actual value. In this paper, the parameters are adjusted by the conjugate gradient method to minimize the loss function. The calculation formulas of the conjugate gradient method are Eqs. (4) and (5), where g(n) is the gradient of the loss function at iteration n, S(n) is the search direction and β is the conjugacy coefficient. An MLP neural network can have one or several hidden layers. However, as long as the number of neuron nodes in the hidden layer is adjusted appropriately, a single-hidden-layer neural network can approximate any nonlinear function34,35. A single hidden layer can therefore meet most engineering needs. When SPSS is used for the calculation, the software determines the number of hidden layer neurons automatically and reports the relatively optimal number for the model.
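The forward pass of a single-hidden-layer MLP and the loss of Eq. (3) can be sketched as below. This is an illustrative sketch rather than the SPSS implementation: the hyperbolic tangent hidden units and identity output units follow the configuration described later in this paper, while the function names and the weight layout (one row of weights per neuron) are our own assumptions.

```python
from math import tanh


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a single-hidden-layer MLP.

    Each hidden neuron applies tanh to its weighted input sum plus bias;
    each output neuron applies the identity function.
    """
    hidden = [
        tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]


def squared_error_loss(output, target):
    """Loss of Eq. (3): half the squared Euclidean distance between output and target."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))
```

With all hidden weights and biases set to zero, the hidden activations are tanh(0) = 0, so the network output reduces to the output bias, which gives a quick sanity check of the implementation.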

$$S\left(n+1\right)=-g\left(n+1\right)+\beta (n+1)\times S(n)$$
(4)
$$\beta \left(n+1\right)=\frac{{\left(-g\left(n+1\right)\right)}^{T}\times \left(g\left(n\right)-g\left(n+1\right)\right)}{g{\left(n\right)}^{T}\times g\left(n\right)}$$
(5)
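To make the update rule of Eqs. (4) and (5) concrete, the following sketch applies it to a small quadratic objective \(\frac{1}{2}{w}^{T}Aw-{b}^{T}w\), whose gradient is \(Aw-b\). It is a toy example under stated assumptions, not the scaled conjugate gradient routine of SPSS: the step length comes from an exact line search, which is only available in closed form for quadratics, and the matrix, vector and function names are hypothetical.

```python
def mat_vec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, w, iterations):
    """Minimise 0.5*w'Aw - b'w with the direction update of Eqs. (4) and (5)."""
    g = [gi - bi for gi, bi in zip(mat_vec(a, w), b)]  # gradient A w - b
    s = [-gi for gi in g]                              # first direction: steepest descent
    for _ in range(iterations):
        if dot(g, g) < 1e-24:                          # gradient vanished: converged
            break
        a_s = mat_vec(a, s)
        step = -dot(g, s) / dot(s, a_s)                # exact line search (quadratic only)
        w = [wi + step * si for wi, si in zip(w, s)]
        g_new = [gi + step * asi for gi, asi in zip(g, a_s)]
        # Eq. (5): beta = (-g(n+1))' (g(n) - g(n+1)) / (g(n)' g(n))
        beta = dot([-x for x in g_new],
                   [go - gn for go, gn in zip(g, g_new)]) / dot(g, g)
        # Eq. (4): S(n+1) = -g(n+1) + beta * S(n)
        s = [-gn + beta * si for gn, si in zip(g_new, s)]
        g = g_new
    return w


# Hypothetical 2x2 symmetric positive definite system.
w = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0], iterations=2)
```

For an n-dimensional quadratic the method reaches the minimiser in at most n iterations in exact arithmetic, so two iterations suffice here and `w` is close to (1/11, 7/11). For the non-quadratic loss of an MLP the step length has to be found differently, which is where scaled conjugate gradient departs from this sketch.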

The "two dusts and four gases" concentrations are affected by many factors, including climatic factors, the concentrations of other pollutants and the range drift of the sensor itself. A simple regression model can only describe the linear effect of each variable on the pollutant concentration. Using a neural network to weight the regression output appropriately and to introduce non-linear effects into the model can effectively improve the prediction accuracy and the correction of the self-built point pollutant concentrations.

In this paper, we build a combined model of stepwise regression analysis (SRA) and an artificial neural network, called the SRA-MLP model. First, a stepwise regression model is established from the influence of the various factors on the pollutant concentration, and it is used to give the fitted value of each pollutant at the corresponding moment. Then the SRA-MLP neural network model is established by taking the fitted values, the other data measured at the self-built point and the time as inputs, and the national control point data as outputs. The process of building the model is shown in Fig. 5.

Figure 5

The flow diagram of the regression process.

Stepwise regression model construction

We establish a multiple regression model with the pollutant concentration at the national control point as the dependent variable and the observation data from the self-built point as the independent variables. The key to building a multiple regression model is the choice of independent variables. If too few are selected, key variables are easily missed and the regression performs poorly. If too many are introduced, the model is prone to multicollinearity, which makes it very unstable and can even reverse the signs of coefficients. Commonly used variable selection methods are the forward, backward and stepwise methods. We use stepwise regression to build the model. The variables introduced into the model and their regression coefficients are given in Table 3.

Table 3 Stepwise regression model and model test of six types of air pollutant concentrations. In the model, the dependent variable is the concentration of the six pollutants at the national control point, and the independent variable is the variable and time monitored by the self-built point (– represents the variables eliminated in the model).

The F-test p-values of the six pollutant regression models are all less than 0.01, indicating that at the 0.01 significance level the variables introduced into each model have, as a whole, a significant effect on the pollutant concentration. The t-test p-value of each independent variable introduced into the models is less than 0.05, indicating that at the 0.05 significance level each of these variables has a significant effect on the pollutant concentration. The coefficient of determination of the PM2.5 concentration model is 0.908, indicating a very good fit; the coefficients of determination of the PM10 and O3 concentration models are both greater than 0.8, indicating a good fit; the coefficients of determination of the CO, NO2 and SO2 concentration models are all greater than 0.5, indicating an acceptable fit.
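The variable selection idea can be sketched as follows. This is a simplified illustration and not the procedure behind Table 3: it implements only the entry half of stepwise regression (forward selection), it uses a fixed partial F-statistic threshold `f_enter=4.0` instead of p-values, and both the threshold and the toy data are our own assumptions. The entry and removal criteria actually used are not stated in this paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept on the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)  # normal equations (X'X) beta = X'y
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def forward_select(candidates, y, f_enter=4.0):
    """Add, one at a time, the variable whose partial F-statistic is largest and >= f_enter."""
    selected = []
    current = rss([], y)
    while len(selected) < len(candidates):
        best_name, best_rss = None, current
        for name in candidates:
            if name in selected:
                continue
            trial = rss([candidates[s] for s in selected + [name]], y)
            if trial < best_rss:
                best_name, best_rss = name, trial
        dof = len(y) - len(selected) - 2  # residual degrees of freedom after entry
        if best_name is None or dof <= 0:
            break
        f_stat = float("inf") if best_rss <= 0.0 else (current - best_rss) / (best_rss / dof)
        if f_stat < f_enter:
            break
        selected.append(best_name)
        current = best_rss
    return selected


# Hypothetical toy data: y depends on x1 only, plus a small alternating disturbance.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
y = [2.0 * a + e for a, e in zip(x1, noise)]
chosen = forward_select({"x1": x1, "x2": x2}, y)
```

On this toy data only `x1` enters the model, because adding `x2` barely reduces the residual sum of squares. A full stepwise procedure would also re-test the variables already in the model after each entry and remove those that are no longer significant.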

SRA-MLP model construction

The miniature air quality detector can not only implement grid-based monitoring of the air quality in an area, but also monitor meteorological parameters such as temperature, humidity, wind speed, air pressure and precipitation. The pollutant concentrations fitted by the stepwise regression model and the data from the self-built point were used as covariates in the MLP model, and the pollutant concentrations at the national control point were used as the dependent variables. We use SPSS 20.0 to fit the non-linear relationship between the covariates and the dependent variables.

In an MLP neural network, the choice of the number of hidden layers and of the number of neurons in each layer is particularly important. On a small data set, too many hidden layers not only make the model more complicated but also lead to overfitting and poor generalization. MLP neural networks with one or two hidden layers are therefore generally used for small data sets. We establish MLP models with one and with two hidden layers for each of the six pollutants, and choose the model with the smaller error as the final prediction model for that pollutant. In the modeling process, the 4135 samples are randomly assigned to training, test and holdout samples in the ratio 7:2:1, and the hidden layer and output layer use the hyperbolic tangent function and the identity function as activation functions, respectively. Batch training is selected as the type of training, and scaled conjugate gradient is selected as the optimization algorithm. The software automatically calculates the number of units in each hidden layer and finally yields the SRA-MLP model36.
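A random 7:2:1 partition of the 4135 samples can be sketched as follows. The function name and the fixed seed are our own assumptions, and the partition sizes produced by SPSS may differ slightly because it assigns cases according to relative proportions.

```python
import random


def split_indices(n, ratios=(7, 2, 1), seed=0):
    """Randomly partition range(n) into training, test and holdout index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_test = n * ratios[1] // total
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    holdout = indices[n_train + n_test:]  # the remainder goes to the holdout set
    return train, test, holdout


train, test, holdout = split_indices(4135)
```

With integer rounding this gives 2894 training, 827 test and 414 holdout samples, and every sample belongs to exactly one partition.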

$$RMSE=\sqrt{\frac{1}{n}\sum_{t=1}^{n}{({y}_{t}-{w}_{t})}^{2}}$$
(6)
$$MAE=\frac{1}{n}\sum_{t=1}^{n}\left|{y}_{t}-{w}_{t}\right|$$
(7)
$$MAPE=\frac{1}{n}\sum_{t=1}^{n}\left|\frac{{y}_{t}-{w}_{t}}{{y}_{t}}\right|$$
(8)
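Equations (6) to (8) translate directly into code, as in the sketch below. We assume here that \({y}_{t}\) denotes the national control point value and \({w}_{t}\) the value being evaluated (model output or self-built point measurement); this reading is consistent with the division by \({y}_{t}\) in Eq. (8). MAPE is returned as a fraction, as in Eq. (8), and is undefined when any \({y}_{t}\) is zero.

```python
from math import sqrt


def rmse(y, w):
    """Root mean square error, Eq. (6)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, w)) / len(y))


def mae(y, w):
    """Mean absolute error, Eq. (7)."""
    return sum(abs(a - b) for a, b in zip(y, w)) / len(y)


def mape(y, w):
    """Mean absolute percent error as a fraction, Eq. (8); every y value must be non-zero."""
    return sum(abs((a - b) / a) for a, b in zip(y, w)) / len(y)
```

For example, with hypothetical reference values `[10.0, 20.0, 40.0]` and estimates `[12.0, 18.0, 44.0]`, the errors are -2, 2 and -4, so the RMSE is the square root of 8, the MAE is 8/3 and the MAPE is 0.4/3.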

This article uses the root mean square error (Eq. 6), mean absolute error (Eq. 7) and mean absolute percent error (Eq. 8) to determine the final number of hidden layers. The specific results are shown in Table 4. For NO2 and O3, the MLP model with two hidden layers performs better, so the two-hidden-layer SRA-MLP model is chosen for these pollutants. The numbers of neurons in the first and second hidden layers are 8 and 6 for the NO2 prediction model, and likewise 8 and 6 for the O3 prediction model. For PM2.5, PM10, CO and SO2 the one-hidden-layer SRA-MLP model is chosen, with 7, 6, 5 and 8 hidden layer neurons, respectively. As an example, the performance of the PM10 prediction model, which was selected at random, is shown in Fig. 6. The SRA-MLP model predicts very well on the training set, the validation set and the test set.

Table 4 Comparison of neural network errors between one hidden layer and two hidden layers. The first three columns are the model errors of one hidden layer of six types of pollutants, and the last three columns are the model errors of two hidden layers of six types of pollutants.
Figure 6

The prediction effect of PM10’s SRA-MLP model on the training set, validation set and test set.

Discussion

In the air quality prediction problem, the stepwise regression, MLP and SRA-MLP models can all fit the self-built point data. Each model can be verified through the error between its predicted values and the national control point data: the smaller this error, the better the model. This article uses the root mean square error, mean absolute error and mean absolute percent error to evaluate the models30. The specific results are shown in Tables 5, 6 and 7.

Table 5 RMSE of six types of air pollutant concentrations between self-built points, model forecast values and national control point.
Table 6 MAE of six types of air pollutant concentrations between self-built points, model forecast values and national control point.
Table 7 MAPE of six types of air pollutant concentrations between self-built points, model forecast values and national control point.

For the stepwise regression model as well as the MLP and SRA-MLP models, the prediction accuracy is better than the measurement accuracy of the self-built point. This shows that calibrating the self-built point measurements with any of the three models gives better results than using the raw measurements. Because the SRA-MLP model has the smallest error indices of the three models, it is selected to calibrate the self-built point measurements. Among the six pollutant prediction models, the PM10 model shows the largest improvement in RMSE, 74.4%, and also the largest improvement in MAE, 76.3%. The NO2 model shows the largest improvement in MAPE, 86.5%.
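The percentage improvements quoted above can be read as relative reductions of an error measure. The sketch below shows that calculation under this assumption. The formula is not stated explicitly in this paper, and the numbers in the example are hypothetical rather than taken from Tables 5 to 7.

```python
def relative_improvement(error_before, error_after):
    """Percentage reduction of an error measure relative to its value before calibration."""
    if error_before <= 0:
        raise ValueError("error_before must be positive")
    return (error_before - error_after) / error_before * 100.0


# Hypothetical example: an RMSE of 40.0 before calibration and 10.0 after it.
improvement = relative_improvement(40.0, 10.0)
```

In this example the error falls to a quarter of its original value, which corresponds to an improvement of 75%.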

The concentration of pollutants in the atmosphere has an obvious correlation with the periodic activities of human beings. The weekly averages of the six pollutant concentrations are plotted in Fig. 7. It can be seen that there is a significant deviation between the red self-built point data curve and the blue national control point data curve, but the black model fitting value (smp) curve deviates very little from the national control point data curve. The results show that the accuracy of the SRA-MLP model for predicting the concentration of pollutants is better than the accuracy of the self-built point measurement data.

Figure 7

Comparison of weekly average data of six types of pollutants at national control points, self-built points and SRA-MLP model correction points.

Conclusions

The air quality index (AQI) is a dimensionless index that quantitatively describes air quality and is often used to assess it. The main pollutants considered in air quality assessment are PM2.5, PM10, CO, NO2, SO2, O3, etc. Real-time monitoring of the "two dusts and four gases" concentrations is therefore very important for air quality monitoring.

Many countries have established national monitoring and control stations to monitor air pollutant concentrations. Although national control points monitor pollutants more accurately, they are expensive to deploy and maintain, so few are deployed and it is difficult for them to achieve full coverage. The miniature air quality detectors developed by some companies overcome these shortcomings, but their monitoring accuracy needs to be improved.

The pollutant correction model based on stepwise regression corrects the self-built point data to some extent and gives results that are easier to interpret, but its correction effect needs to be improved. Compared with regression models, artificial neural networks have a greater advantage in data correction. An artificial neural network does not rely on a typical distribution of the original data. It simulates human thinking to derive a non-linear mapping between the input and output of the system, and then makes intelligent inferences and predictions.

The SRA-MLP model given in this article combines the advantages of a stepwise regression model and an artificial neural network. It not only provides the quantitative relationship between the self-built point monitoring data and the concentrations of the six pollutants, but also greatly improves the accuracy with which these concentrations are predicted. The model uses 4135 sets of data spanning 206 days and covering all four seasons, and it shows good predictive ability on both the training set and the test set, so the model is very stable. The model plays a positive role in grid-based monitoring of pollutant concentrations and can guide the scientific deployment of miniature air quality detectors. It can also be extended to the prediction of other environmental pollution indices, such as those for water, soil, noise and light pollution. However, because this research uses a small data set, it is not suitable for deep learning. In future research, we hope to collect more data and use deep learning to improve the model.