Calibration of miniature air quality detector monitoring data with PCA–RVM–NAR combination model

Liu, Bing; Zhang, Yirui

doi:10.1038/s41598-022-13531-4

Article
Open access
Published: 04 June 2022

Volume 12, article number 9333, (2022)
Scientific Reports

Abstract

The development of miniature air quality detectors makes it possible to monitor air quality in real time and on a grid. However, the accuracy with which these detectors measure pollutants needs to be improved. In this paper, a PCA–RVM–NAR combined model is proposed to calibrate the measurements of the miniature air quality detector. First, correlation analysis is used to identify the main factors affecting pollutant concentrations. Second, principal component analysis is used to reduce the dimensionality of these factors and extract their main information. Third, taking the extracted principal components as independent variables and the observed pollutant concentrations as dependent variables, a PCA–RVM model is established with the relevance vector machine. Finally, a nonlinear autoregressive neural network is used to correct the error, which completes the PCA–RVM–NAR model. Root mean square error, goodness of fit, mean absolute error and relative mean absolute percent error are used to compare the calibration effect of the PCA–RVM–NAR model with that of other commonly used models, namely the multiple linear regression model, support vector machine, multilayer perceptron neural network and nonlinear autoregressive model with exogenous input. The results show that, for every pollutant, the PCA–RVM–NAR model achieves better calibration results than the other models on all four indicators. Using this model to correct the data of the miniature air quality detector can improve its accuracy by 77.8–93.9%.

Introduction

Certain air pollutants, such as PM_2.5, PM₁₀, CO, NO₂, SO₂, O₃ ("two dusts and four gases") can affect human health and cause respiratory diseases and cardiovascular diseases^1,2,3. According to statistics, more than 3 million people die worldwide due to air quality problems every year^4,5. Therefore, obtaining air pollutant concentration information is very necessary to control air pollution and prevent health problems caused by air pollution.

Air quality monitoring platform

Many large cities in developed countries have established air quality monitoring stations (national control points) to obtain information on the concentration of air pollutants. The concentrations measured by these stations are relatively accurate. However, because monitoring stations are expensive to build and maintain, they are deployed relatively sparsely. Another disadvantage of national control point monitoring is that the release of data is delayed, making it difficult to monitor the concentration of air pollutants across an entire region in real time. The development of miniature air quality detectors effectively overcomes these shortcomings of reference monitoring stations. The miniature air quality detector has low production and maintenance costs and is easy to install, so it can be deployed on a grid to cover specific areas. This paper calls the locations where miniature air quality detectors are installed self-built points. Another advantage of the miniature air quality detector is that its readings are easy to obtain, so it can monitor the concentration of air pollutants in real time^6,7. In addition, while monitoring the concentration of air pollutants, it can also monitor meteorological parameters such as temperature, humidity, wind speed, air pressure, and precipitation in the region.

Electrochemical sensors are among the core components of many miniature air quality detectors. Such a sensor works by reacting with the measured gas and producing an electrical signal proportional to the gas concentration. The gas enters through tiny capillary-shaped openings and reaches the electrode surface, so that an appropriate amount of gas reacts with the sensing electrode to form a sufficient electrical signal, which is then read out as the measurement. The miniature air quality detector will exhibit zero drift or span drift after a period of use. In addition, unconventional pollutants in the air, weather factors and similar influences also cause errors in its measurements⁸. Therefore, it is very meaningful to establish a pollutant concentration prediction model to calibrate the self-built point data.

Introduction to air quality prediction model

At present, many researchers have studied air quality prediction models. The main research methods are divided into two categories: chemical mechanism prediction and statistical model prediction. The chemical mechanism prediction is to quantitatively describe the changes of atmospheric pollutants in a certain area by using the numerical method of atmospheric dynamics and comprehensively considering the atmospheric physical and chemical mechanism^9,10. Chemical mechanism prediction has the advantages of multi-scale and openness, but the main disadvantage is that the uncertainty of pollutant emission sources is large, the calculation time is long, and the prediction accuracy is not high. Statistical model forecasting first uses statistical methods to screen out meteorological factors that are strongly correlated with air pollution concentrations, and then uses statistical models to establish quantitative relationships between them and air pollution concentrations. Statistical model forecasting has the advantages of simplicity and economy, good forecast timeliness and accuracy, so it is widely used in air quality forecasting.

Traditional statistical forecasting models include Multiple Linear Regression (MLR) model^11,12,13, grey models¹⁴, hidden Markov models^15,16, time series models^17,18 and so on. These traditional models are simple in structure, strong in interpretability and short in operation time, and are often used in air quality forecasting in recent years. Suriano et al. designed and developed the SentinAir system for field evaluation of sensor performance. In order to evaluate the system function and capability, indoor and outdoor experiments were performed independently. Linear regression (LR) and multiple linear regression models were used to calibrate the ten sensor data. The results show that the calibration effect of the MLR model is better than that of the LR model because it allows the quantification of the interfering effects of temperature, relative humidity and other gases¹⁹. However, the factors affecting air quality are complex, and it is difficult for these models to accurately reflect the nonlinear relationship between various factors and air quality. With the rise of big data and artificial intelligence, artificial neural networks^20,21,22 have also been used to predict air quality. Arsic et al. used multiple regression analysis and artificial neural network to predict ground-level ozone concentrations in the close vicinity of the city of Zrenjanin (Serbia). The comparison results show that the artificial neural network has a better effect in monitoring the ozone concentration than the multiple linear regression model²³.

Although the prediction effect of artificial neural network is good, neural network usually requires more data than traditional machine learning algorithms, and the output results are difficult to interpret. Random forest algorithm^24,25,26 is also commonly used to predict air quality in recent years, but random forest is prone to overfitting in some noisy regression or classification problems. Support Vector Machine (SVM) can cleverly solve small sample, high-dimensional, nonlinear problems, and it follows the principle of structural risk minimization. Suarez Sanchez et al. used 2006–2008 experimental data on air pollutants to create a highly nonlinear model of the air quality in the Aviles urban nucleus (Spain) based on SVM techniques²⁷. Liu et al. successfully predicted the concentration of air pollutants in Nanjing with the help of support vector regression machine, and calibrated the measurement data of the miniature air quality detector²⁸.

However, for the air quality prediction problem, the support vector machine model also has certain shortcomings. First, as the dimension of the training samples increases, the model prediction time grows, which seriously restricts the timeliness of the model. Second, support vector regression has many parameters to tune^29,30. In addition to the kernel function parameters, the penalty factor C and the radius $\varepsilon$ of the insensitive loss region have a considerable impact on the accuracy of the model, which makes it difficult to establish a high-precision air quality prediction model. To address these issues, the relevance vector machine (RVM), a sparse probabilistic learning model based on a Bayesian framework, is introduced in this paper to predict air quality. The relevance vector machine uses automatic relevance determination to make the model sparse, which greatly reduces the amount of calculation and keeps the prediction time under control. In addition, some model parameters reach their optimal values through adaptive iteration, so few parameters need to be adjusted by hand, which is convenient for model optimization.

The main work of this paper is to identify the factors affecting the concentration of six types of air pollutants through correlation analysis, and then to use Principal Component Analysis (PCA) to extract the main information in these influencing factors. This information is then used as input, the concentration of pollutants in the air is used as output, and the air quality prediction model is established with the relevance vector machine. Finally, the prediction residuals are corrected by a Nonlinear Autoregressive (NAR) neural network to further improve the prediction accuracy of the model. We call this the PCA–RVM–NAR combined model. In practical applications, this model has achieved good results in air quality prediction, and it can provide a reference model for air quality prediction and data calibration of miniature air quality detectors.

Material and methods

Data source and preprocessing

The emergence and development of miniature air quality detectors provide the possibility for grid and real-time monitoring of air quality. However, its measurement is affected by many factors, so the measurement data will have errors. This paper uses a statistical model to calibrate it. A total of two sets of data (http://www.mcm.edu.cn/html_cn/node/b0ae8510b9ec0cc0deb2266d2de19ecb.html) are used in this study to establish the calibration model of the miniature air quality detectors. The first set of data comes from an air quality monitoring station in Nanjing, which contains 4200 sets of data and is considered accurate in this paper. It recorded the hourly concentration of two dusts and four gases from November 14, 2018 to June 11, 2019. The second set of data is provided by a miniature air quality detector juxtaposed with the air quality monitoring station. It contains 234,717 sets of data, and the interval between each set of data is no more than five minutes. The second set of data includes not only the concentration of two dusts and four gases, but also five meteorological parameters such as temperature, humidity, wind speed, air pressure, and precipitation.

Data preprocessing is the first step in establishing the data correction model of the miniature air quality detector. In this paper, a value that is more than 3 times the mean of its left and right nearest neighbors is regarded as an outlier and eliminated. The measured values of the self-built point are then averaged within each hour so that they can be compared with the data of the national control point, and records for which the self-built point and the national control point cannot be matched are deleted. After data preprocessing, a total of 4135 sets of corresponding data are obtained; Table 1 summarizes them.
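The two preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Matlab code: the function names, the toy readings and the decision to keep the first and last reading (which have only one neighbor) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def drop_neighbor_outliers(records, factor=3.0):
    """Drop interior readings larger than `factor` times the mean of their two neighbors.

    `records` is a time-ordered list of (timestamp, value) pairs. The first and
    last readings are kept unchanged (an assumption; the text does not say).
    """
    kept = []
    for i, (ts, value) in enumerate(records):
        if 0 < i < len(records) - 1:
            neighbor_mean = (records[i - 1][1] + records[i + 1][1]) / 2.0
            if value > factor * neighbor_mean:
                continue  # treated as an outlier under the stated rule
        kept.append((ts, value))
    return kept


def hourly_means(records):
    """Average all readings that fall within the same clock hour."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Toy readings (made-up values, roughly five minutes apart).
readings = [
    (datetime(2018, 11, 14, 10, 0), 20.0),
    (datetime(2018, 11, 14, 10, 5), 22.0),
    (datetime(2018, 11, 14, 10, 10), 200.0),  # spike, removed by the rule
    (datetime(2018, 11, 14, 10, 15), 21.0),
    (datetime(2018, 11, 14, 11, 0), 30.0),
    (datetime(2018, 11, 14, 11, 5), 32.0),
]
cleaned = drop_neighbor_outliers(readings)
hourly = hourly_means(cleaned)
```

The hourly values can then be joined with the national control point records on the hour key, dropping hours that appear in only one of the two sources.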

Table 1 Descriptive statistics of pollutant concentrations and meteorological parameters measured by national control point and self-built point after pretreatment.

Data exploratory analysis

Exploratory analysis of data can give us a deeper understanding of the interrelationships between variables. In order to more intuitively reflect the relationship between the national control point measurement data and the self-built point measurement data, we average the measurement data on a daily basis and conduct visual analysis^6,9. It can be seen from Fig. 1 that no matter what kind of pollutants it is, the general trend of the self-built point data and the national control point data is the same, but there are also certain errors. The difference between PM_2.5 and PM₁₀ is relatively small, indicating that the miniature air quality detector has high accuracy in measuring the concentrations of these two types of pollutants. The errors of NO₂ and O₃ in the previous period are relatively large, and the errors in the latter period are relatively small. It may be that the climate has a great influence on the concentration of these two pollutants measured by the miniature air quality detector. The measurement errors of CO and SO₂ are large, indicating that it is difficult to monitor the concentrations of these two pollutants with a miniature air quality detector.

Figure 2 is a line chart of the changes of five meteorological parameters with time. It can be seen that there is abundant rain in this area, the daily average temperature is relatively mild, the daily average air pressure is stable between 1000 and 1050 hPa, and the air humidity and wind speed vary more noticeably. Further discussion and analysis are needed to find the relationship between meteorological parameters and the concentrations of the six types of pollutants.

The measurement error of the miniature air quality detector may have a certain relationship with the meteorological parameters, and there are obvious differences in the meteorological parameters in different seasons. We have drawn a boxplot³¹ of the six categories of pollutants by season as shown in Fig. 3. It can be seen that the concentrations of PM_2.5, PM₁₀, CO, and SO₂ are the highest in autumn and winter. The main reason is that the temperature in autumn and winter is lower, and it is difficult for the lower air and upper air to generate convection, resulting in slower diffusion of pollutants. The reason for the high concentration of O₃ in summer is the strong solar radiation and high temperature in summer, which is easy to cause photochemical smog and secondary ozone production. The slightly higher NO₂ concentration in spring may be related to thunderstorms. In addition, the errors between the measured and actual values of the six types of pollutants have obvious differences in the four seasons, indicating that meteorological parameters will affect the measurement of the miniature air quality detector.

Correlation analysis

The concentration of pollutants in the air is an important criterion for evaluating air quality. Different geographical environments have different influence factors on the concentration of air pollutants. The Pearson correlation coefficient is used in this paper to screen the main factors affecting air quality^25,32. Equation (1) is its expression, where $x_{i}$ is the value of the first variable, $y_{i}$ is the value of the second variable, $\overline{x}$ is the mean of $x$, $\overline{y}$ is the mean of $y$, and $n$ represents the number of samples. The value range of the Pearson correlation coefficient is [− 1, 1], and the larger its absolute value, the stronger the correlation between the two variables.

It can be seen from Table 2 that, at the 0.05 significance level, all pairs of variables except NO₂ concentration and temperature are significantly correlated. The positive correlation between PM_2.5 concentration and PM₁₀ concentration is the strongest, with a correlation coefficient of 0.89, indicating that they follow the same trend. The negative correlation between temperature and air pressure is the strongest, with a correlation coefficient of −0.85, indicating that they move in opposite directions.

$$ r = \frac{{\mathop \sum \nolimits_{i = 1}^{n} (x_{i} - \overline{x})(y_{i} - \overline{y})}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} (x_{i} - \overline{x})^{2} } \cdot \sqrt {\mathop \sum \nolimits_{i = 1}^{n} (y_{i} - \overline{y})^{2} } }} $$

(1)
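As a small illustration of Eq. (1), the coefficient can be computed directly from its definition. The sketch below uses plain Python; the function name and the toy vectors are hypothetical and are not taken from the paper's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient, following Eq. (1) term by term."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations.
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / (math.sqrt(s_xx) * math.sqrt(s_yy))


# Toy vectors (made-up): a perfectly increasing and a perfectly decreasing pair.
r_up = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Applying such a function to every pair of columns gives a correlation matrix of the kind reported in Table 2; the significance test is a separate step and is not shown here.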

Table 2 Pearson linear correlation coefficient between the concentrations of six types of air pollutants measured at national control point and five meteorological parameters measured at self-built point (* indicates significant correlation at a significance level of 0.05).

Establishment of sensor calibration model

Introduction to basic principles

The relevance vector machine is a sparse probabilistic model similar to the support vector machine, proposed by Tipping in 2000. It is a supervised learning method that combines the Markov property, Bayes' theorem and maximum likelihood estimation. Owing to the high sparsity of the algorithm and its probabilistic learning structure, RVM can achieve high prediction accuracy. In addition, compared with the support vector machine, it greatly reduces the number of kernel functions involved in the prediction calculation and therefore the prediction time. RVM also offers probabilistic prediction, automatic parameter setting and a free choice of kernel function^33,34,35.

$$ t_{n} = y(x_{n} ;\;\omega ) + \varepsilon_{n} = \mathop \sum \limits_{i = 1}^{N} \omega_{i} k(x_{n} ,\;x_{i} ) + \omega_{0} + \varepsilon_{n} $$

(2)

$$ p\left( {\left. {t_{n} } \right|x_{n} } \right) = N\left( {\left. {t_{n} } \right|y(x_{n} ),\;\sigma^{2} } \right) $$

(3)

$$ p\left( {\left. t \right|\omega ,\;\sigma^{2} } \right) = \left( {2\pi \sigma^{2} } \right)^{{ - \frac{N}{2}}} \exp \left\{ { - \frac{1}{{2\sigma^{2} }}\left\| {t - \Phi \omega } \right\|^{2} } \right\} $$

(4)

$$ p\left( {\left. \omega \right|\alpha } \right) = \mathop \prod \limits_{n = 0}^{N} N\left( {\left. {\omega_{n} } \right|0,\;\alpha_{n}^{ - 1} } \right) $$

(5)

$$ p\left( {\left. {t_{*} } \right|t} \right) = \int p\left( {\left. {t_{*} } \right|\omega ,\;\alpha ,\;\sigma^{2} } \right)p\left( {\left. {\omega ,\;\alpha ,\;\sigma^{2} } \right|t} \right)\,d\omega \,d\alpha \,d\sigma^{2} $$

(6)

The relevance vector machine is not constrained by the Mercer condition when selecting the kernel function, supports binary classification and probabilistic output, and runs quickly. Let the training samples be $\{ (x_{n} ,\;t_{n} ) \mid n = 1,\;2, \ldots ,\;N\}$, where $x_{n}$ is the input value, $t_{n}$ is the output value and $N$ is the number of samples. Equation (2) is the regression model, where $k( \cdot ,\; \cdot )$ is the kernel function, $\omega = \left\{ {\omega_{n} } \right\}_{n = 0}^{N}$ is the weight vector (with $\omega_{0}$ the bias weight that multiplies the leading 1 in $\varphi$), and $\varepsilon_{n}$ is Gaussian noise, $\varepsilon_{n} \sim N(0,\;\sigma^{2} )$, whose variance $\sigma^{2}$ is unknown. This gives the Gaussian distribution in Eq. (3), where $t_{n}$ depends on $y(x_{n} )$ and $\sigma^{2}$, and the $t_{n}$ are mutually independent. Equation (4) is the likelihood function of the training set, where $t = \{ t_{1} ,\;t_{2} , \ldots ,\;t_{N} \}^{T}$, $\omega = \{ \omega_{0} ,\;\omega_{1} , \ldots ,\;\omega_{N} \}^{T}$, and $\Phi = \left[ {\varphi \left( {x_{1} } \right),\;\varphi \left( {x_{2} } \right), \ldots ,\;\varphi \left( {x_{N} } \right)} \right]^{T}$ is an $N \times \left( {N + 1} \right)$ matrix whose $n$-th row is the transpose of $\varphi (x_{n} ) = [1,\;k(x_{n} ,\;x_{1} ), \;k(x_{n} ,\;x_{2} ), \ldots ,\;k(x_{n} ,\;x_{N} )]^{T}$. The hyperparameter $\alpha = \{ \alpha_{0} ,\;\alpha_{1} , \ldots ,\;\alpha_{N} \}^{T}$ is introduced to solve for $\omega$ and $\sigma^{2}$ in Eq. (4): each $\omega_{n}$ follows a zero-mean Gaussian prior, given in Eq. (5). Equation (6) is the predictive distribution of the output value $t_{*}$ for a new input value $x_{*}$. Combining the Bayesian and Markov properties with Eq. (6) and simplifying gives Eq. (7), the posterior distribution of the weights, where Eqs. (8) and (9) give the posterior covariance ${\Sigma }$ and mean $\mu$ of the weights and Eq. (10) defines the diagonal matrix $A$.

$$ p\left( {\left. \omega \right|t,\;\alpha ,\;\sigma^{2} } \right) = (2\pi )^{{ - \frac{N + 1}{2}}} \left| {\Sigma } \right|^{{ - \frac{1}{2}}} \exp \left\{ { - \frac{1}{2}(\omega - \mu )^{T} {\Sigma }^{ - 1} (\omega - \mu )} \right\} $$

(7)

$$ {\Sigma } = (\sigma^{ - 2} \Phi^{T} \Phi + A)^{ - 1} $$

(8)

$$ \mu = \sigma^{ - 2} {\Sigma }\Phi^{T} t $$

(9)

$$ A = diag\left( {\alpha_{0} ,\;\alpha_{1} , \ldots ,\;\alpha_{N} } \right) $$

(10)

Equation (11) is obtained by simplifying the marginal likelihood function. Taking the partial derivatives of Eq. (11) with respect to $\alpha$ and $\sigma^{2}$ and setting them to 0 yields two equations, which simplify to Eqs. (12)–(13), where $\gamma_{n} = 1 - \alpha_{n} {\Sigma }_{nn}$ and ${\Sigma }_{nn}$ is the $n$-th diagonal element of ${\Sigma }$. $\alpha$ and $\sigma^{2}$ are updated iteratively with Eqs. (12)–(13), while the weight posterior mean $\mu$ and the covariance matrix ${\Sigma }$ are recomputed at each step, until the convergence condition is met or the maximum number of iterations is reached. The iteration yields the optimal values $\alpha_{MP}$ and $\sigma_{MP}^{2}$. Most of the weights approach 0 and their basis functions are discarded, which reflects the sparsity of the RVM model; the remaining weights settle at nonzero values, and the training samples associated with their basis functions are called relevance vectors. For a sample $x_{*}$ to be predicted, the predictive distribution in Eq. (6) then gives the predicted mean $y_{*}$ and the predictive variance $\sigma_{*}^{2}$ (Eqs. (14)–(15)), where $y_{*}$ is the mean of the output value $t_{*}$.

$$ p(\left. t \right|\alpha ,\;\sigma^{2} ) = \int p(\left. t \right|\omega ,\;\sigma^{2} )p(\left. \omega \right|\alpha )d\omega = (2\pi )^{{ - \frac{N}{2}}} \left| {\sigma^{2} I + \Phi A^{ - 1} \Phi^{T} } \right|^{{ - \frac{1}{2}}} \exp \left[ { - \frac{1}{2}t^{T} (\sigma^{2} I + \Phi A^{ - 1} \Phi^{T} )^{ - 1} t} \right] $$

(11)

$$ \alpha_{n}^{new} = \frac{{\gamma_{n} }}{{\mu_{n}^{2} }} $$

(12)

$$ (\sigma^{2} )^{new} = \frac{{\left\| {t - \Phi \mu } \right\|^{2} }}{{N - \mathop \sum \nolimits_{n = 0}^{N} \gamma_{n} }} $$

(13)

$$ y_{*} = \mu^{T} \varphi (x_{*} ) $$

(14)

$$ \sigma_{*}^{2} = \sigma_{MP}^{2} + \varphi (x_{*} )^{T} {\Sigma }\varphi (x_{*} ) $$

(15)
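The iterative procedure around Eqs. (8)–(15) can be illustrated with a compact NumPy sketch. It is not the authors' Matlab implementation. Several choices are assumptions made only for illustration: the Gaussian kernel is written as exp(−‖x − x'‖² / (2·width²)), all α start at 1, the noise variance starts at 10% of the target variance, and a basis function is pruned once its α exceeds a cap of 10⁹ or becomes numerically invalid. The toy data are synthetic.

```python
import numpy as np


def design_matrix(X, centers, width):
    """Rows are phi(x) = [1, k(x, c_1), ..., k(x, c_N)] with a Gaussian kernel."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / (2.0 * width ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), K])


def fit_rvm(X, t, width=1.5, n_iter=300, alpha_cap=1e9):
    """Iterate the posterior (Eqs. 8-9) and the updates (Eqs. 12-13) with pruning."""
    N = X.shape[0]
    Phi_full = design_matrix(X, X, width)        # N x (N + 1)
    active = np.arange(N + 1)                    # indices of surviving basis functions
    alpha = np.ones(N + 1)                       # assumed initial hyperparameters
    sigma2 = 0.1 * float(np.var(t)) + 1e-6       # assumed initial noise variance

    def posterior(Phi, alpha, sigma2):
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))  # Eq. (8)
        mu = Sigma @ Phi.T @ t / sigma2                               # Eq. (9)
        return Sigma, mu

    for _ in range(n_iter):
        Phi = Phi_full[:, active]
        Sigma, mu = posterior(Phi, alpha, sigma2)
        gamma = 1.0 - alpha * np.diag(Sigma)
        with np.errstate(divide="ignore", invalid="ignore"):
            new_alpha = gamma / mu ** 2                               # Eq. (12)
        residual = float(np.sum((t - Phi @ mu) ** 2))
        sigma2 = max(residual / max(N - gamma.sum(), 1e-9), 1e-10)    # Eq. (13)
        # Prune basis functions whose alpha has diverged or is numerically invalid.
        keep = np.isfinite(new_alpha) & (new_alpha > 0) & (new_alpha < alpha_cap)
        if not keep.any():
            break
        active, alpha = active[keep], new_alpha[keep]

    Sigma, mu = posterior(Phi_full[:, active], alpha, sigma2)
    return {"centers": X, "width": width, "active": active,
            "mu": mu, "Sigma": Sigma, "sigma2": sigma2}


def predict_rvm(model, X_new):
    """Predictive mean (Eq. 14) and variance (Eq. 15) for new inputs."""
    Phi_new = design_matrix(X_new, model["centers"], model["width"])
    Phi_new = Phi_new[:, model["active"]]
    mean = Phi_new @ model["mu"]
    var = model["sigma2"] + np.einsum("ij,jk,ik->i", Phi_new, model["Sigma"], Phi_new)
    return mean, var


# Synthetic one-dimensional toy problem (not the paper's data).
X = np.linspace(-5.0, 5.0, 40).reshape(-1, 1)
t = np.sin(X[:, 0]) + 0.05 * np.cos(7.0 * X[:, 0])
model = fit_rvm(X, t)
mean, var = predict_rvm(model, X)
```

The surviving entries of `model["active"]` play the role of the relevance vectors (index 0 is the bias term). The predictive variance returned alongside the mean is what distinguishes the RVM output from a point prediction.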

PCA–RVM model construction

Air quality is affected by a variety of factors, and the relationships among the influencing factors are intricate. The choice of input variables has a strong effect on the accuracy of prediction. According to the previous correlation analysis, the pollutant concentrations measured by the miniature air quality detector and the five meteorological parameters are significantly related to the air quality, so they all have a certain impact on it. In addition, since the input variables also affect each other, feeding all variables directly into the relevance vector machine would pass redundant information to the model, which not only lengthens the training time but also degrades the generalization ability of the model.

Principal component analysis is a method of data dimensionality reduction and denoising. It converts a series of components that are originally related in the system into several uncorrelated components through orthogonal transformation, and this group of components after conversion is called the principal component. Then, according to the contribution of each component to the data system, the principal components are recombined to highlight the hidden features in the original data to construct a mapping matrix, and then the original data is transformed by the mapping matrix to achieve the purpose of denoising²⁸. The process of principal component analysis is generally as follows: (i) Standardize the original data; (ii) Calculate the correlation coefficient matrix R; (iii) Calculate the eigenvalues and eigenvectors; (iv) Select $p$ $(p \le m)$ principal components and calculate the comprehensive evaluation value. In this paper, the principle of extracting the number of principal components is that the cumulative contribution rate exceeds 99%.
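Steps (i)–(iv) and the 99% cumulative contribution rule can be sketched with NumPy as follows. The function name and the synthetic data are hypothetical, and the sketch stops at computing the component scores, which are the quantities fed to the next model.

```python
import numpy as np


def pca_by_cumulative_contribution(X, threshold=0.99):
    """Return component scores, contribution rates and the number of components kept."""
    # (i) Standardize each column to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # (ii) Correlation coefficient matrix R of the variables.
    R = np.corrcoef(Z, rowvar=False)
    # (iii) Eigenvalues and eigenvectors of the symmetric matrix R, largest first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # (iv) Keep the smallest p whose cumulative contribution rate exceeds the threshold.
    contribution = eigvals / eigvals.sum()
    cumulative = np.cumsum(contribution)
    p = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    p = min(p, X.shape[1])
    scores = Z @ eigvecs[:, :p]
    return scores, contribution, p


# Synthetic example: two nearly collinear columns plus one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = np.column_stack([a, 2.0 * a + 1e-3 * rng.normal(size=200), rng.normal(size=200)])
scores, contribution, p = pca_by_cumulative_contribution(X_demo)
```

Because the eigenvectors of R are orthogonal, the resulting score columns are uncorrelated, which is the property that removes redundant information before the regression step.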

Figure 4 shows the principal component contribution rate and the principal component cumulative contribution rate after dimension reduction by principal component analysis. It can be seen that the contribution rate of the first principal component reaches 29.2%, and the contribution rate of the second, third and fourth principal components also exceeds 10% respectively, and the cumulative contribution rate of the first four principal components exceeds 70%. In addition, it can be seen from the broken line in the figure that the cumulative contribution rate of the first 9 principal components has exceeded 99%, which is in line with the principle of the number of extracted principal components. It shows that PCA is effective for dimensionality reduction of air quality data, and can provide more reliable input for subsequent prediction.

After the principal component dimension reduction is performed on the original data, the first 9 principal components after dimension reduction are used as input independent variables, and the predicted values of six types of pollutant concentrations are used as output variables, and the relevance vector machine is used to build the air quality prediction model. This combined model is called the PCA–RVM model in this paper. Since the construction process of the six types of pollutant prediction models is similar, we take PM_2.5 concentration as an example, and other pollutant concentration prediction models can be obtained similarly.

We randomly divided the 4135 groups of data into a training set of 3000 groups and a test set of the remaining 1135 groups, and used Matlab 2016a for modeling. According to the RVM regression principle, the hyperparameter $\alpha$ and the noise variance $\sigma^{2}$ are not sensitive to their initial values, and their optimal values are obtained by adaptive iteration. The relevance vector machine uses the Gaussian kernel function, because it yields a very smooth estimate^36,37. The kernel width $\gamma$ is obtained by grid optimization over the interval [0.5, 10] with a step size of 0.5. Equation (16) is the expression of Root Mean Square Error (RMSE), where $y_{i}$ represents the target value and $w_{i}$ represents the model predicted value. In this paper, the RMSE between the target values of the training set and the model predicted values is used as the objective function for optimization. During training, for each parameter value we train the model 10 times and average the outputs of the 10 runs as the final output of the model. The empirical analysis gives γ = 1.5 as the optimal value, and the PCA–RVM air quality prediction model is established with this value.

$$ RMSE = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (y_{i} - w_{i} )^{2} } { } $$

(16)
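Equation (16) and the grid over the kernel width can be written down in a few lines. In the sketch below, `rmse` follows Eq. (16) directly, while `width_grid` and `grid_search` are hypothetical helper names, and `toy_objective` is only a stand-in for "train the model with this width and return the training RMSE"; it is not the paper's objective and its minimum carries no meaning.

```python
import math


def rmse(y, w):
    """Root mean square error between target values y and predictions w, as in Eq. (16)."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, w)) / n)


def width_grid(start=0.5, stop=10.0, step=0.5):
    """Candidate kernel widths on [start, stop] with the given step size."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def grid_search(objective, candidates):
    """Return the candidate with the smallest objective value."""
    return min(candidates, key=objective)


def toy_objective(width):
    # Placeholder for: fit the model with this width, then return rmse(targets, predictions).
    return (width - 3.0) ** 2


candidates = width_grid()
best_width = grid_search(toy_objective, candidates)
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In a real run, the objective would average the predictions of several training runs before computing the RMSE, as the text describes.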

PCA–RVM–NAR model construction

The PCA–RVM model can be used to calibrate the miniature air quality detector data. It can be seen from Fig. 5 that the residual of the PM_2.5's PCA–RVM model is greatly improved compared to the residual of the self-built point, whether it is the training set or the test set. In the training set, the residual of the model is concentrated in [− 10, 10], and the absolute value of the maximum residual is 32.06 μg/m³, while the residual of the self-built point is concentrated in [− 40, 20], and the absolute value of the maximum residual is 110.44 μg/m³. In the test set, the residual of the model is concentrated in [− 20, 20], and the absolute value of the maximum residual is 67.2 μg/m³, while the residual of the self-built point is concentrated in [− 50, 25], and the absolute value of the maximum residual is 90 μg/m³. The PCA–RVM model performs well in both the training set and the test set, indicating that the generalization ability of the model is good.

Although the PM_2.5 concentration prediction effect of the PCA–RVM model is good, a set of time series residual data is obtained, and some residuals in the model are still high. Autoregressive integrated moving average model and NAR neural network model are commonly used to deal with time series data. This paper uses a NAR neural network to further mine the residual information.

The NAR neural network is a dynamic neural network and can be expressed by Eq. (17), where $y(t)$ is the output value at the current moment, $y(t - 1),\;y(t - 2), \ldots ,\;y(t - d)$ are the output values at historical moments, and $d$ is the delay order. The NAR neural network consists of an input layer, a hidden layer and an output layer³⁸. The number of neurons in the hidden layer and the input delay order are also selected by grid optimization over [5, 15] × [1, 5]. The network is trained with the default Levenberg–Marquardt (LM) algorithm of the Neural Net Time Series tool in Matlab. The core idea of the LM algorithm is to approximate the Hessian matrix with a product of Jacobian matrices, which avoids computing second derivatives and makes training more efficient. RMSE is again chosen as the objective function, and the final output is again obtained by averaging 10 training runs. The optimization finds 9 hidden-layer neurons and a delay order of 3 to be optimal. The structure of the NAR neural network is shown in Fig. 6, where ${\text{w}}$ denotes the weights of the neural network model and ${\text{b}}$ denotes its thresholds (biases). The PCA–RVM–NAR air quality prediction model has now been constructed.

$$ y(t) = f(y(t - 1),\;y(t - 2), \ldots ,\;y(t - d)) $$

(17)
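To make Eq. (17) concrete, the sketch below builds the lagged inputs $y(t-1), \ldots, y(t-d)$ from a residual series and shows how a residual forecast would be added back to a first-stage prediction. Two points are assumptions: a linear least-squares fit stands in for the function $f$ (the paper uses a neural network trained with the LM algorithm), and the residual is taken as observed minus predicted, so the correction is additive. All names and numbers are hypothetical.

```python
import numpy as np


def lagged_matrix(series, d):
    """Build inputs [y(t-1), ..., y(t-d)] and targets y(t) for delay order d."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.column_stack([series[d - k: n - k] for k in range(1, d + 1)])
    y = series[d:]
    return X, y


def fit_linear_ar(series, d):
    """Least-squares stand-in for f in Eq. (17): intercept plus d lag coefficients."""
    X, y = lagged_matrix(series, d)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def forecast_next(series, coef):
    """One-step-ahead forecast from the most recent d values of the series."""
    d = len(coef) - 1
    lags = np.asarray(series, dtype=float)[-1: -d - 1: -1]  # y(t-1), ..., y(t-d)
    return float(coef[0] + lags @ coef[1:])


# Made-up residuals of a first-stage model (observed minus predicted).
residuals = [1.2, -0.4, 0.8, 0.3, -0.9, 0.5, 0.1, -0.2, 0.6, -0.3]
coef = fit_linear_ar(residuals, d=3)
next_residual = forecast_next(residuals, coef)

# Hypothetical first-stage prediction for the next time step.
first_stage_prediction = 35.0
corrected_prediction = first_stage_prediction + next_residual
```

Replacing `fit_linear_ar` with a small feed-forward network on the same lagged inputs gives the nonlinear version; the lag construction and the additive correction stay the same.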

Figure 7 shows the measured value of PM_2.5 concentration at the national control point and the predicted value of PCA–RVM–NAR combined model. It can be seen that the change trend of the two is consistent, and the correlation coefficient between the measured value of the national control point and the predicted value of the PCA–RVM–NAR model is greater than 0.95 in both the training set and the test set. Both models in the training set and the test set passed the significance test at the significance level of 0.01. The regression coefficients in the two regression models are also close to 1, indicating that the PCA–RVM–NAR model is more accurate in PM_2.5 concentration prediction.

Residual analysis is also a necessary step in statistical modeling^12,15. It can be seen from the residual analysis diagram in Fig. 8 that most of the residuals of the PCA–RVM–NAR model are concentrated in [−10, 10] and are evenly distributed around zero. The absolute values of the residuals at the 172nd and 1481st sample points are larger than 50 μg/m³. We checked the corresponding data and found that the PM_2.5 concentration measured at the national control point changed sharply at these moments, indicating that the residual of the model increases when the pollutant concentration changes rapidly. In order to better display the residual characteristics of the model, these two points are deleted before the residual histogram is drawn. The histogram shows that the residuals are roughly normally distributed. A total of 3981 sets of residuals lie in [−10, 10], exceeding 96.2% of the total, and only 27 sets of residuals have an absolute value exceeding 20, which does not exceed 0.5% of the total. In addition, the relative residual is within 20% for 91.3% of the data and within 10% for 73.3% of the data.
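The residual shares quoted above can be computed as in the following minimal Python sketch. It assumes that "within 20%" refers to the residual relative to the measured value, which the text does not state explicitly. The function names and the five sample values are hypothetical and are not the paper's data.

```python
def share_within(residuals, bound):
    """Fraction of residuals whose absolute value is at most `bound`."""
    return sum(1 for r in residuals if abs(r) <= bound) / len(residuals)


def share_relative_within(measured, predicted, fraction):
    """Fraction of points whose residual is at most `fraction` of the measured value."""
    hits = sum(
        1 for y, w in zip(measured, predicted) if abs(y - w) <= fraction * abs(y)
    )
    return hits / len(measured)


# Hypothetical PM2.5 values in ug/m3 (illustration only).
measured = [35.0, 50.0, 80.0, 20.0, 120.0]
predicted = [33.0, 56.0, 79.0, 26.0, 95.0]
residuals = [y - w for y, w in zip(measured, predicted)]

print(share_within(residuals, 10))                       # share in [-10, 10]
print(share_relative_within(measured, predicted, 0.20))  # share within 20%
print(share_relative_within(measured, predicted, 0.10))  # share within 10%
```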

Discussion

The PCA–RVM–NAR combination model can calibrate the PM_2.5 concentration measured by the miniature air quality detector, and it achieves good results. The Multiple Linear Regression model (MLR), the Support Vector Regression machine (SVR), Multilayer Perceptron neural networks (MLP) and Nonlinear Autoregressive models with Exogenous Inputs (NARX) can also calibrate the PM_2.5 concentration measured by the miniature air quality detector^39,40,41. A Taylor diagram is used in this paper to compare the calibration effects of these models visually.

A Taylor diagram is a chart that simultaneously displays three indicators: the correlation coefficient, the standard deviation and the centered root mean square difference. The scatter points in the Taylor diagram represent the different models, the radial lines represent the correlation coefficient, the horizontal and vertical axes represent the standard deviation, and the dashed lines represent the centered root mean square difference. Their expressions are given by Eq. (1) and Eqs. (18)–(19), where $y_{i}$ represents the true value, $w_{i}$ represents the model predicted value, $\overline{y}$ represents the mean value of $y$, and $\overline{w}$ represents the mean value of $w$. The Taylor diagram therefore allows the models to be compared on several indicators at once. It can be seen from Fig. 9 that the self-built point is the farthest from the observation point (national control point), indicating that the PM_2.5 measurement accuracy of the self-built point is the lowest and that its measurements need to be calibrated. The multiple linear regression model, the multilayer perceptron neural network and the NARX neural network can all calibrate the PM_2.5 measurements of the self-built point, but their calibration accuracy needs to be improved. The support vector regression machine and the PCA–RVM model calibrate better, but overall the PCA–RVM–NAR combined model proposed in this paper performs best in calibrating the PM_2.5 measurements.

$$ \sigma = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} (w_{i} - \overline{w})^{2} } $$

(18)

$$ E^{\prime} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} [(y_{i} - \overline{y}) - (w_{i} - \overline{w})]^{2} } $$

(19)
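As an illustration of how the three Taylor diagram statistics relate, the following minimal Python sketch computes them for a pair of series. It assumes that Eq. (1) is the Pearson correlation coefficient $R$; the function name `taylor_stats` and the sample values are hypothetical. With the $1/n$ normalization of Eqs. (18)–(19), the statistics satisfy $E^{\prime 2} = \sigma_{y}^{2} + \sigma_{w}^{2} - 2\sigma_{y}\sigma_{w}R$, which is what allows all three to be read from a single point on the diagram.

```python
import math


def taylor_stats(y, w):
    """Return (sigma_y, sigma_w, R, E') using the 1/n normalization of Eqs. (18)-(19)."""
    n = len(y)
    y_bar = sum(y) / n
    w_bar = sum(w) / n
    sigma_y = math.sqrt(sum((a - y_bar) ** 2 for a in y) / n)
    sigma_w = math.sqrt(sum((b - w_bar) ** 2 for b in w) / n)
    # Pearson correlation coefficient (assumed to be the paper's Eq. (1)).
    cov = sum((a - y_bar) * (b - w_bar) for a, b in zip(y, w)) / n
    r = cov / (sigma_y * sigma_w)
    # Centered root mean square difference, Eq. (19).
    e_prime = math.sqrt(
        sum(((a - y_bar) - (b - w_bar)) ** 2 for a, b in zip(y, w)) / n
    )
    return sigma_y, sigma_w, r, e_prime


# Hypothetical reference and model values (illustration only).
reference = [30.0, 42.0, 55.0, 38.0, 61.0, 47.0]
model = [33.0, 40.0, 50.0, 41.0, 66.0, 45.0]

sigma_y, sigma_w, r, e_prime = taylor_stats(reference, model)
identity_rhs = sigma_y ** 2 + sigma_w ** 2 - 2 * sigma_y * sigma_w * r
print(sigma_y, sigma_w, r, e_prime)
print(math.isclose(e_prime ** 2, identity_rhs))
```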

In order to comprehensively compare the accuracy of the PCA–RVM–NAR model with other commonly used air quality prediction models, four commonly used indicators are used in this paper^32,39: Root Mean Square Error (RMSE), goodness of fit (R²), Mean Absolute Error (MAE) and relative Mean Absolute Percent Error (MAPE). Their expressions are given by Eq. (16) and Eqs. (20)–(22), where $y_{i}$ represents the measured values of the six types of pollutants at the national control point, and $w_{i}$ represents the values predicted by the various prediction models. The comparison of each indicator for the two dusts and four gases is shown in Tables 3, 4, 5 and 6. The error of the self-built point is the largest not only for the PM_2.5 concentration but also for the other pollutants. It should be noted that the R² of some pollutants is negative, which is caused by the large measurement error of the self-built point. R² is therefore excluded when calculating how much a model improves the measurement accuracy. The support vector regression machine is clearly better than the MLR, MLP and NARX models on every evaluation indicator, which shows that the SVR is more suitable for calibrating the monitoring data of the miniature air quality detector. The relevance vector machine in turn performs better than the support vector regression machine on every evaluation indicator, and the PCA–RVM–NAR model proposed in this paper performs best on all four indicators for all six pollutants. The smallest improvement that the PCA–RVM–NAR model brings to the measurement accuracy of the miniature air quality detector is in the RMSE of PM_2.5, which falls from 22.436 at the self-built point to 4.97 with the PCA–RVM–NAR model, an improvement of 77.8%. The largest improvement is in the MAPE of NO₂, which falls from 2.129 at the self-built point to 0.13 with the PCA–RVM–NAR model, an improvement of 93.9%.

$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} (y_{i} - w_{i} )^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} (y_{i} - \overline{y})^{2} }} $$

(20)

$$ MAE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {y_{i} - w_{i} } \right| $$

(21)

$$ MAPE = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{y_{i} - w_{i} }}{{y_{i} }}} \right| $$

(22)
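The four indicators and the improvement percentages can be sketched in Python as follows. This is a minimal illustration that assumes Eq. (16) is the usual RMSE with $1/n$ normalization and that the improvement is the relative reduction of an error indicator, $(\text{before} - \text{after})/\text{before}$, which reproduces the 77.8% and 93.9% figures. The function names and the three-point example are hypothetical.

```python
import math


def rmse(y, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, w)) / len(y))


def r2(y, w):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, w))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot  # Eq. (20)


def mae(y, w):
    return sum(abs(a - b) for a, b in zip(y, w)) / len(y)  # Eq. (21)


def mape(y, w):
    return sum(abs((a - b) / a) for a, b in zip(y, w)) / len(y)  # Eq. (22)


def improvement(before, after):
    """Relative reduction of an error indicator (assumed definition)."""
    return (before - after) / before


# R^2 turns negative when the errors exceed the spread of the reference values.
reference = [10.0, 12.0, 14.0]
biased = [30.0, 35.0, 40.0]  # hypothetical uncalibrated readings
print(r2(reference, biased))

# Figures quoted in the text: RMSE of PM2.5 and MAPE of NO2.
print(round(100 * improvement(22.436, 4.97), 1))
print(round(100 * improvement(2.129, 0.13), 1))
```

The last two lines print 77.8 and 93.9, matching the range reported for the PCA–RVM–NAR model.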

Table 3 The RMSE of self-built point and various air quality prediction models, in which national control point is used as comparison object.

Table 4 The R² of self-built point and various air quality prediction models, in which national control point is used as comparison object.

Table 5 The MAE of self-built point and various air quality prediction models, in which national control point is used as comparison object.

Table 6 The MAPE of self-built point and various air quality prediction models, in which national control point is used as comparison object.

Conclusions

Air quality is related to the quality of human life^3,4. The main pollutants affecting air quality are PM_2.5, PM₁₀, CO, NO₂, SO₂ and O₃. Real-time monitoring of pollutant concentrations helps the government and relevant departments to take corresponding measures against pollution sources. The development of miniature air quality detectors makes it possible to monitor air quality in real time and on a grid basis. However, for various reasons, the measurement accuracy of the miniature air quality detector needs to be improved. The PCA–RVM–NAR model proposed in this study improved the measurement accuracy of the miniature air quality detector by 77.8–93.9%. In addition, the PCA–RVM–NAR model performs very well on both the training set and the test set, indicating a strong generalization ability. The model is built on a total of 4135 sets of data covering all four seasons, which also indicates that it has good stability. However, air quality is affected by many factors, and the PCA–RVM–NAR model does not consider other external factors. Future work can try to introduce more external factors to improve the accuracy of the model. In addition, the climate differs between regions, so the suitability of the model in other regions needs further verification.

Data availability

The data that support the findings of this study are available from the corresponding author B.L. upon reasonable request.

References

Corrigan, A. E., Becker, M. M., Neas, L. M., Cascio, W. E. & Rappold, A. G. Fine particulate matters: The impact of air quality standards on cardiovascular mortality. Environ. Res. 161, 364–369 (2018).
Poloniecki, J. D., Atkinson, R. W., Deleon, A. P. & Anderson, H. R. Daily time series for cardiovascular hospital admissions and previous day’s air pollution in London, UK. Occup. Environ. Med. 54, 535–540 (1997).
Qiu, H. et al. Differential effects of fine and coarse particles on daily emergency cardiovascular hospitalizations in Hong Kong. Atmos. Environ. 64, 296–302 (2013).
Brauer, M. et al. Exposure assessment for estimation of the global burden of disease attributable to outdoor air pollution. Environ. Sci. Technol. 46, 652–660 (2012).
Akimoto, H. Global air quality and pollution. Science 302, 1716–1719 (2004).
Cordero, J. M., Borge, R. & Narros, A. Using statistical methods to carry out in field calibrations of low cost air quality sensors. Sens. Actuators B Chem. 267, 245–254 (2018).
Spinelle, L., Gerboles, M., Villani, M. G., Aleixandre, M. & Bonavitacola, F. Field calibration of a cluster of low-cost available sensors for air quality monitoring. Part A: Ozone and nitrogen dioxide. Sens. Actuator B Chem. 215, 249–257 (2015).
Castell, N. et al. Can commercial low-cost sensor platforms contribute to air quality monitoring and exposure estimates?. Environ. Int. 99, 293–302 (2017).
Liu, Q., Liu, Y., Yang, Z., Zhang, T. & Zhong, Z. Daily variations of chemical properties in airborne particulate matter during a high pollution winter episode in Beijing. Acta Sci. Circumst. 34, 12–18 (2014).
Lu, C. et al. Chemical composition of fog water in Nanjing area of China and its related fog microphysics. Atmos. Res. 97, 47–69 (2010).
Huang, Z. & Zhang, R. Efficient estimation of adaptive varying-coefficient partially linear regression model. Stat. Probab. Lett. 79, 943–952 (2009).
Tai, A. P. K., Mickley, L. J. & Jacob, D. J. Correlations between fine particulate matter (PM_2.5) and meteorological variables in the United States: Implications for the sensitivity of PM_2.5 to climate change. Atmos. Environ. 44, 3976–3984 (2010).
Ayers, G. P. Comment on regression analysis of air quality data. Atmos. Environ. 35, 2423–2425 (2001).
Dun, M., Xu, Z., Chen, Y. & Wu, L. Short-term air quality prediction based on fractional grey linear regression and support vector machine. Math. Probl. Eng. 2020, 1–13 (2020).
Sun, W. et al. Prediction of 24-hour-average PM_2.5 concentrations using a hidden Markov model with different emission distributions in Northern California. Sci. Total Environ. 443, 93–103 (2013).
Oettl, D., Almbauer, R. A., Sturm, P. J. & Pretterhofer, G. Dispersion modelling of air pollution caused by road traffic using a Markov chain–Monte Carlo model. Stoch. Environ. Res. Risk A 17, 58–75 (2003).
Dong, M. et al. PM_2.5 concentration prediction using hidden semi-Markov model-based times series data mining. Expert Syst. Appl. 36, 9046–9055 (2009).
Elangasinghe, M. A., Singhal, N., Dirks, K. N., Salmond, J. A. & Samarasinghe, S. Complex time series analysis of PM₁₀ and PM_2.5 for a coastal site using artificial neural network modelling and k-means clustering. Atmos. Environ. 94, 106–116 (2014).
Suriano, D., Cassano, G. & Penza, M. Design and development of a flexible, plug-and-play, cost-effective tool for on-field evaluation of gas sensors. J. Sensors 2020, 1–20 (2020).
Wang, Z., Feng, J., Fu, Q. & Gao, S. Quality control of online monitoring data of air pollutants using artificial neural networks. Air Qual. Atmos. Health 12, 1189–1196 (2019).
Kyriakidis, I., Karatzas, K., Kukkonen, J., Papadourakis, G. & Ware, A. Evaluation and analysis of artificial neural networks and decision trees in forecasting of common air quality index in Thessaloniki, Greece. Eng. Intell. Syst. 2, 111–124 (2013).
Liu, B., Zhao, Q., Jin, Y., Shen, J. & Li, C. Application of combined model of stepwise regression analysis and artificial neural network in data calibration of miniature air quality detector. Sci. Rep. UK 11, 1–12 (2021).
Arsic, M., Mihajlovic, I., Nikolic, D., Zivkovic, Z. & Panic, M. Prediction of ozone concentration in ambient air using multilinear regression and the artificial neural networks methods. Ozone Sci. Eng. 42, 79–88 (2019).
Zimmerman, N. et al. A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring. Atmos. Meas. Tech. 11, 291–313 (2018).
Liu, B., Yu, W., Wang, Y., Lv, Q. & Li, C. Research on data correction method of micro air quality detector based on combination of partial least squares and random forest regression. IEEE Access 9, 99143–99154 (2021).
Yu, R., Yang, Y., Yang, L., Han, G. & Oguti, M. RAQ—A random forest approach for predicting air quality in urban sensing systems. Sensors 16, 86–104 (2016).
Suarez Sanchez, A., Garcia Nieto, P. J., Riesgo Fernandez, P., Del Coz Diaz, J. J. & Iglesias Rodriguez, F. J. Application of an SVM-based regression model to the air quality study at local scale in the Aviles urban area (Spain). Math. Comput. Model. 54, 1453–1466 (2011).
Liu, B., Jin, Y. & Li, C. Analysis and prediction of air quality in Nanjing from autumn 2018 to summer 2019 using PCR-SVR-ARMA combined model. Sci. Rep. UK 11, 1–14 (2021).
Ortiz-Garcia, E. G., Salcedo-Sanz, S., Perez-Bellido, A. M., Portilla-Figueras, J. A. & Prieto, L. Prediction of hourly O₃ concentrations using support vector regression algorithms. Atmos. Environ. 44, 4481–4488 (2010).
Deo, R. C., Wen, X. & Qi, F. A wavelet-coupled support vector machine model for forecasting global incident solar radiation using limited meteorological dataset. Appl. Energy 168, 568–593 (2016).
Wang, X. & Lu, W. Seasonal variation of air pollution index: Hong Kong case study. Chemosphere 63, 1261–1272 (2006).
Liu, B., Tan, X., Jin, Y. & Li, C. Application of RR-XGBoost combined model in data calibration of micro air quality detector. Sci. Rep. UK 11, 1–14 (2021).
Li, T. Z., Pan, Q. & Dias, D. Active learning relevant vector machine for reliability analysis. Appl. Math. Model. 89, 381–399 (2021).
Olson, D. A., Riedel, T. P., Offenberg, J. H., Lewandowski, M. & Kleindienst, T. E. Quantifying wintertime O₃ and NO_X formation with relevance vector machines. Atmos. Environ. 11, 1–8 (2021).
Tipping, M. E. Sparse bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001).
Liu, X., Chen, X., Li, J., Zhou, X. & Chen, Y. Facies identification based on multikernel relevance vector machine. IEEE Trans. Geosci. Remote 10, 7269–7282 (2020).
Qin, W., Liu, F., Tong, M. & Li, Z. A distributed ensemble of relevance vector machines for large-scale data sets on spark. Soft Comput. 10, 7119–7130 (2021).
Khojasteh, D. N., Goudarzi, G., Taghizadeh-Mehrjardi, R., Asumadu-Sakyi, A. B. & Fehresti-Sani, M. Long-term effects of outdoor air pollution on mortality and morbidity-prediction using nonlinear autoregressive and artificial neural networks models. Atmos. Pollut. Res. 2, 46–56 (2020).
Liu, B. et al. A data calibration method for micro air quality detectors based on a LASSO regression and NARX neural network combined model. Sci. Rep. UK 11, 1–12 (2021).
Karagulian, F., Barbiere, M., Kotsev, A., Spinelle, L. & Borowiak, A. Review of the performance of low-cost sensors for air quality monitoring. Atmosphere 9, 506 (2019).
Samia, A., Kaouther, N. & Abdelwahed, T. A hybrid ARIMA and artificial neural networks model to forecast air quality in urban areas: Case of Tunisia. Adv. Mater. 518, 2969–2979 (2012).

Acknowledgements

This work was supported by the key scientific research project in Nanjing Vocational University of Industry Technology (No. YK17-10-02).

Author information

Authors and Affiliations

Public Foundational Courses Department, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China
Bing Liu
School of Intelligent Manufacturing, Sanmenxia Polytechnic, Sanmenxia, 472000, China
Yirui Zhang

Contributions

B.L. wrote the main manuscript text, and Y.Z. was responsible for data processing and model verification.

Corresponding author

Correspondence to Bing Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liu, B., Zhang, Y. Calibration of miniature air quality detector monitoring data with PCA–RVM–NAR combination model. Sci Rep 12, 9333 (2022). https://doi.org/10.1038/s41598-022-13531-4

Received: 21 February 2022
Accepted: 25 May 2022
Published: 04 June 2022
DOI: https://doi.org/10.1038/s41598-022-13531-4
Springer Nature Limited
