Regression methods in the calibration of low-cost sensors for ambient particulate matter measurements

The article presents a comparison of regression methods used to obtain calibration formulas for low-cost optical particulate matter sensors. Data for analysis were taken from a 1-year collocation study of PMS7003 sensors (Plantower) with a research-grade instrument, TEOM 1400a. The PM2.5 fraction was considered in this study. The measurements showed that PMS7003 was characterized by high reproducibility between units (the coefficient of variation was lower than 10%), but the raw sensor outputs significantly overestimated PM2.5 concentrations. Data analysis revealed that simple univariate models were sufficient to obtain a good fit to TEOM data; notably, the best results were achieved for raw PM1 outputs (R² ≈ 0.81). The fitting quality improved when multi-variable equations were examined (R² ≈ 0.84). The addition of temperature and relative humidity to the models was also beneficial (R² ≈ 0.87). A stepwise selection algorithm was used to choose the best subset of variables in the model. The results of that method were compared with the “all possible regressions” approach, demonstrating the convenience of stepwise regression. Data from the Plantower sensor were also used to train an artificial neural network. That algorithm proved to be very effective for fitting data from one sensor (R² ≈ 0.9), but it was susceptible to deviations in the data from the other units. In general, regression analysis proved to be useful for sensor systems for ambient particulate matter measurements.


Introduction
In recent years, the progress in the field of electrical engineering has led to the expansion of air pollution sensing devices [1,2]. Different sensors are currently available on the market, allowing the measurements of various gaseous species and particulate matter (PM) [3].
Generally, sensor devices are characterized by small size and weight, relatively low power requirements and short response time [2][3][4]. Significantly, their price is a few orders of magnitude lower than that of traditional air quality measurement instruments. These so-called low-cost sensors are changing the paradigm of air pollution monitoring, hitherto based on expensive and complex instruments operated by governmental, industry or research agencies [2,5].
The possibilities of using sensors in the measurements of air quality are very wide. Those inexpensive devices might be used for the improvement in the spatial coverage of ambient air pollution data. Therefore, they can supplement the conventional monitoring stations networks [2,6]. They could also provide data in real time (or near real time) and increase the temporal resolution of measurements. This feature is useful for "hot spot" detection or indication of elevated pollutant events [7,8].
Many low-cost sensors are easy to use and are often adopted by citizen scientists. This trend is particularly important in raising public awareness of air pollution [9][10][11]. Compact and lightweight versions of sensors with high-resolution data acquisition could also be used for personal exposure monitoring. Such monitors might be helpful in finding a link between short-term pollutant exposures and health effects [11][12][13][14].
The application of low-cost sensors is not limited to atmospheric air; sensor techniques might be used for indoor air quality assessment as well. Characterization of indoor concentrations, identification of emitting sources and management of ventilation rates and energy are just a few examples of the use of sensors in indoor spaces [15][16][17][18][19].
It should be noted that a sensor is always only a constituent of a larger whole, a sensor system [4]. Such a system might take the form of a stand-alone monitor (stationary, hand-held, portable, mobile or wearable [10,20,21,22]) or might be integrated into a node of a widespread network [23][24][25][26]. A sensor system may contain one or many pollutant sensors and often includes additional sensors for temperature and/or relative humidity measurements [4]. Besides the sensing devices, the system includes other, configuration-dependent elements: housing, sampling probe, power source, control board, data acquisition and data analysis module, data transmission module and positioning system [2,4,23].
In general, sensors have an analogue output (voltage or current) or a digital output (e.g. in the form of mass or volume concentration). However, the sensor response might be strongly influenced by cross-sensitivities (in the case of gas sensors), particle properties (in the case of particulate matter sensors) or environmental factors (in both cases). Therefore, data quality is a critical issue in the use of low-cost sensors, and calibration or recalibration of sensors before deployment is necessary in many situations [3,4]. The most popular way of making such a data adjustment is based on field collocation with a reference-grade or research-grade instrument [27][28][29][30][31]. During this "training" period, the relationship between raw sensor data and reference data is established and the data correction algorithm is developed [32]. For this reason, the most important part of a sensor system might be the data analysis module, where data processing occurs.
Overall, different approaches are used to create calibration formulas. In some cases, simple linear models are sufficient to adjust the raw data [8,33]. In other cases, nonlinear equations or multi-parameter methods are necessary to obtain results close to the reference [25,30,34,35]. More sophisticated techniques from the field of machine learning are also utilized for this purpose [36,37]. This article presents a comparison of different algorithms for the adjustment of data from low-cost optical particulate matter sensors. Data for testing were collected during a 1-year collocation study with a research-grade instrument (TEOM 1400a) for PM2.5 measurements. On the basis of previous analyses [27], the PMS7003 sensor from Plantower was chosen for this investigation. This sensor has proved to operate stably over several months of measurements, showed high linear correlation with the comparison instrument and was precise in terms of reproducibility between units [27].
The paper focuses on linear regression methods (univariate and multiple regression); however, a comparison with a nonlinear algorithm (an artificial neural network) was made too.

Measurement site and control instrument
The collocation study took place in Poland at the Meteorological Observatory of Department of Climatology and Atmosphere Protection of University of Wrocław. In the vicinity of the observatory, there are detached houses and allotments and a large municipal park. In this area, the main sources of particulate matter are the individual heating systems in households.
The observatory is equipped with instruments for PM10 and PM2.5 measurements (TEOMs); however, operational problems with the PM10 unit led to the exclusion of this device from the analysis. The TEOM 1400a analyser is an example of a tapered element oscillating microbalance [38], a research-grade instrument with the possibility of near real-time monitoring, which has proved useful for testing low-cost sensors [27,39,40]. The TEOM with a PM2.5 inlet provided 1-min averaged data that were stored in the database.

Measurement set-up for PM sensor
A special measurement box was designed for testing different sensors under the same measurement conditions. The box was made from PVC and was equipped with a rainproof lid, air inlets and a fan forcing the air flow. Power supplies, a microcomputer and USB hubs for connecting the sensors were placed inside this enclosure. The measurement set-up also included a data logger with a temperature and relative humidity (RH) sensor for measuring those parameters in the vicinity of the PM sensors. The box was placed near the TEOM intake (circa 1.5-1.8 m below). Construction details of the measurement box may be found in [27].
PMS7003 (Beijing Plantower Co., Ltd, China) is a small and lightweight sensor (48 × 37 × 12 mm, ~30 g), which can be classified as a low-cost device (approximate price of 15-20 $). PMS7003 is a light-scattering optical sensor that consists of a small measurement chamber with a light-emitting diode, a light detector (photodiode) and a set of focusing lenses. The sensor also uses a microfan to induce the flow of air.
According to the PMS7003 datasheet, the minimum detectable particle diameter is 0.3 μm. The sensor contains a microprocessor that provides digital signals in two forms: 1. Mass concentration (µg/m³) of PM1, PM2.5 and PM10 fractions with a correction factor for "factory environment" ("FE") and for "atmospheric environment" ("AE"); 2. Number of particles per unit volume (0.1 l of air) for 6 size bins: beyond 0.3 μm (bin 1), beyond 0.5 μm (bin 2), beyond 1.0 μm (bin 3), beyond 2.5 μm (bin 4), beyond 5.0 μm (bin 5) and beyond 10.0 μm (bin 6). The product manual states that particle diameters and the number of particles in size bins are estimated on the basis of light-scattering intensities and the light-scattering signal distribution, with the use of Mie theory. It can be deduced that the PM1, PM2.5 and PM10 concentrations are calculated in a subsequent step. However, the details of the calculations, the factory calibration procedures and the type of particles used for calibration are not specified in the datasheet.
Three copies of PMS7003 sensor were mounted inside the measurement box and connected via USB hub with microcomputer. Sensor signals were averaged in 1-min intervals and stored in the database for further analysis.

Data preparation and preliminary analysis
Data from TEOM and PM sensors registered from 21/08/2017 to 20/08/2018 were utilized in this study. 1-min averaged TEOM outputs and Plantower signals were used to create a new set of 1-h averaged data. This type of data is usually provided by automated measuring systems [41] from governmental monitoring stations and is very popular in informing the public about the air quality. Averaging was made only for hours with at least 75% completeness of data.
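As an illustration of the completeness rule described above, the following minimal Python sketch averages 1-min samples into 1-h means and keeps only hours with at least 75% of valid minutes. The function name, the data layout and the sample values are assumptions made for illustration; the study itself was processed in MATLAB and its exact routine is not given in the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_means(samples, min_fraction=0.75, per_hour=60):
    """Average 1-min samples into 1-h means for sufficiently complete hours.

    samples: iterable of (datetime, value) pairs; value is None for a gap.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if value is None:
            continue  # gaps do not count towards completeness
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    needed = min_fraction * per_hour  # 45 valid minutes for the 75% rule
    return {
        hour: sum(vals) / len(vals)
        for hour, vals in sorted(buckets.items())
        if len(vals) >= needed
    }


# Hypothetical data: one complete hour and one hour with only 30 valid minutes.
start = datetime(2017, 8, 21, 0, 0)
full_hour = [(start + timedelta(minutes=i), 10.0) for i in range(60)]
sparse_hour = [(start + timedelta(hours=1, minutes=i), 20.0) for i in range(30)]
result = hourly_means(full_hour + sparse_hour)
```

In this example only the first hour survives, because the second one reaches just 50% completeness.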
The preliminary analysis covered the evaluation of reproducibility between units of the PMS7003 sensor. 1-min and 1-h averaged PM2.5 outputs with the "FE" correction factor were used to calculate the correlations between PMS7003 units and the coefficient of variation (CV). A low CV value indicates high reproducibility of sensor units, and a CV value below 10% is considered acceptable in low-cost sensor studies [21,42]. The PM2.5 output was chosen on the assumption that it should reflect PM2.5 concentrations in the best way.
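The text does not state the exact CV formula, so the sketch below assumes one common convention: for each time step, the sample standard deviation across units is divided by their mean, and the resulting ratios are averaged over time. The function name and the readings are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(readings):
    """Return the mean between-unit CV in percent.

    readings: list of tuples, each holding simultaneous outputs of all units.
    Assumed convention: CV_t = std_t / mean_t, averaged over all time steps.
    """
    cvs = []
    for units in readings:
        m = mean(units)
        if m > 0:  # skip time steps where the ratio is undefined
            cvs.append(stdev(units) / m)
    return 100.0 * mean(cvs)


# Hypothetical PM2.5 "FE" outputs of three units at two time steps.
readings = [(30.0, 33.0, 27.0), (60.0, 66.0, 54.0)]
cv_percent = coefficient_of_variation(readings)
```

With these made-up readings the spread is 10% of the mean at both time steps, which is exactly the acceptability threshold quoted above.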
The other aspect of preliminary study was the assessment of sensor signals: mass concentrations and number of particles in bins. Additionally, combinations of bins in form of differences between bin 1 and the other bins were taken into account. Sensor outputs were investigated on the basis of Pearson's correlation coefficient (r).
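A minimal sketch of this screening step is given below: a bin difference is formed and its Pearson correlation with the control signal is computed. The particle counts and TEOM values are invented for illustration and the function name is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical cumulative counts: bin 1 (beyond 0.3 um) and bin 3 (beyond 1.0 um).
bin1 = [1200.0, 2500.0, 4100.0, 800.0]
bin3 = [90.0, 210.0, 330.0, 60.0]
teom = [8.0, 17.0, 27.0, 6.0]

# Difference "bin 1 - bin 3": particles counted between 0.3 and 1.0 um.
bin_diff = [a - b for a, b in zip(bin1, bin3)]
r = pearson_r(bin_diff, teom)
```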
After the preliminary investigation, the hourly averaged dataset was randomly divided into training set (70% of data) and validation set (30%) that were used to create and evaluate different calibration models. There were 6116 samples in training set and 2621 samples in validation set. Both datasets contained similar range of PM concentrations from TEOM device. All data processing was performed in MATLAB environment.
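The split sizes quoted above follow from 0.7 × 8737 ≈ 6116 samples. A random 70/30 partition of this kind can be sketched as follows; the seed and the function name are assumptions, and the MATLAB routine actually used is not named in the text.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly partition sample indices into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed only for repeatability
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


# 8737 hourly samples in total (6116 + 2621, as reported in the text).
train_idx, valid_idx = split_indices(6116 + 2621)
```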

Evaluation criteria
The performance of calibration equations was assessed for two datasets: training and validation. Two popular goodness-of-fit indicators were used for that purpose: the coefficient of determination (R², dimensionless) and the root mean square error (RMSE, expressed in µg/m³). An R² value near 1 reflects very good agreement with control measurements, and a small R² indicates poor fitting quality. In turn, a small value of RMSE demonstrates a small fitting error.
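Both indicators can be sketched in a few lines. The definitions below are the usual textbook ones (R² as one minus the ratio of residual to total sum of squares, RMSE with the number of samples in the denominator); the text does not say whether the authors used exactly these variants, and the numbers are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical control readings and calibrated sensor values (ug/m3).
teom = [10.0, 20.0, 30.0, 40.0]
fitted = [12.0, 18.0, 33.0, 37.0]
fit_rmse = rmse(teom, fitted)
fit_r2 = r_squared(teom, fitted)
```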

Univariate regression
Generally speaking, regression analysis is used for determining the relationship between two or more variables. A regression model includes the dependent variable (response) and other variables that are thought to provide information on the behaviour of the response: the independent variables (also called predictors or explanatory variables) [43]. In this study, the concept of so-called inverse calibration was adopted [44]. TEOM readouts were chosen as the dependent variable and PM sensor data as predictor variables.
Firstly, the univariate regression with only one independent variable was tested. On the basis of the previous study [27], linear models were assumed to be sufficient to describe the relationship between TEOM and sensor signals. Linear equations of the following form were examined:

$$y = a_0 + a_1 x + \varepsilon$$

where $y$ denotes the dependent variable (TEOM response), $x$ is the independent variable (one of the sensor outputs: one type of mass concentration, the number of particles for a size bin or a bin combination) and $\varepsilon$ is a random term. The regression coefficients $a_0$ (the intercept) and $a_1$ (the slope) were estimated by the ordinary least squares procedure [43].
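For a single predictor, the ordinary least squares estimates have a closed form: the slope is the ratio of the covariance of x and y to the variance of x, and the intercept follows from the means. The sketch below applies this to invented values; the function and variable names are assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (a0, a1) for y = a0 + a1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx       # slope
    a0 = my - a1 * mx    # intercept
    return a0, a1


# Hypothetical raw sensor output and control readings lying on y = 2 + 0.4 x.
sensor_pm1 = [10.0, 20.0, 40.0, 80.0]
teom = [6.0, 10.0, 18.0, 34.0]
a0, a1 = fit_line(sensor_pm1, teom)
calibrated = [a0 + a1 * x for x in sensor_pm1]
```

A slope below 1 in such a fit would correspond to a sensor that overestimates the control instrument.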

Multiple regression
Linear additive models with several independent variables were examined in this study as well. The general equation taken into account had the form:

$$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_k x_k + \varepsilon$$

where $k$ is the number of independent variables ($x_1, \dots, x_k$) and $a_0, \dots, a_k$ are the regression coefficients.
Two types of models were tested: 1. Model that included different forms of sensor outputs (mass concentrations, particles number in size bins, combination of bins); 2. Model with the mentioned sensor outputs and also with temperature and relative humidity.
In the case of PMS7003, small impact of high levels of RH was previously noticed [27], so the second approach was aimed to assess the validity of including environmental factors in calibration equation.
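A linear additive model with several predictors can be fitted by solving the normal equations. The sketch below does this in plain Python for invented rows of a mass concentration, temperature and RH; it is an illustration only, since numerical software typically uses more stable factorizations, and all names and values are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_multiple(rows, y):
    """Least squares coefficients [a0, a1, ..., ak] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical rows: [PM1 "FE" (ug/m3), temperature (deg C), RH (%)].
rows = [[10, 5, 60], [20, 15, 40], [40, 0, 90], [80, 10, 70], [30, 20, 50]]
y = [2 + 0.4 * r[0] + 0.1 * r[1] - 0.05 * r[2] for r in rows]
coeffs = fit_multiple(rows, y)
```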

Variable selection for multiple regression models
The previously described models contain quite a number of variables, and some of them may be irrelevant and could be eliminated. Generally, multi-variable models may be pruned to obtain simpler formulas that are easier to interpret and to implement. Also, the removal of redundant variables may simplify data acquisition and signal processing. One of the possible strategies for variable selection is stepwise regression [43,44]. In the stepwise selection process, variables are sequentially added to or removed from the model on the basis of their statistical significance. It should be noted that this algorithm finds variable subsets that are locally optimal; the selection of the globally best subset is not guaranteed.
The algorithm applied in this study started from the constant (intercept) term and added or removed predictors in subsequent steps. Only linear additive models were examined, and the F-test was employed for judging the importance of variables. Stringent criteria were applied to obtain the fitted models: the p value for a term to be added to the model was set to 0.005 and the p value for removing a variable was set to 0.010.
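One common way to express the F-test used in such add/remove decisions is the partial F statistic for a single candidate variable. The notation below (RSS for the residual sum of squares, n for the number of training samples, p for the number of predictors already in the model) is introduced here for illustration and is not taken from the source:

```latex
F = \frac{\mathrm{RSS}_{p} - \mathrm{RSS}_{p+1}}{\mathrm{RSS}_{p+1} / (n - p - 2)}
```

Under the null hypothesis that the coefficient of the candidate variable is zero, F follows an F distribution with 1 and n - p - 2 degrees of freedom. With the thresholds quoted above, a variable would be added when the corresponding p value is below 0.005 and removed when it exceeds 0.010.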
The results of stepwise regression were compared to the results of "all possible regressions" approach [43], where models with all possible subsets of variables were created and tested. The discussion on the choice of the best subset size was based on two information criteria: Akaike information criterion (AIC) and Bayesian information criterion (BIC). Both criteria are used to find the tradeoff between accuracy of fit and the number of predictors used in the model [43]. The model with the minimum value of AIC or BIC is the most appropriate in relation to the concerned criterion.
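For least squares models with Gaussian errors, both criteria can be written, up to an additive constant, in terms of the residual sum of squares (RSS). The sketch below uses this form to show how BIC penalizes extra predictors more heavily than AIC for a sample of the size used here. The RSS values are invented, and the exact definitions used by the authors' software may differ by a constant.

```python
from math import log


def aic_bic(rss, n_samples, n_params):
    """Least squares forms of AIC and BIC, up to an additive constant.

    n_params counts all estimated coefficients, including the intercept.
    """
    fit_term = n_samples * log(rss / n_samples)
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * log(n_samples)
    return aic, bic


n = 6116  # size of the training set reported in the text
# Hypothetical RSS values: three extra predictors reduce RSS only slightly.
aic_small, bic_small = aic_bic(rss=108000.0, n_samples=n, n_params=8)
aic_large, bic_large = aic_bic(rss=107800.0, n_samples=n, n_params=11)
```

In this made-up case AIC prefers the larger model while BIC prefers the smaller one, which mirrors the tendency of BIC to point to more compact models.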

Neural networks for PM sensor calibration
Neural network (NN) is a computation system consisting of a number of highly interconnected units (neurons), organized in layers. Each neuron converts received information by means of activation function and produces output value, which might be processed by neurons in the next layer. The most popular NN approach is feedforward network with input, hidden and output layers. The NN training process is based on updating the weights of neurons via supervised learning. After the training, NN gains the unique approximation capabilities [45,46].
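The forward pass of a feedforward network with one hidden layer can be sketched as below. The weights are random and untrained, the hyperbolic tangent is assumed as the sigmoid-type transfer function, and the default of 10 hidden neurons mirrors the architecture reported in this study; training (e.g. with the Levenberg-Marquardt algorithm) is deliberately left out. All names are hypothetical.

```python
import math
import random


def init_network(n_inputs, n_hidden=10, seed=0):
    """Create random weights; index 0 of each weight list is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def predict(network, inputs):
    """Forward pass: tanh hidden layer followed by one linear output neuron."""
    hidden, output = network
    activations = [
        math.tanh(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
        for w in hidden
    ]
    return output[0] + sum(wo * a for wo, a in zip(output[1:], activations))


# 17 scaled inputs (mass concentrations, bins, bin differences, T and RH).
net = init_network(n_inputs=17)
estimate = predict(net, [0.1] * 17)
```

In practice the inputs would be scaled before training, and the weights would be fitted to the training dataset rather than drawn at random.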
A feedforward NN with 10 neurons with a sigmoid transfer function in the hidden layer and a linear output neuron was used in this study. The backpropagation method with the Levenberg-Marquardt algorithm was adopted for training. Patterns for learning and testing were taken only from the training dataset.

Measurements results
Figure 1 presents the results of 1-year PM2.5 measurements with the TEOM control device and the PMS7003 sensor (PM2.5 signals with "FE" and "AE" correction factors for one unit were plotted for clarity). During this period, some power outages and data acquisition problems occurred and caused data gaps for both types of devices. An error in the data transfer script resulted in the loss of bin 6 data (number of particles beyond 10.0 µm), and this type of data was excluded from further analyses.
Nonetheless, all units of Plantower PMS7003 were stable during this campaign and the trends of their outputs were similar to TEOM signals. However, while the 1-h TEOM averages were in the range 1-120 µg/m³, the PM2.5 values from raw sensor outputs were significantly overestimated (about three times in the case of "FE" outputs). The "atmospheric" ("AE") PM2.5 output was also not well suited for field measurements: overestimation by a factor of 2.2 was observed. This may result from factory calibration using particles with properties completely different from those of PM in ambient air. Thus, it was confirmed that this low-cost sensor needs calibration in the final environment of measurements.
Reproducibility between Plantower sensor units was high. The correlation coefficients between all units were higher than 0.990, and the lowest variability was observed for units no. 1 and no. 2, for which the correlation was 0.996 for 1-min data and 0.998 for 1-h data. Outputs from unit no. 3 deviated to some extent from the other unit signals, but were also highly correlated with them (r ≈ 0.99). The scatter of PMS7003 data is presented in Fig. S1 in Supplementary Material.
The coefficient of variation (CV) computed for all units was equal to 8.43% for 1-min averages and 6.47% for hourly data. Taking into account the high reproducibility of PMS7003, only one unit of that sensor was used in further analyses (unit no. 2 was chosen arbitrarily).
Table S1 in Supplementary Material presents the correlation coefficients (r) for 1-min raw outputs from the PMS7003 sensor (mass concentrations and number of particles in bins) and for differences between bin 1 and the other bins. The highest correlations were computed for "FE" mass concentrations: the r value between PM2.5 and PM1 was equal to 0.995, and between PM2.5 and PM10 it was 0.998. These results show very high linear relationships between PM mass outputs. The ratio between PM10 and PM2.5 was generally constant and equal to about 1.1, and the ratio between PM2.5 and PM1 was at the level of 1.5. Such simple relationships might not be adequate for ambient air monitoring, where PM mass ratios depend on the pollutant sources and may change during the year [47,48].
Very high linear correlations and similar ratios were observed for "AE" mass concentrations as well. Generally, the "FE" and "AE" outputs were highly correlated (e.g. for PM2.5 r = 0.988), but some transfer functions evidently changed the relationship between "FE" and "AE" signals. In the case of PM2.5 concentration, "FE" and "AE" outputs were the same up to about 25-30 µg/m³, and above that level a nonlinear relationship was observed up to about 100 µg/m³ of "FE" output. Above that threshold, the mass concentrations were again highly linearly correlated, but the "FE" values were 1.5 times higher than the "AE" values. The relationship between the discussed outputs is presented in Fig. S2 in Supplementary Material.
The PM2.5 output was highly linearly correlated (r = 0.989) with bin 3 (particles beyond 1.0 µm). This bin was also the most correlated with PM10 data (r = 0.990). A similar situation was observed for "AE" signals (r ≈ 0.977). In the case of combinations of bins, the difference between bin 1 and bin 5 (i.e. the number of particles beyond 5.0 µm subtracted from the number of particles beyond 0.3 µm) had the highest r value for "FE" PM2.5 (0.982). The linear correlations between bins and mass concentrations were in general very high, but it seems that some nonlinear function might be involved in the calculation of mass concentrations, and this issue requires further consideration.

Univariate regression results
Table 1 presents the results of simple regression fittings for TEOM 1400a and PMS7003 outputs. All regression coefficients are provided in Table S2 in Supplementary Material.
The highest values of the coefficient of determination and the smallest RMSE levels were observed for both datasets for mass concentration outputs with "FE" factors. The best results were observed for PM1 data (R² = 0.815 and RMSE = 5.09 µg/m³ for the training set; R² = 0.801 and RMSE = 4.97 µg/m³ for the validation set). The PM2.5 output, which appears to be dedicated to the measurement of that PM fraction, had a somewhat worse fit.
In the case of bins and bin combinations, better results were obtained in most situations for the latter. The highest R² was at the level of 0.78 in the training set for the difference between bin 1 and bin 2 (the number of all particles beyond 0.5 µm subtracted from the number of particles beyond 0.3 µm). Regarding raw bins, the model with bin 1 showed the smallest error, suggesting that all particles detected by the sensor are mainly related to PM2.5 mass concentration. However, it should be mentioned that the quality of fitting also depends on the quality of the control instrument used for comparison. This aspect is especially significant for the TEOM device, which is susceptible to measurement errors under certain conditions [49][50][51]. The detection possibilities of PMS7003 should therefore be further investigated by means of another reference instrument.

Multiple regression results
Complex multiple regression models were created with full set of mass concentration outputs ("FE" and "AE" types) and bin differences, which have proved to be more correlated with TEOM outputs than raw bins data. The comparison of results of such regression fittings and regression that included temperature and relative humidity is shown in Table 2.
The tested multiple regression models were substantially better fitted to TEOM data than the univariate models. The R² for the first type of multi-parameter model was equal to 0.853 for the training set, and the result for the validation set was 0.837. RMSE values were lower than in the previous case and equal to about 4.5 µg/m³.
The addition of temperature and relative humidity to the equation resulted in further improvement in goodness of fit. The value of R² increased by approximately 0.02 (to the level of 0.87), and the RMSE decreased by around 0.3 µg/m³ (to the level of 4.2 µg/m³). It should be noted that temperature and RH were registered inside the measurement box and may reflect only the environment in the vicinity of the sensors, and only under the conditions of this study. The inclusion of RH in the model may be beneficial because of a small impact of high humidity levels on the performance of PMS7003 [27]. As regards temperature, that parameter was moderately correlated with RH during this measuring campaign, and the impact of temperature on that sensor has not been investigated so far. For this reason, the incorporation of temperature into the calibration equation may be questionable.

Stepwise regression results and selection of the best subset of variables
The stepwise regression algorithm was applied to a dataset with 12 variables: all types of mass concentration, bin differences and both additional environmental factors (temperature and RH). The algorithm performed 12 steps, resulting in a model with eight independent variables and an intercept. The following predictors were chosen by this algorithm: PM10 "FE", PM1 "AE", PM2.5 "AE", PM10 "AE", bin 1-bin 3, bin 1-bin 5 and also temperature and RH. The value of R² did not decrease significantly as compared to the previously described equation and was at the level of 0.874 for the training set and 0.860 for the validation set. The change in RMSE was unnoticeable. Regression coefficients for that model are given in Table S3 in Supplementary Material.
The results of stepwise selection were compared to the selection based on all possible regressions and information criteria: AIC and BIC (Table 3). Generally, all selection methods gave similar results in terms of goodness-of-fit indicators. The choice based on the AIC criterion gave the model with the lowest error for the validation set, but that model consisted of the largest number of predictors (10). The BIC criterion pointed to a more truncated model with seven variables. It should be noted that the BIC value of the model from the stepwise algorithm was only slightly higher than that of the BIC-selected model, and R² and RMSE for both equations were practically the same. The other important issue is that none of the presented models included the raw PM2.5 value with the "FE" factor, but all included "AE" mass concentrations. Moreover, temperature was included in those models together with relative humidity.

Neural network results
A neural network was created with inputs in the form of all types of mass concentration, bins, bin differences, temperature and RH (17 variables). The training of the neural network took 31 epochs, and the results of fitting to TEOM data are presented in Table 4. This algorithm gave the best results in terms of R² and RMSE when compared to the other fitting methods. In the training set, the R² value exceeded 0.9 and a good approximation was also observed for the validation set (R² ≈ 0.88), pointing to satisfactory generalization capabilities of that structure. RMSE was below 4 µg/m³ for both datasets.

Comparison of regression methods
Figure 2 presents the comparison of fitting possibilities of four developed models: (a) univariate regression model with PM1 "FE" data, (b) multiple regression model with 12 variables: mass concentrations, bin differences, temperature and RH, (c) multiple regression model from stepwise selection with eight variables and (d) neural network with 17 inputs: mass concentrations, bins, bin differences, temperature and RH. The goodness of fit for the presented models was good (R² > 0.8) or very good (R² > 0.9). The neural network gave overall better fitting results for training and validation data, but some deviation from the ideal relationship was observed above 100 µg/m³ (Fig. 2d). The linearity of outputs was the highest for the multiple regression models (Fig. 2b and c), and in the case of fitting with only one predictor (Fig. 2a), the performance of data adjustment might be improved by means of some nonlinear equation.
In addition, it has been observed that all models were characterized by somewhat larger data scatter in the concentration range below ~30 µg/m³. This dispersion does not necessarily derive from the sensor operation, but may arise from the performance of the TEOM analyser, as discussed in [27].
Additional evaluation of the developed models was made for measurement results from the two other PMS7003 units [...] greater error when signals from unit no. 3 were taken into account (R² ≈ 0.77, RMSE ≈ 5.6 µg/m³). A similar situation was observed for the neural network: that structure manifested the smallest error for the dataset from sensor no. 2 (R² ≈ 0.9, RMSE < 4 µg/m³) and a good fit to unit no. 1 data (R² ≈ 0.81, RMSE ≈ 5.0 µg/m³), but the performance on the dataset from unit no. 3 was considerably worse (R² ≈ 0.67, RMSE ≈ 6.7 µg/m³). It might be thought that the trained NN structure was overfitted to unit no. 2 signals and still functioned well with comparable data from sensor no. 1, but the dissimilar outputs from unit no. 3 resulted in higher inaccuracy. Such behaviour was not noticed for the multiple regression equations; in particular, the model selected by stepwise regression was robust to the slightly different data from sensor no. 3. The R² value was in the range 0.85-0.87 and RMSE reached 4.2-4.6 µg/m³ when all units of PMS7003 were considered.
Table 3 Results of choice of the best multiple regression models. The first model was chosen with the stepwise regression algorithm, the second was chosen on the basis of the AIC criterion, and the third was chosen on the basis of the BIC criterion.

Conclusions
The results of the 1-year collocation study confirmed that low-cost optical sensors may be a useful tool for indicative monitoring of PM2.5 changes in the ambient air. In particular, sensors like PMS7003 from Plantower could be used in nodes of widely dispersed networks because of the high reproducibility between units. In such a case, calibration equations developed for one unit might be used for others, with a negligible loss of accuracy.
In this paper, different calibration equations were evaluated. The results of univariate regression showed that the raw sensor output dedicated to PM2.5 mass concentration is not necessarily the best option for establishing the relationship with the control instrument. For the PMS7003, the PM1 output was better than the PM2.5 output in terms of higher R² and lower RMSE.
The fitting quality might be improved when multiple regression is taken into account. A set of outputs from PMS7003 (mass concentrations, number of particles in bins or bin differences) may be used to construct more complex models that are better suited to reference data. The inclusion of additional variables, like temperature and relative humidity, could also be beneficial. Furthermore, the results of the conducted comparison demonstrated that stepwise regression might be used to select models that represent a compromise between simplicity and accuracy of fit. Importantly, the selected models did not contain the raw PM2.5 "FE" output.
Regression models were also contrasted with neural network. That algorithm proved to be very effective in adjustment of signals from sensor selected for training. Nevertheless, it was more susceptible to deviations in data from other units. That problem was not so acute in case of regression models.
Overall, the study showed that raw signals from low-cost sensors have to be adjusted to obtain outputs matched with reference devices. In such a case, regression analysis can support the development of a calibration equation for the data processing module of the sensor system.