Coupling of quantile regression into boosted regression trees (BRT) technique in forecasting emission model of PM10 concentration

Air pollution is currently becoming a significant global environmental issue. The sources of air pollution in Malaysia are mobile or stationary. Motor vehicles are one of the mobile sources. Stationary sources originated from emissions caused by urban development, quarrying and power plants and petrochemical. The most noticeable contaminant in the Peninsular of Malaysia is the particulate matter (PM10), the highest contributor of Air Pollution Index (API) compared to other pollution parameters. The aim of this study is to determine the best loss function between quantile regression (QR) and ordinary least squares (OLS) using boosted regression tree (BRT) for the prediction of PM10 concentration in Alor Setar, Klang and Kota Bharu, Malaysia. Model comparison statistics using coefficient of determination (R2), prediction accuracy (PA), index of agreement (IA), normalized absolute error (NAE) and root mean square error (RMSE) show that QR is slightly better than OLS with the performance of R2 (0.60–0.73), PA (0.78–0.85), IA (0.86–0.92), NAE (0.15–0.17) and RMSE (9.52–22.15) for next-day predictions in BRT model.


Introduction
The Air Pollution Index (API) describes the current state of air quality in a given region. The Department of Environment (DOE), Ministry of Environment and Water is one of the government agencies responsible for monitoring air quality at 68 stations in Malaysia. The API was then introduced to measure the cleanliness and efficiency of the air (Leong et al. 2020). The Malaysia Ambient Air Quality Guidelines (MAAQG) is used to determine the level of air quality in Malaysia and is used to measure the concentration levels of particles less than 2.5 μm (PM 2.5 ), particles less than 10 μm (PM 10 ), carbon monoxide (CO), sulphur dioxide (SO 2 ), nitrogen dioxide (NO 2 ) and ozone (O 3 ). When the concentration level is above the level specified in the MAAQG for a long period of time, it will cause negative effects on health and the environment. The API in Malaysia is listed in Table 1 with its categorization as good, moderate, unhealthy and hazardous. Generally, PM 10 is identified as a major pollutant that causes unhealthy conditions (DOE 2018). Therefore, PM 10 is the main focus of this study.
According to Azmi et al. (2010), the main causes of air pollution in Malaysia are either mobile sources from cars, buses and planes or stationary sources from power plants, open burning and wildfires, industrial facilities and others. The occurrence of haze in Malaysia is as a result of biomass burning since 1982 interrupting everyday life in Malaysia (Latif et al. 2018). Several haze episodes have been reported since then. These extreme episodes occurred in 1997, 2005 and 2015. Severe haze episodes were recorded in 1997 due to forest fires and large-scale plantations, especially in southern Sumatra and central Kalimantan, both in a neighbouring  (Zakri et al. 2018). A further episode of extreme haze was reported in 2005 (Sahani et al. 2014) which was mainly on the Peninsula's west coast of Malaysia. At that time, the smoke haze heavily affected the Klang Valley and its surrounding area. It reached its height at the haze emergencies on 11 August 2005, as the Air Pollution Index (API) reading in Port Klang and Kuala Selangor was recorded to be above 500. The latest extreme and long haze episode in Malaysia was reported in September 2015 (Huijnen et al. 2016). PM 10 concentration is the most significant major pollutant released by human activity (Sapini et al. 2015). Specifically the PM 10 concentration, in most cities of Southeast Asia (Reddington et al. 2014) and in Malaysia (Juneng et al. 2011), is justified as the main atmospheric pollutant. PM 10 contributed most to Malaysia's API until 2017. In mid-2017, PM 2.5 had a greater impact on APIs in Malaysia until 2018(DOE 2018. There has been a growing interest in using many statistical models in the prediction of air pollution in recent years. One of these is regression techniques which have been used for a long time as predictive tools in many fields especially in the prediction of air pollution. The benefits of regression models are for its ease of use and efficient execution. However, these models are not very good in the prediction of complex situations, as the linear relationship between the selected parameters is determined (Abdullah et al. 2016). The statistical method is limited in clarifying the factors influencing PM 10 , due to statistical assumptions and the homogeneity of the data. Recent studies have attempted to develop powerful computing intelligence models using machine learning algorithms such as the neural network to predict the complex PM 10 concentration system, which indicate that such models can easily predict the desired value (Abdullah et al. 2017). However, machine learning, more specifically the neural network, is usually used as a black box where there is no specific understanding of the physical characteristic of the technique (Viotti et al. 2002).
The boosted regression tree (BRT) model, another type of machine learning, which combines the advantages of regression trees with the boosted adaptive method, has recently been used in air pollution prediction studies. The boosting method was first developed by Friedman in (2001), and later added a stochastic aspect to the boosting algorithm through a random sample of the training data sets (Friedman 2002). In addition, it can also be used as a general method that is useful to improve the model accuracy of each learning algorithm. The BRT produces an ensemble model by boosting the loss function (such as root mean square error) of the user-defined number of additional trees by minimizing it. In contrast to the black box technique, the BRT method would evaluate the response of variables based on the individual model variable.
It is therefore possible to determine, rank and describe the relationship between variables (Yahaya et al. 2019). The BRT is also capable of handling various types of inputs (i.e. categorical and continuous data) and accepts missing values (Motevalli et al. 2019) and able to deal with multiple forms of loss functions (Ridgeway 2012), such as Gaussian, Laplace, quantile regression (QR), Bernoulli and Poisson.
The loss function is one of the BRT model factor considerations. Ordinary least squares (OLS) loss function has been used by many studies, for the purpose of minimizing the squared error for continuous predictors, which resulted in a better correlation between the observed value and the estimation of the generalized boosting model (GBM) (Gu et al. 2019). However, datasets that have outliers such as air pollution data are not suitable to be used in OLS function. According to Kudryavtsev (2009), QR has become an important robust alternative tool, as it is more resistant to outliers and it is free function and does not have any properties.
The QR has the ability to be more useful and precise, since the non-central location of a distribution can be represented in all quantiles (Lingxin and Naiman 2007). The QR has the capability of including models for all quantiles, evaluating the entire function and calculating the central tendency (such as mean, median and mode) in the entire function of the variable of interest. The advantage of QR is for its robustness against non-OLS distribution which was found by Schlink et al. (2010). It can also be adapted to unbalanced observational frequencies. Due to this property, QR was considered and selected as a loss function strategy for this study.
The aim of this study is to derive air pollution modelling based on the loss function of QR using the BRT method. It is clear from the literature that no study has been conducted using such a method to predict PM 10 concentrations. The finding from the proposed methodology is compared with the prediction obtained from the OLS loss function using the BRT method.

Methodology
The process of data preparation has been conducted in detail to reach for developing the model evaluation as illustrated in Fig. 1. The flow diagram is adapted to the author's research and is reconstructed.

Data preparation
Three urban sites were selected for this study. Table 2 shows the characterization of each station. All stations are located in the peninsular Malaysia. Alor Setar station (CA0040) is located in the northern region, Klang (CA0011) is located in the west coast region and Kota Bharu (CA022) is located in the east coast region as shown in Fig. 2. Data are operated by the Department of Environment's continuous air quality monitoring (CAQM) stations in Malaysia. CAQM is an integrated ambient air quality monitoring device, is outfitted with a variety of ambient air analyses and sensors to identify particular pollutants. The analyses and sensors operate in a continuous mode, with data collected being captured on a microcomputerbased data acquisition system (DAS) that also controls the performance of the analyses and sensors. On an hourly basis, data is collected and transferred to a central computer for review and reporting. The United States Environmental Protection Agency (USEPA) has authorized the monitoring instruments and operational protocols of CAQM stations (Kamarul Zaman et al. 2017).
For data exploration, a descriptive analysis is carried out to determine the existence of extreme values or missing values. Missing data is a problem commonly faced by researchers in environmental studies. Data discontinuities are a major obstacle to the prediction models that require continuous information for the majority of the parts to be used. The absence of any data prevents the ability to accurately conclude or interpret the observation (Noor et al. 2014). The missing data must be processed, because complete data are required to perform statistical analysis. This study used linear interpolation for missing data imputation. According to Noor et al. (2015), this linear interpolation method estimates the missing data better than that of the other methods.

Data pre-processing
Maximum daily data used in this study were furnished by the Department of Environment (DOE), Ministry of Environment and Water of Malaysia for the period of 2002 to 2017. The data for this project are confidential, but may be obtained with Data Use Agreements with the Department of Environment (DOE), Ministry of Environment and Water of Malaysia. The data was 80% randomly selected for training and another 20% for the validation of the model (80% for model development and 20% to evaluate the performance of the model). The variables used in this study consist of gaseous nitrogen dioxide (NO 2 ; ppb), carbon monoxide (CO; ppb), sulphur dioxide (SO 2 ; ppb), ozone concentration (O 3 ; ppb), particulate matter concentration (PM 10 ; μgm −3 ) and meteorological parameters such as wind speed (WS; km/h), relative humidity (RH; %) and temperature (T;°C), as the predictors used to predict

Model development
BRT is a method used to increase the accuracy of a single model by fitting a number of models and combining them for prediction purposes. BRT uses regression trees from the classification and regression tree (CART) and constructs boosts to combine model sets (Grunwald et al. 2020). In the BRT, there are several tuning parameters that need to be controlled such as the number of trees (nt), the learning rate (lr) which is the shrinkage parameter used in each iteration to reduce the contribution of the tree, the complexity of the tree (tc) or the interaction depth which is the maximum tree depth of variable interactions. This study fitted BRT models with varying values for nt (10,000), lr (0.01) and tc (5). In version 3.4.2 of the R software, the BRT model was fitted with version 1.6-3.1 of the GBM. The GBM offers three methods for estimating the optimum number of trees, i.e. the cross validation (CV), the independent test set (test) and the out-of-bag estimation (OOB). This research used 10-fold cross validation as suggested by  to get the optimum number of trees, and then, ten separate testing sets were averaged. Rather than worrying about the block being suitable for testing, CV employs them all, one at a time, and summarizes the results at the end. The independent test set (test) approach uses a single holdout base dataset to determine the optimum number of tree . This research used a 50% held out test set to find the optimum number of trees as suggested by Ridgeway (2017). Out-of-bag estimation (OOB) is used to evaluate the classifier. According to Martinez-Munoz and Suarez (2010), individual classifiers are trained in standard bagging on independent bootstrap samples extracted with replacement from the set of original data. In general, the size of these samples is chosen to align with the number of the original training dataset. This prescription is arbitrary and does not have to be optimal in terms of the ensemble's generalization accuracy. The accuracy of the voting classifier is equal to the average of classifier. Bag.fraction 0.5 was used in this research, as suggested by Ridgeway (2020), to improve predictive performance while using the OOB method. BRT constructs a model as a weighted sum of functions similar to other boosting algorithms. The BRT algorithm steps are summarized accordingly: Start the model with a constant value F 0 (x).The BRT algorithm steps consist of a suitable decision tree and a loss function to determine how well a study is predicted. At each stage, the decision tree h m (x) is chosen to minimize the loss given the current model F m − 1 and its fit F m − 1 (x i ). The residuals r i, m are computed: r i, m is the negative gradient of the ith sample in the mth as the number of trees. h m (x) is set to use the r i, m as the target variable. Fit a regression tree to the residual r i, m values and create the leaf node area R j, m for j = 1, 2, …, J. The weights are obtained by solving the problem of minimization:   The square error is the loss function for the deterministic prediction: For quantile regression, the expression below is used when the α (quantile) value is in range 0 to 1.
R j, m is a leaf node, the jth being the number of leaf in the tree and υ is a learning rate. Update the current model: It is a method of looping that fits the regression tree. Then, once the first tree is added to the model, tree error prediction will be taken into account to balance and boost the accuracy of the next tree.

Model evaluation
Performance indicators in this research work are used to determine the accuracy and errors of BRT with different loss function (OLS and QR). The indicators used to identify the best method for the prediction of PM 10 concentration were the root mean square error (RMSE), normalized absolute error (NAE), predictive accuracy (PA), agreement index (IA) and coefficient of determination (R 2 ). The RMSE and NAE were used to find a model error where a value closer to 0 demonstrated a better model. Meanwhile, the other three performance indicators, i.e. IA, PA and R 2 , were used to verify the accuracy of the model outcome, where a higher accuracy is given by a value closer to 1. The equations displayed in Table 5

Results and discussion
The descriptive statistics and box plots for maximum daily PM 10 Table 7 gives the summary of the descriptive statistics for all parameters' maximum daily data of Klang for 2002 to 2017. The mean values for the area in 16 years are higher than their respective median which indicates that the pollutant distributions are positively skewed (also called right-skewed). The maximum value for air pollutants was PM 10 643 μg/m 3 , O 3 127ppb, CO 10,500 ppb, NO 2 128 ppb and SO 2 150 ppb. Klang has the highest mean and median values compared to other locations. This may be due to the fact that extensive industry operates in Port Klang, the most densely populated and developed region in Malaysia (AL-Dhurafi et al. 2017. It has the smallest standard deviation, despite the highest central tendency value, indicating that this area has continuously encountered very high concentrations. Table 8 demonstrates the result of the descriptive analysis of air pollutant concentration and meteorological parameter for Kota Bharu, Kelantan. The mean values for PM 10 (48.73 μg/m 3 ), O 3 (29.21 ppb), CO (926.26 ppb) and NO 2 (15.15 ppb) were higher than the median value. Therefore, the distributions of these measurements were skewed to the right, indicating that there were several observations of high concentration of air pollutant occurred in the years 2002-2017. Meanwhile, the mean value for RH (91.86%) and T (31.36°C) was lower than the median value which indicates the distribution of data was skewed to the left. These results show that the weather in Kota Bharu is mainly hot and dry, which means that the observation of humidity this year seems to be less humid.
The relative influence (RI) was computed to identify the strength of each predictor-response variable relationship. According to Sayegh et al. (2016), the BRT modelling technique can be used to identify the influence of different predictors on response variable. The most important predictor identified for the maximum daily PM 10 concentration for the next day (D + 1) was PM 10 concentration for the previous day, where Alor Setar has 90.17%, Kota Bharu 59.72% and Klang 54.68%. PM 10 concentration for the previous day played a remarkable role in explaining more than 50% of the variance in the BRT model. The least important predictor was found to be SO 2 , where Alor Setar has 0.30%, Kota Bharu 2.77% and Klang 3.02% (Fig. 4).
The BRT models using OLS loss function and compared test, 10-fold CV and OOB methods are shown in Table 9. Performance indicator has been used to assess the accuracy of the fit to the BRT model in order to determine which method better predicts PM 10 concentration in Alor Setar, Klang and Kota Bharu for the 3 days ahead. This study predicts up to 3 days ahead because, according to Perimula (2012), the government will be able to announce warning status if the API exceeds 101 for more than 72 h.
The best OLS loss function in BRT models with the lowest total ranking is shown in Table 10. For error measurements, the values are ranked from the smallest (rank = 1) to the largest (rank = 3), and for accuracy measurements, the values are ranked from the largest (rank = 1) to the smallest (rank = 3). The total ranking has been determined. This procedure was repeated until the next 3-day (D + 3) prediction to decide the best BRT models for the three stations in this study.
The results show that for the next-day prediction independent test set is better than OOB and CV for all sites. The coefficient of determination (R 2 ) for Alor Setar, Klang and Kota Bharu was 0.70, 0.60 and 0.65, respectively, while the  However, the next 3-day prediction suggests that the CV is the best method for Alor Setar and Klang, but for Kota Bharu independent test set is the best method which predicts PM 10 concentration. Overall, the model's performance verified that the next-day prediction is better than the next 2-day and next 3-day prediction.
Descriptive analysis shows that the data for this study is non-central condition because it contains outlier; therefore, this study uses quantile regression as explained by Kudryavtsev (2009). Performance indicators have been used to identify the best quantile to predict the next-day (D + 1) PM 10 concentration at Alor Setar as summarized in Table 11. Of the five performance indicators used, NAE and IA indicate that 0.5 quantile gave better fit than other quantiles, but the valley differed by just 0.01 with 0.55 quantile. However, RMSE, PA and R 2 have shown that 0.55 quantile is the best quantile in PM 10 concentration models. 0.55 quantile was therefore used to predict the PM 10 concentration models for the OOB method. For CV and OOB methods, the presented results demonstrate that 0.5 gave better fit than other quantiles. After choosing the right quantile to present the best PM 10 concentration prediction models for the next day, repeat the same process for finding the best quantile for the next 2-day and next 3-day prediction for all selected locations.
The best quantile to predict the next 2-day (D + 2) PM 10 concentration at Alor Setar is reported in Table 12. Results show that 0.5 is the best quantile for CV and Test method for the next 2-day prediction, while for OOB is 0.4. The chosen quantile for the next 3-day at Alor Setar is described in Table 13. The findings revealed that all methods (OOB, CV and Test) have the same result, which is 0.55 as the best quantile.
After selecting the best weighting for OOB, CV and Test, the next step is to determine the best method for the next-day, the next 2-day and the next 3-day prediction. The best weighting function was identified for the next day, the next 2 days and the next 3 days in Table 14 for all three monitoring stations by repeating the same procedure for the proposed PM 10 concentration prediction method. The best prediction model for next-day PM 10 concentration in Alor Setar is OOB (quantile = 0.55) with an error of 0.1464 (NAE) and 9.3260 (RMSE), with an accuracy of 0.9177 (IA), 0.8546 (PA) and 0.7291 (R 2 ). For Klang and Kota Bharu, CV (quantile 0.5) is the best method. The CV and Test models were selected to predict the PM 10 concentration for the next 2day while for the next 3-day only Alor Setar shows that OOB The best loss function representing each monitoring station can be identified according to the results of the performance indicator in Table 15. Of the five performance indicators applied, all sites indicate that QR was slightly better than OLS. This is supported by Khan et al. (2019), which states that QR can be utilized for the prediction of extreme events. Norazrin et al. (2018) investigated the Bayesian regression model using conjugate prior distribution and get the results for RMSE (4.66 to 9.88), IA (0.900 to 0.929), PA(0.830 to 0.866) and R 2 (0.614 to 0.665). While, Park et al. (2018) predicted PM 10 concentration in Seoul metropolitan subway stations using artificial neural network (ANN) model and presented R 2 of 0.39 to 0.81. On the other hand, Abdullah et al. (2020) showed the results from performance error ) and NAE (0.33-0.43) by using multiple linear regression for PM 10 forecasting during episodic trans-boundary haze event in Malaysia. In addition, Shaziayani et al. (2018) reported that feed forward back propagation performs better than general regression neural network in Seberang Jaya, Pulau Pinang with an IA of as much as 0.7796 for the next day, 0.6033 for the next 2-day and 0.8024 for the next 3-day predictions. Overall, this implies that the values of performance indicators of this study are almost the same as those of previous researchers. This paper shows that alpha 0.5, 0.55 and 0.60 are the best quantile as recommended by Ul-Saufie et al. (2012), which is appropriate for data on air pollution in Malaysia. Therefore, the proposed model can be used as an alternative method to predict the concentration of PM 10 in Malaysia. Figure 5 shows the comparison between the observed value and predicted value of Alor Setar, Kota Bharu and Klang for the validate data set. The optimum setting value from the training data set is tuned with the number of learning rate at 0.01 and iteration at 10,000. By using the optimum value found in the training process, the accuracy of this BRT prediction is found to be 60.33 to 91.77%.

Conclusion
Overall, these results indicate that the quantile regression has fulfilled the assumptions and the good model for BRT for predicting maximum daily PM 10 concentration. The study findings show that the values of NAE (0.15-0.17), RMSE (9.33-22.25), R 2 (0.60-0.73), IA (0.85-0.92) and PA (0.78-0.85) were good for the next-day predictions. Most of the results used 0.5 as the best quantile which represents the median data, but 0.55 and 0.6 had also been chosen as the best  quantile because the model has more number of outliers compare to the other models. Overall, the results showed that the number of quantile is greater than the median value (0.5). In conclusion, QR is an alternative loss function for BRT to predict the 3 days ahead of PM 10 concentration for all sites and suitable for data containing influence outlier. This model can help local authority to take action to reduce the effect of haze in Malaysia.