This section proposes a new system that leverages the ever-growing set of single-board computers (SBCs) whose hardware is powerful enough to perform a reasonable level of computation at low cost and power consumption. The components of the proposed design are illustrated in Fig. 1.
At the edge of the system sits an SBC, a Raspberry Pi 4, which is responsible for controlling and collecting data from multiple sensing stations via Message Queuing Telemetry Transport (MQTT). The edge device therefore acts as an MQTT broker for all sensing stations, which act as MQTT clients. Each station gathers readings from its connected sensors via the multitude of inputs available on an Arduino-compatible device with Wi-Fi capabilities, such as the NodeMCU, Arduino Uno Wi-Fi, or Uno Wi-Fi R3, amongst others. If Wi-Fi is unavailable, data can be sent to the Raspberry Pi through its General-Purpose Input/Output (GPIO) pins or other inputs. Sensors may include MQ gas sensors, humidity and temperature sensors such as the DHT-11 or DHT-22, and particulate matter (PM) sensors. The stations may be placed in the same city, in industrial or residential locations, or distributed across the country, according to the authority's needs.
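For illustration, a sensing station could publish its readings to the edge broker along these lines. This is a minimal sketch using the paho-mqtt Python client; the broker address, topic layout, and payload fields are assumptions for illustration, not part of the proposed design.

```python
# Minimal sketch of a sensing station publishing readings to the edge broker.
# Broker address, topic names, and payload fields are illustrative assumptions.
import json
import time

import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.10"  # hypothetical address of the Raspberry Pi 4 broker
STATION_ID = "station-01"

client = mqtt.Client(client_id=STATION_ID)  # paho-mqtt 1.x style constructor
client.connect(BROKER_HOST, 1883)
client.loop_start()

while True:
    reading = {
        "station": STATION_ID,
        "pm25": 37.2,        # would come from a PM sensor in practice
        "temp_c": 24.5,      # e.g. a DHT-22 reading
        "humidity": 61.0,
        "ts": time.time(),
    }
    client.publish(f"sensors/{STATION_ID}/readings", json.dumps(reading), qos=1)
    time.sleep(60)  # publish once a minute
```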
After collecting data from the attached sensors for its configured period (typically 24 h to 1 week) [20, 22], the edge device is responsible for calculating the Air Quality Index (AQI) as well as predicting the next time step or steps (minutes, hours, days, or real time) according to its configuration. It may also warn its local vicinity or perform other tasks as configured by the authority or its operator. Afterwards, it may compress the available readings and send them to the central cloud for further processing and prediction on a large scale. A system composed of such edge devices would broadcast their raw data to the cloud, which helps in making pivotal decisions and predicting the next time-steps for the whole area monitored by the system. The cloud would also help estimate and predict the AQI for areas without edge devices, and it may even send corrective data back to the edge devices so that they can better predict air pollution concentration levels in their local region, based on data collected from neighbouring areas.
This system could be used in multiple configurations, including industrial establishments, especially those dealing with environmentally hazardous substances, and other factories in general. In addition, the average consumer would benefit from a system that can work independently of the cloud if required. In governmental settings, it would give the big picture of the air quality situation nationwide. Finally, the system has a flexible configuration: it does not require fixed/static installations and can be mounted on moving vehicles with appropriate adjustments. The system has not been fully implemented yet; only the edge part has been implemented, using a Raspberry Pi 4 device, and the next phase of this research study is to complete the full implementation.
Practical implications for implementing the proposed system
The proposed system will have multiple layers in terms of data flow, as shown in Fig. 2.
The layers presented in the figure show the logical flow of data transmission and processing across the various devices and networks, according to the resources available upon implementation.
The components of the system are:
1. IoT edge devices:

   a. IoT sensors layer: this layer contains the sensors required in the prediction process. The sampling rate can be fixed or controlled by the IoT edge computing nodes layer. The layer can collect multiple readings, including but not limited to: relative humidity level (%), temperature (°C), altitude (m), pressure (hPa), carbon monoxide CO (ppm), carbon dioxide CO2 (ppm), particulate matter of 0.3–10 µm in diameter (µg/m3), ammonium NH4 (ppm), methane CH4 (ppm), wind direction (°), wind speed (m/s), and detected Wi-Fi networks and their signal strength in decibels. This layer comprises:
      i. Wired sensors, which transmit data to the next layer through numerous methods, such as Inter-Integrated Circuit (I2C), Serial Peripheral Interface (SPI), and Universal Asynchronous Receiver/Transmitter (UART).
      ii. Wireless sensors, which send data via wireless protocols (ZigBee/Z-Wave). Data can be carried using MQTT for Sensor Networks (MQTT-SN) over the ZigBee protocol.
   b. IoT edge computing nodes: here, smart edge devices process the collected data and either send a summary or a stream of the current readings to the cloud, or perform the required local prediction directly using the computing power available to them. Examples of these nodes are SBCs, Arduinos, and Arduino-compatible devices.
2. IoT network/internet: communication between the IoT edge devices and the IoT cloud is carried through this layer. First, IoT gateways coordinate between the various IoT edge nodes in terms of network usage and cooperation; for example, SBCs from the previous layer could be used as IoT gateways. The connections are then relayed to the cloud via many possible network facilities, such as mobile technologies (2G/3G/4G/5G/Narrowband IoT), Low-Power Wide Area Network (LPWAN) technologies such as Long-Range Wide Area Network (LoRaWAN) and Sigfox, or Wi-Fi. Finally, this layer must provide a secure and reliable link to the IoT cloud layer, with good coverage across the area to be monitored.
3. IoT cloud: all data collected from the various stations in the system are processed in this part of the data flow. This part could be optional if prediction is made entirely on the edge devices; however, for a bigger picture and more accurate results, central management and processing add higher value.

   Usually, the processing cloud comprises Infrastructure-as-a-Service (IaaS) or Container-as-a-Service (CaaS) cloud services, on top of which other services may run. For example, MQTT brokers may run in a container hosted in a virtual machine, or they can run directly on the hypervisor if supported, as in vSphere 7.0 by VMware [32]. The container can also run in various systems, such as Amazon Elastic Compute Cloud (EC2), serving containers like Docker and Kubernetes. A virtual machine could have one container instance running the MQTT broker and another running web services conforming to REpresentational State Transfer (REST) standards, also known as RESTful web services (a sketch follows this list). In addition, a NoSQL (Not Only SQL) database server and a web server would run in the virtual machine to serve the RESTful requests forwarded by the broker and to store the required data, respectively. Many virtual machines may exist for multiple areas for scalability. The stored data can be processed, and coordination between IoT devices can be managed, by a specialized IoT Platform-as-a-Service software tool. To make large-scale predictions and decisions, data analytics and business intelligence, as well as specialized AI prediction algorithms, may be deployed.
4. Front-end clients: web service API calls may be made to deliver helpful information to various clients and to create alerts and historical or live maps of the requested area's situation.
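As a sketch of the RESTful service mentioned in the IoT cloud layer above, the following minimal Flask application accepts readings forwarded from the broker and serves them back to front-end clients. Flask, the endpoint paths, and the in-memory store are illustrative assumptions; a production deployment would persist to a NoSQL database and run inside a container as described above.

```python
# Minimal sketch of a cloud-side RESTful service (illustrative, not the
# study's implementation). A real deployment would use a NoSQL store.
from flask import Flask, jsonify, request

app = Flask(__name__)
readings = []  # stand-in for a NoSQL database


@app.route("/api/readings", methods=["POST"])
def ingest():
    # Called by a broker-to-REST bridge forwarding MQTT messages.
    readings.append(request.get_json(force=True))
    return jsonify({"stored": len(readings)}), 201


@app.route("/api/readings/<station_id>", methods=["GET"])
def history(station_id):
    # Front-end clients fetch historical data for maps and alerts.
    return jsonify([r for r in readings if r.get("station") == station_id])


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```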
Prediction algorithms
To help build the proposed system, multiple prediction algorithms were compared to determine the best and most efficient one for use at both the edge and the central cloud.
Non-linear AutoRegression with eXogenous input (NARX) model
NARX is mainly used in time series modelling. It is the non-linear variant of the autoregressive model with exogenous (external) input. An autoregressive model determines its output as a linear function of its own past values; NARX, accordingly, relates the current value of a time series to previous values of the same series and to current and earlier values of the driving (exogenous) series. A function maps the input values to an output value. This mapping is usually non-linear (hence NARX) and can be any mapping function, including neural networks, Gaussian processes, and other machine learning algorithms. The general concept of NARX is illustrated in Fig. 3 [33].
The model takes input features from sequential time-steps t and stacks past time-steps in parallel as exogenous inputs, each of order (length) q. If required, each of these features can be delayed by d time-steps. That is, for each input feature, the exogenous order q determines how many time-steps to include, and the delay d shifts that window back in time by d time-steps. Figure 3 shows this with a single input feature, marked x1, using input order q1 and delay d1. Meanwhile, the target values are stacked similarly, representing an autoregression order of length p. Direct AutoRegression (DAR) is another variant in which the predicted output, rather than the measured one, is used as the autoregression source [34]. A Python library named fireTS [35] implements NARX using any scikit-learn-compatible [36] regression library as the mapping function. NARX can be represented mathematically as [34]
$$ \hat{y}\left( {t + 1} \right) = f\left( {y\left( t \right), y\left( {t - 1} \right), y\left( {t - 2} \right), \cdots ,y\left( {t - p + 1} \right),X\left( {t - d} \right),X\left( {t - d - 1} \right), X\left( {t - d - 2} \right), \cdots , X\left( {t - d - q + 1} \right)} \right) $$
(1)
where \(\widehat{y}\) is the predicted value, \(f\left(.\right)\) is the non-linear mapping function, \(y\) is the target output at various time-steps \(t\), \(p\) is the autoregression order, specifying how many time-steps of the prediction target to use, \(X\) is the input feature matrix, \(q\) is a vector specifying the order of the exogenous input, determining how many time-steps to inject from each input feature, and \(d\) is a vector representing the delay applied to each input feature.
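In the fireTS library mentioned above, these symbols map directly onto the NARX constructor: auto_order corresponds to p, exog_order to q, and exog_delay to d. A minimal usage sketch follows; the data shapes and parameter values are illustrative assumptions.

```python
# Sketch: NARX from the fireTS library with a scikit-learn regressor as the
# non-linear mapping function f(.). Data shapes and orders are illustrative.
import numpy as np
from fireTS.models import NARX
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 2)   # two exogenous input features
y = np.random.rand(1000)      # target series (e.g. PM2.5)

model = NARX(
    RandomForestRegressor(),  # f(.): any scikit-learn-compatible regressor
    auto_order=24,            # p: past target values used (autoregression)
    exog_order=[24, 24],      # q: time-steps injected per input feature
    exog_delay=[0, 0],        # d: delay applied to each input feature
)
model.fit(X, y)
y_hat = model.predict(X, y, step=1)  # one-step-ahead prediction
```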
Long short-term memory (LSTM)
The long short-term memory algorithm is one of the algorithms used frequently for analysing time series data. It receives not only the present input but also results from the past: the output at time (t−1) is fed back as an input at time (t), alongside the fresh input at time (t) [37]. Hence, there is 'memory' stored within the network, in contrast to feedforward networks. This is a crucial feature of LSTM, as the network retains information about the preceding sequence itself and not just the outputs [38]. Air pollutants vary over time, and health threats are related to long-term exposure to PM2.5; over long periods, it is manifest that the best predictor of forthcoming air pollution is the prior air pollution [39]. Simple Recurrent Neural Networks (RNNs) often need to relate the final output to input data from many time-steps earlier, but their ability to store more than a few time-steps is limited: the repeated (exponentially many) multiplications within the hidden layers produce derivatives that progressively fade away, so the computation required to execute the learning task becomes difficult for computers and networks [37].
For this reason, LSTM is a suitable model because it preserves errors within a gated cell, whereas a simple RNN usually has lower accuracy and major computational bottlenecks. A comparison between a simple RNN and an LSTM RNN is presented in Figs. 4 and 5 [40].
It is evident from Figs. 4 and 5 that the memory elements in Fig. 5 are the main difference between the structures of the RNN and the LSTM.
The process of forward training of LSTM is formulated via the following equations [41]:
$$ f_{t} = \sigma \left( {W_{f} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{f} } \right) $$
(2)
$$ i_{t} = \sigma \left( {W_{i} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{i} } \right) $$
(3)
$$ C_{t} = f_{t} *C_{t - 1} + i_{t} *\tanh \left( {W_{C} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{c} } \right) $$
(4)
$$ o_{t} = \sigma \left( {W_{o} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{o} } \right) $$
(5)
$$ h_{t} = o_{t} *\tanh \left( {C_{t} } \right) $$
(6)
where \({i}_{t}\), \({o}_{t}\) and \({f}_{t}\) are the activations of the input gate, output gate and forget gate, respectively; \({C}_{t}\) and \({h}_{t}\) are the activation vectors for each cell and memory block, respectively; and \(W\) and \(b\) are the weight matrix and bias vector, respectively. \(\sigma \left(\cdot \right)\) is the sigmoid function defined in (7), and \(\tanh\left(\cdot \right)\) is the hyperbolic tangent function, specified in (8).
$$ \sigma \left( x \right) = \frac{1}{{1 + e^{ - x} }} $$
(7)
$$ \tanh \left( x \right) = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }} $$
(8)
Random forests (RF)
The Random Forests algorithm can be defined as a collection of decision trees, where each tree is constructed using the best split among a subset of predictors picked randomly at each node. For the prediction step, the majority vote over the trees is taken.
Random Forests have two main parameters: ntree, the number of trees (bootstrap samples), and mtry, the number of predictors sampled at each node. The algorithm starts by obtaining ntree bootstrap samples from the original data. Next, an unpruned classification or regression tree is grown for each sample, using mtry randomly sampled predictors at each node, where the best split among them is chosen. Eventually, predictions are carried out by aggregating the predictions of the ntree trees, such as the average or median for regression and the majority vote for classification.
To calculate the error rate, predictions on the out-of-bag samples, i.e. the data not included in a bootstrap sample, are used [42, 43].
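In scikit-learn terms, ntree corresponds to n_estimators and mtry to max_features; a minimal regression sketch with illustrative parameter values:

```python
# Sketch: Random Forest regression with scikit-learn.
# n_estimators plays the role of ntree; max_features plays the role of mtry.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

rf = RandomForestRegressor(
    n_estimators=100,     # ntree: number of bootstrapped trees
    max_features="sqrt",  # mtry: predictors sampled at each split
    oob_score=True,       # error estimated on out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag R^2 estimate
```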
Extra trees (ET)
The Extra Trees machine learning algorithm is a tree-based ensemble approach for supervised regression and classification problems. Its central notion is to construct an ensemble of regression trees or unpruned decision trees according to the classical top-down procedure. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample.
Extra Trees and Random Forest revolve around the same idea. However, Extra Trees draws the cut-point of each candidate feature at random when splitting a node and then keeps the best of these random splits [44]. Another distinction between Extra Trees and Random Forest is that Extra Trees uses the entire training dataset to train every single regression tree, whereas Random Forest trains the model using the bootstrap replica technique [45].
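The scikit-learn implementation makes these two distinctions visible: ExtraTreesRegressor defaults to bootstrap=False, so each tree sees the full training set, and candidate split thresholds are drawn at random. A brief sketch with illustrative data:

```python
# Sketch: Extra Trees regression with scikit-learn. Unlike RandomForestRegressor,
# bootstrap defaults to False (each tree is trained on the whole dataset) and
# candidate split thresholds are drawn at random.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

et = ExtraTreesRegressor(n_estimators=100, max_features="sqrt", random_state=0)
et.fit(X, y)
print(et.predict(X[:3]))
```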
Gradient boost (GB)
Gradient Boost is one of the ensemble-learning techniques in which a collection of predictors is combined to give a final prediction. Boosting requires the predictors to be built sequentially; training data are fed into the predictors without replacement, so that each new predictor learns from the previous ones [46]. This sequential process reduces the time required to reach accurate predictions. In addition, gradient boosting uses weak learners/predictors, usually decision trees, to build a more complex model additively.
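A minimal scikit-learn sketch of this sequential, additive construction; the hyperparameters are illustrative assumptions:

```python
# Sketch: gradient boosting with shallow decision trees as weak learners.
# Each new tree is fitted against the errors of the current ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

gb = GradientBoostingRegressor(
    n_estimators=200,   # trees added sequentially
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # weak learners: shallow trees
    random_state=0,
)
gb.fit(X, y)
print(gb.predict(X[:3]))
```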
Extreme gradient boost (XGB)
XGBoost is another scalable ensemble machine learning algorithm for gradient tree boosting, used widely in computer vision, data mining, and other domains [47]. The ensemble model used in XGBoost, usually a tree model, is trained additively until stopping criteria are satisfied, such as the number of early-stopping rounds or the boosting iteration count, amongst others. The objective at the \(t\)-th iteration is to minimize the following approximated formula [47]:
$$ {\mathcal{L}}^{\left( t \right)} \simeq \mathop \sum \limits_{i = 1}^{n} \left[ {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right) + \partial_{{\hat{y}^{{\left( {t - 1} \right)}} }} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)f_{t} \left( {x_{i} } \right) + \frac{1}{2}\partial_{{\hat{y}^{{\left( {t - 1} \right)}} }}^{2} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right) $$
(9)
where \({\mathcal{L}}^{\left(t\right)}\) is the solvable objective function at the \(t\)-th iteration, \(l\) is a loss function that calculates the difference between the prediction \(\widehat{y}\) of the \(i\)-th item at the \(t\)-th iteration and the target \({y}_{i}\), \({\partial }_{{\widehat{y}}^{\left(t-1\right)}}l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\) and \({\partial }_{{\widehat{y}}^{\left(t-1\right)}}^{2}l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)\) are the first- and second-order gradient statistics of the loss function, \({f}_{t}\left({x}_{i}\right)\) is the increment added at this iteration, and \(\Omega \left({f}_{t}\right)\) is a regularization term penalizing model complexity.
XGBoost is currently one of the most efficient open-source libraries, as it allows fast model exploration and uses minimal computing resources. These merits have led to its use as a large-scale, distributed, and parallel machine learning solution. Moreover, XGBoost generates feature-importance scores according to how frequently a feature is used to split the data, or based on the average gain a feature introduces when used for node splitting across all of the trees formed. This characteristic is of great use for analysing the factors that increase PM2.5 concentrations.
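For example, the xgboost Python package exposes both kinds of importance score: "weight" counts how often a feature is used for splitting, while "gain" averages the gain it introduces per split. A brief sketch with illustrative data:

```python
# Sketch: XGBoost regression and feature-importance scores.
# importance_type "weight" = split frequency; "gain" = average gain per split.
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 5)
y = X[:, 0] * 3 + np.random.rand(500)  # feature 0 dominates, for illustration

model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
model.fit(X, y)

booster = model.get_booster()
print(booster.get_score(importance_type="weight"))
print(booster.get_score(importance_type="gain"))
```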
Random forests in XGBoost (XGBRF)
Gradient-boosted decision trees and other gradient-boosted models can be trained using XGBoost, and so can Random Forests, because the two share the same model representation and inference; only their training algorithms differ. XGBoost can use Random Forests as a base model for gradient boosting, or it can be used to train standalone Random Forests; XGBRF training focuses on the latter. The algorithm is a scikit-learn wrapper introduced in the open-source XGBoost library, and it is still experimental [48], which means the interface can change at any time.
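The wrapper follows the familiar scikit-learn interface; a minimal sketch with illustrative parameter values (and, being experimental, the interface may change):

```python
# Sketch: standalone Random Forest training through XGBoost's experimental
# scikit-learn wrapper (one boosting round containing many parallel trees).
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 5)
y = np.random.rand(500)

xgbrf = xgb.XGBRFRegressor(
    n_estimators=100,       # number of trees in the forest
    subsample=0.8,          # row subsampling per tree
    colsample_bynode=0.8,   # feature subsampling per split (mtry analogue)
)
xgbrf.fit(X, y)
print(xgbrf.predict(X[:3]))
```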
Proposed NARX hybrid architecture
Our proposed architecture uses NARX's non-linear mapping function as a host for machine learning algorithms. As Fig. 6 illustrates, the input features are passed through a pre-processing stage, which removes invalid data, normalizes features, and converts categorical features to numeric values. The data are then split into training and testing segments: the training segment covers the first four years of data, and testing uses the last year of the dataset described in section "Data description and preprocessing". NARX then trains the machine learning (ML) algorithm with the data in each epoch, as defined by its parameters. The system is finally evaluated using the fifth-year test data.
The proposed architecture can be described in the following steps:
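A minimal sketch of these steps follows, assuming a scikit-learn regressor hosted inside the fireTS NARX wrapper and the year-based split described above; the file name, column names, and hyperparameters are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of the proposed NARX hybrid pipeline: preprocess, split by year,
# train a scikit-learn regressor inside NARX, evaluate on the final year.
# File name, column names, and hyperparameters are illustrative assumptions.
import numpy as np
import pandas as pd
from fireTS.models import NARX
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("beijing_pm25.csv")          # hypothetical file name
df = df.dropna(subset=["pm2.5"])              # remove invalid records

features = ["cumulated_wind_speed", "cumulated_hours_rain"]
train = df[df["year"] <= 2013]                # first four years for training
test = df[df["year"] == 2014]                 # fifth year held out for testing

model = NARX(
    GradientBoostingRegressor(),              # hosted ML mapping function
    auto_order=24, exog_order=[24, 24], exog_delay=[0, 0],
)
model.fit(train[features].values, train["pm2.5"].values)

y_hat = model.predict(test[features].values, test["pm2.5"].values, step=1)
mask = ~np.isnan(y_hat)                       # initial lags yield no prediction
rmse = np.sqrt(mean_squared_error(test["pm2.5"].values[mask], y_hat[mask]))
print(f"RMSE: {rmse:.2f} ug/m3")
```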
Performance evaluation
Evaluation metrics
To assess the performance of the prediction model used and reveal any potential correlation between the predicted and actual values, the following metrics are used in our experiments.
Root mean square error (RMSE)
Root mean square error computes the square root of the mean of the squared differences between predicted and actual values. It is computed as [49]:
$$ {\text{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {P_{i} - A_{i} } \right)^{2} }}{n}} $$
(10)
where n is the number of samples, \({P}_{i}\) and \({A}_{i}\) are the predicted and actual values, respectively.
RMSE has the same measurement unit as the predicted and actual values, which in our study is μg/m3. The lower the RMSE value, the better the model's prediction performance.
Normalized root mean square error (NRMSE)
Normalizing the root mean square error can take many forms. One form divides RMSE by the difference between the maximum and minimum values of the actual data. Comparisons between models or datasets with different scales are better performed using NRMSE. The equation used for its calculation is [50]:
$$ {\text{NRMSE}} = \frac{{{\text{RMSE}}}}{{{\text{Max}}\left( {A_{i} } \right) - {\text{Min}}\left( {A_{i} } \right)}} $$
(11)
Coefficient of determination (R²)
This parameter evaluates the association between actual and predicted values. It is determined as [51]:
$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {A_{i} - P_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {A_{i} - \overline{A}} \right)^{2} }} $$
(12)
where n is the number of records, \({P}_{i}\) and \({A}_{i}\) are the predicted and actual values, respectively, and \(\overline{A }\) represents the mean measured value of the pollutant.
As for the unit of measurement, \({R}^{2}\) is a descriptive statistical index; hence, it has no dimensions or unit of measurement. If the predictions completely match the actual values, then \({R}^{2}=1\). A baseline model whose prediction is always the mean actual value will produce \({R}^{2}=0\), and predictions worse than this baseline will produce a negative \({R}^{2}\).
Index of agreement (IA)
The index of agreement is a standardized measure of the degree of model forecasting error, varying between 0 and 1, proposed by [52]. It is described by:
$$ {\text{IA}} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {\left| {P_{i} - A_{i} } \right|} \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {\left| {P_{i} - \overline{A}} \right| + \left| {A_{i} - \overline{A}} \right|} \right)^{2} }} $$
(13)
where n is the number of samples, and \({P}_{i}\) and \({A}_{i}\) are the predicted and actual measurements, respectively. \(\overline{A }\) represents the mean of the measured values of the target. IA is a dimensionless measure, where 1 indicates complete agreement and 0 indicates no agreement at all. It can detect proportional and additive differences in the observed and predicted means and variances; however, it is overly sensitive to extreme values due to the squared differences.
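For reference, Eqs. (10) to (13) can be computed directly with NumPy; the following is a compact sketch with illustrative values.

```python
# Sketch: the evaluation metrics of Eqs. (10)-(13) implemented with NumPy.
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((predicted - actual) ** 2))              # Eq. (10)

def nrmse(actual, predicted):
    return rmse(actual, predicted) / (actual.max() - actual.min())  # Eq. (11)

def r2(actual, predicted):
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot                                      # Eq. (12)

def ia(actual, predicted):
    a_bar = actual.mean()
    num = np.sum((predicted - actual) ** 2)
    den = np.sum((np.abs(predicted - a_bar) + np.abs(actual - a_bar)) ** 2)
    return 1 - num / den                                            # Eq. (13)

A = np.array([30.0, 45.0, 50.0, 38.0])   # actual PM2.5 (ug/m3), illustrative
P = np.array([28.0, 47.0, 49.0, 40.0])   # predicted values
print(rmse(A, P), nrmse(A, P), r2(A, P), ia(A, P))
```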
Data description and preprocessing
The dataset used comprises meteorological and air pollution data for Beijing, China, from 2010 to 2014 [21], published in the University of California, Irvine (UCI) Machine Learning Repository. This dataset was employed for evaluation purposes only; in subsequent research, data from Egypt will be used when available from authoritative air pollution stations. The dataset encompasses hourly information about numerous weather conditions: dew point and temperature (°C), pressure (hPa), combined wind direction, cumulated wind speed (m/s), cumulated hours of rain, and cumulated hours of snow. It also includes the PM2.5 concentration in µg/m3. Only cumulated wind speed, cumulated hours of rain, and PM2.5 were used in our experiments. All records missing PM2.5 measurements were removed.
Before being used in the chosen prediction algorithms, the dataset was converted into a time series form suitable for a supervised learning problem [53]. To predict the PM2.5 of the next hour, data from the preceding 24 h were used. The transformation was performed by shifting records up by 24 positions (the hours employed as the basis for prediction); the shifted records were then placed as columns next to the present dataset, and this process was repeated recursively to obtain the form: dataset (t-n), dataset (t-n-1), …, dataset (t-1), dataset (t). This shifting was used for the algorithms run independently of the NARX hybrid architecture. To evaluate the algorithms properly, K-Fold splitting with K = 10 was used. K-Fold splits the dataset records into K sets, using K−1 sets for training and one set for testing, in a rotating manner; no randomization or shuffling was used. The input for the LSTM algorithm was rescaled using the scikit-learn StandardScaler API [54] with default parameters; StandardScaler removes the mean and scales to unit variance. To ensure no data leakage [56], scaling and inverse scaling for the training set and the test set were done separately. Dataset statistics are displayed in Table 2.
Table 2 Dataset statistics
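The shifting transformation and evaluation setup described above can be sketched as follows; the file name, column names, and helper function are illustrative assumptions, and the per-segment scaling mirrors the leakage-avoiding arrangement described in the text.

```python
# Sketch: converting the series into a supervised problem by shifting 24 lags,
# then K-Fold (K=10, no shuffling) evaluation with per-segment standard scaling.
# File name, column names, and helper names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def series_to_supervised(df, n_lags=24):
    # Place the previous n_lags hours as columns next to the current record.
    frames = [df.shift(i).add_suffix(f"(t-{i})") for i in range(n_lags, 0, -1)]
    frames.append(df.add_suffix("(t)"))
    return pd.concat(frames, axis=1).dropna()

df = pd.read_csv("beijing_pm25.csv")   # hypothetical file name
df = df.dropna(subset=["pm2.5"])       # records missing PM2.5 are removed

data = series_to_supervised(df[["pm2.5", "cumulated_wind_speed",
                                "cumulated_hours_rain"]])
X = data.drop(columns=["pm2.5(t)"]).values
y = data["pm2.5(t)"].values

for train_idx, test_idx in KFold(n_splits=10, shuffle=False).split(X):
    # Each segment is scaled with its own scaler, as described above,
    # so no statistics cross between the training and test data.
    X_train = StandardScaler().fit_transform(X[train_idx])
    X_test = StandardScaler().fit_transform(X[test_idx])
    # ... fit a model on (X_train, y[train_idx]) and evaluate on X_test ...
```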