## Abstract

A key challenge in building machine learning models for time series prediction is the incompleteness of the datasets. Missing data can arise for a variety of reasons, including sensor failure and network outages, resulting in datasets that can be missing significant periods of measurements. Models built using these datasets can therefore be biased. Although various methods have been proposed to handle missing data in many application areas, the imputation of missing air quality data requires further investigation. This study proposes an autoencoder model with spatiotemporal considerations to estimate missing values in air quality data. The model consists of one-dimensional convolution layers, making it flexible enough to cover the spatial and temporal behaviours of air contaminants. The model exploits data from nearby stations to enhance predictions at the target station with missing data, and it does not require additional external features, such as weather and climate data. The results show that the proposed method effectively imputes missing data for discontinuous and long-interval interrupted datasets. Compared to univariate imputation techniques (most frequent, median and mean imputations), our model achieves up to 65% RMSE improvement, and 20–40% improvement against multivariate imputation techniques (decision tree, extra-trees, *k*-nearest neighbours and Bayesian ridge regressors). Imputation performance degrades when neighbouring stations are negatively or weakly correlated.

## Introduction

Rising population, urbanisation, economic growth and industrial expansion have increased air pollution worldwide [1]. The main causes of air pollution are vehicle exhaust, industrial emissions, agricultural activities and natural disasters, such as volcanic eruptions and wildfires. These sources can produce particulate matter (PM), nitrogen dioxide (\(\hbox {NO}_{2}\)), carbon monoxide (CO), ozone (\(\hbox {O}_{3}\)) and sulfur dioxide (\(\hbox {SO}_{2}\)), among other pollutants [2]. The effect of air contaminants on the human body differs depending on the type of contaminant and the level and duration of any exposure. Air pollution negatively impacts human health and influences socio-economic activities [3, 4]. Concerning human health, air pollution is associated with lung cancer [5, 6], cardiovascular diseases [7,8,9], and impaired cognitive function and human emotion [10, 11]. Premature mortality, negative social and educational outcomes, adverse market liquidity and catastrophic climate events are among the socio-economic consequences of air pollution [12]. Moreover, around 4.9 million deaths were attributed to air pollution in 2017 [13].

Measuring air pollution, together with its potential exposures and health impacts, becomes more challenging when missing data occur. The existence of missing data can influence study interpretations and conclusions [14] and affect the functioning of air quality-related public services [15]. Missing data are a common problem in air pollutant measurement and in other fields such as clinical, energy and traffic research [16,17,18]. The causes of missing data may vary, including sensor malfunction, sensor sensitivity, power outages, computer system failure, routine maintenance, human error and other reasons [19, 20]. Depending on the cause, air pollution data can be missing either over long consecutive periods or in short intervals [21]. While routine maintenance and temporary power outages can cause short intervals of missing data, sensor malfunction and other critical failures can cause longer gaps in data collection.

According to Rubin, incomplete data are classified based on their generating mechanisms, namely missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) [22]. MCAR occurs when data are genuinely missing as a result of random events [23]. MCAR assumes that missing values are a random sample of observed values, which is restrictive [24]. In MAR, the probability of missingness may depend on observed data values but not on those that are missing. Under MAR conditions, it is possible to retrieve the missing values from other observed predictor variables [23, 25]. When the probability of an observation being missing depends on unobserved values, such as the values of the observation themselves, the condition is called MNAR [22, 23, 26]. MNAR is a nonignorable form of missingness and is considered a condition that yields biased parameter estimates [27]. Missing data are most often neither MCAR nor MNAR [26]. For air quality data, the missingness is at least MAR: even though some air contaminant values are missing for unknown reasons (i.e. MCAR), most missing values are caused by explainable circumstances such as routine maintenance, sensor malfunction and power outages [23, 28]. Thus, we assume MAR conditions for the air quality data used in this study.

There are two common ways to handle missing data: delete the missing parts or impute (substitute) the missing values [29]. The deletion method can be further divided into pairwise deletion and listwise deletion. Pairwise deletion discards only the specific missing values, whereas listwise deletion removes the entire record even if only one value is missing. The MCAR assumption allows incomplete observations to be excluded while still yielding unbiased results. However, a higher level of missing values may reduce the precision of the analysis [24]. Moreover, because pollutant measurement generates time-series data, the deletion method can break the data structure, and valuable information may be lost. In contrast to deletion, the imputation method reconstructs the missing data based on available information [30].
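The two deletion strategies can be contrasted with a short pandas sketch (the data frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy hourly pollutant frame with gaps (values are illustrative only).
df = pd.DataFrame({
    "NO2":  [21.0, np.nan, 19.5, 22.1],
    "PM10": [35.2, 33.0, np.nan, 36.4],
})

# Listwise deletion: drop the whole record if any value is missing.
listwise = df.dropna()           # keeps only fully observed rows

# Pairwise deletion: each statistic uses whatever values are present.
pairwise_means = df.mean()       # NaNs are skipped column by column
```

Under MCAR, dropping incomplete rows still yields an unbiased sample, but at high missing rates almost no complete records would survive listwise deletion, which motivates imputation instead.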

Reconstruction techniques inspired by machine learning have been used in recovering corrupted data, one of which is the denoising autoencoder (DAE) [31]. Standard DAE and its variants are implemented in many fields, such as image denoising [32,33,34,35], medical signal processing [36, 37] and fault diagnosis [38, 39]. Some works also utilised DAE for missing data imputation. Gondara et al. [40] tried to answer the challenge of multiple imputation by employing an overcomplete representation of DAEs. The proposed method does not need complete observations for initial training, making it suitable for a real-life scenario. Abiri et al. [41] demonstrated the robustness of DAE in recovering a wide range of missing data for different datasets. Abiri et al. proved that the proposed stacked DAE outperformed other established methods, such as K-nearest neighbour (KNN), multiple imputation by chained equations (MICE), random forest and mean imputations. Jiang et al. [42] utilised DAE for imputing the missing traffic flow data and compared three different architectures composing the DAE, namely standard (“vanilla”), convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). Jiang et al. evaluated the proposed model’s test sets with a general missing rate of 30%. Moreover, splitting traffic data into weekdays and weekends significantly improved the model performances.

The following discussion summarises the recent challenges and breakthroughs in methods for air quality missing data imputation. First, the problem of missing data repeatedly occurs in environmental research, and more studies are required to find effective imputation solutions. Although various methods have been proposed to handle missing data in many fields, more studies addressing air quality missing data prediction are needed [19]; the works mentioned earlier in this section mainly focus on clinical, energy and traffic data. Second, most related studies focused on a small amount of missing data. Ma et al. stated that previous works are applicable to short-interval missing imputation or consecutive missing values with a level of missingness below 30%. This issue was also mentioned by Alamoodi et al. [43]. Few works investigated missing data at large percentages (i.e. more than 80%), whether using deletion or imputation. Third, the multiple imputation method can improve imputation performance [14], so we consider implementing multiple imputation for air quality data a worthwhile endeavour. Fourth, many studies demonstrated the robustness of the denoising autoencoder in recovering noisy data, yet few have applied the denoising autoencoder to missing air quality data imputation. Finally, even though air pollutants strongly relate to spatiotemporal characteristics, these factors are rarely included in predicting the missing values of air pollution data. The air quality data collected from air monitoring stations can hold intensely stochastic spatiotemporal correlations among them [44].

Inspired by the capabilities of the denoising autoencoder to reconstruct corrupted data, we propose an imputation method based on the denoising autoencoder. We implement multiple plausible estimates for specific missing values. We propose a simple method suitable for both short-interval and long-interval consecutive missing imputation and simultaneously offer multiple imputations to obtain less biased results. We use a convolutional denoising autoencoder with spatiotemporal considerations to extract the air pollutant features. The proposed method takes advantage of data from nearby stations to predict the missing data in the targeted station. This method does not need external features like weather and climate data (air temperature, humidity, wind speed, wind direction, etc.). Thus, our proposed method involves only the intended pollutant data from neighbouring stations. We propose a simple yet promising way to estimate missing values in real-world applications.

## Method

### Research framework in general

Our proposed method exploits data from nearby sites to enhance predictions at the target station with missing data. When a target station fails to gather pollutant data from the environment, the neighbouring station data can help to estimate the current loss of the target site. As illustrated in Fig. 1, \(S^3\) fails to collect data and acts as a target station. Neighbouring stations \(S^2\), \(S^5\) and \(S^6\) send their data to \(S^3\). The participating neighbouring stations eligible to send data are chosen based on their correlation coefficients with the target station. We implement a deep autoencoder model at \(S^3\) and use a one-dimensional convolutional neural architecture to cover the spatiotemporal behaviour of pollutant data. Based on the collected spatiotemporal data at target and neighbouring stations, we predict the missing data at the target station.

Figure 2 shows the general research framework used in this study. There are seven main blocks, and each block consists of several tasks. The first block relates to the data sources used in this work. All data sources used in this study are available online, and they can be freely downloaded and used by adhering to the terms described in the given licences. Each dataset contains different hourly air pollutant concentrations. Even though the datasets include several air contaminants, we selected two attributes as the targeted pollutants. Ten monitoring stations are involved in the calculations to acquire the spatial characteristics of air pollutant data. Moreover, we verified our proposed method on three different air quality datasets to achieve less biased results. These datasets cover air quality monitoring in three major cities: London, Delhi and Beijing.

The data pre-processing in the second block is dedicated to examining the correlation coefficients of the targeted air pollutant among air monitoring stations. Calculating the correlation coefficients among pollutant concentrations is one of the main steps conducted in this study. For every target pollutant, we joined the same pollutant data taken from all locations into a single data frame and sorted them by the same hourly timestamp. We then calculated the correlation coefficients and selected the three highest correlations between the targeted and neighbouring monitoring stations. Based on these correlations, the data encompassing spatiotemporal characteristics are determined. The spatial behaviour is obtained using data from the targeted and three neighbouring stations (i.e. four monitoring stations in total). The temporal dependency is acquired by collecting the current value and the seven previous hourly values (i.e. 8 hours of data in total).

The pre-processing procedure in the third block is carried out to make the spatiotemporal features suitable for the proposed deep learning model. All training and test features are normalised to values between 0 and 1, which reduces data variability [45]. Additional pre-processing in this stage includes initialising missing data because the obtained datasets may contain some missing features. If missing data exist in the original dataset, only unbroken series of data with a minimum period of 1 week (168 hours) are considered for the training set. We did not remove the remaining data but did not use them during training. Therefore, several chunks of unbroken data are involved as inputs in the training phase, and according to the number of data fragments, the training steps are done in multiple rounds. This step maintains the temporal behaviour of the time-series data. Once we have a clean dataset, we artificially create random and consecutive missing data, and the artificial missing values are filled with zeros. The final training and test sets are 3-dimensional matrices of size (\(n\times 8\times 4\)). The integer value of *n* indicates the number of training or test sets, 8 denotes the 8-hour observation period, and 4 denotes the number of features taken from four monitoring stations.
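As a minimal sketch of this pre-processing stage, the following code normalises a hypothetical chunk of station data to [0, 1] and initialises missing entries with zeros; the array sizes and values are illustrative only:

```python
import numpy as np

def min_max_normalise(x, eps=1e-8):
    """Scale each feature column to the [0, 1] range, ignoring NaNs."""
    x_min = np.nanmin(x, axis=0)
    x_max = np.nanmax(x, axis=0)
    return (x - x_min) / (x_max - x_min + eps)

# Hypothetical chunk: 200 hours x 4 stations, with a few missing entries.
rng = np.random.default_rng(0)
chunk = rng.uniform(5.0, 80.0, size=(200, 4))
chunk[10:13, 2] = np.nan                 # missing values in a neighbour column

scaled = min_max_normalise(chunk)
scaled = np.nan_to_num(scaled, nan=0.0)  # initialise missing entries with zeros
```

In practice the scaling statistics would be computed on the training split only and reused on the test split, so that no test information leaks into training.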

As indicated in the fourth block, we proposed a deep learning model to handle missing data. In this study, the proposed model architecture is a convolutional autoencoder, meaning that the autoencoder uses convolution layers as the encoding and decoding parts. The proposed convolutional autoencoder acts as a denoising model. By replacing some input features on purpose with zeros, the input sets can be seen as corrupted data, and the model learns to reconstruct these corrupted inputs by minimising the loss function. The training process is shown in the fifth block of the research framework.

The sixth and seventh blocks of the research framework are the post-training interpretation and evaluation steps. The model accepts and yields two-dimensional data, and thus post-training output interpretations are needed to find the intended prediction results. This process involves the aggregation procedure. Finally, some evaluation procedures are taken to examine the trained model, such as calculating error metrics, testing the model on different missing rates and locations, and implementing the proposed algorithm on other air quality datasets.

### Description of the datasets

This study uses air quality datasets from three different cities. A total of 10 stations are selected for each city, and two pollutants per station are studied. We consider ten monitoring stations adequate for implementing our algorithm and evaluating its performance. We also vary the pollutants in each city to demonstrate that our proposed method can be applied to different pollutants. Some considerations are taken into account when selecting the stations. The availability of pollution data and the measurement period for all stations are two of our major concerns. We included stations with at least three years of data from the same period. Furthermore, since our method is based on the correlation coefficient between stations, we include stations with varying degrees of correlation.

The first dataset is air pollutant data of London city. The data were collected using the Openair tool [46]. Openair is an *R* package developed by Carslaw and Ropkins to analyse air quality data. For the London city dataset, we focus on two pollutants: nitrogen dioxide (\(\hbox {NO}_{2}\)) and particulate matter with a diameter of less than \(10\; \upmu \hbox {m}\) (\(\hbox {PM}_{10}\)). We selected ten monitoring stations across London and used data from January 2018 to January 2021.

The second dataset is on India air quality. The dataset was compiled by Rohan Rao from the Central Pollution Control Board (CPCB) website and can be downloaded from Kaggle’s collection [47]. Among many air quality monitoring stations, we selected ten monitoring stations across the city of Delhi from February 2018 to July 2020. The chosen pollutants for the Delhi dataset are hourly measurements of \(\hbox {NO}_{2}\) and PM with a diameter of less than \(2.5\;\upmu \hbox {m}\) (\(\hbox {PM}_{2.5}\)).

The third dataset is Beijing multi-station air quality provided by Zhang et al. [48], which can be downloaded from the UCI Machine Learning Repository page [49]. The dataset contains hourly pollutant data from January 2013 to February 2017. We focused on carbon monoxide (CO) and ozone (\(\hbox {O}_{3}\)) data for the Beijing dataset. We selected ten monitoring stations, namely Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, Huairou, Nongzhanguan, Shunyi and Tiantan. Table 1 summarises the air quality monitoring stations used in this study.

Tables 2, 3 and 4 show brief descriptive statistics of the London, Delhi and Beijing air quality data. As shown in the tables, four descriptive statistics are reported, namely the mean, standard deviation and two quartiles. The mean and standard deviation columns are calculated by excluding the missing values. The standard deviation measures how observed values spread from the mean: a low standard deviation in a station implies that the observed values tend to be close to the mean, whereas a high standard deviation indicates that the observed values are spread over a broader range. The quartiles divide the ordered observed values (i.e. from smallest to largest) into four parts. The first quartile (\(25\%\)) is the middle value between the minimum and the median, whereas the third quartile (\(75\%\)) is the middle value between the median and the maximum.
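These statistics can be reproduced with NumPy's NaN-aware routines, which exclude missing values in the same way (the station readings below are hypothetical):

```python
import numpy as np

# Hypothetical station readings with missing values encoded as NaN.
values = np.array([12.0, np.nan, 18.5, 22.0, np.nan, 15.5, 30.0, 9.0])

mean = np.nanmean(values)            # mean over observed values only
std  = np.nanstd(values, ddof=1)     # sample standard deviation
q1   = np.nanpercentile(values, 25)  # first quartile
q3   = np.nanpercentile(values, 75)  # third quartile
```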

### Correlation of pollutant data

The same pollutant data from all monitoring stations are combined, and the correlation coefficient for each pollutant is calculated. For example, if \(\hbox {PM}_{10}\) is chosen as the target pollutant, then we collect all \(\hbox {PM}_{10}\) values from all monitoring stations. Pearson’s correlation is used to find the relation of pollutant data among monitoring stations. Pearson’s correlation measures the linear correlation between two sets of data and can capture the details between trends of two time-series data [19].

Assume that we have a temporal sequence of specific pollutant data in the targeted station as \(\varvec{S}^t=[s_1^t,s_2^t,s_3^t,\ldots ,s_{n-1}^t,s_n^t]\) and a temporal sequence of the same pollutant data at a neighbouring station as \(\varvec{S}^s = [s_1^s,s_2^s,s_3^s,\ldots ,s_{n-1}^s,s_n^s]\). Note that both \(\varvec{S}^t\) and \(\varvec{S}^s\) have the same time frame ranging from sample 1 to *n*. Then, the Pearson’s correlation coefficient between these two series is described as follows:

$$
r(\varvec{S}^t, \varvec{S}^s) = \frac{\sum \nolimits _{i=1}^{n}\left( s_i^t-\mu _{t}\right) \left( s_i^s-\mu _{s}\right) }{\sqrt{\sum \nolimits _{i=1}^{n}\left( s_i^t-\mu _{t}\right) ^2}\sqrt{\sum \nolimits _{i=1}^{n}\left( s_i^s-\mu _{s}\right) ^2}} \qquad (1)
$$

where \(r(\varvec{S}^t, \varvec{S}^s)\) denotes the Pearson’s correlation coefficient between the time series \(\varvec{S}^t\) and \(\varvec{S}^s\), and \(s_i^t\) and \(s_i^s\) represent the *i*-th samples of \(\varvec{S}^t\) and \(\varvec{S}^s\), respectively. Finally, \(\mu _{t} = \frac{1}{n}\sum\nolimits _{i=1}^{n}s_i^t\) and \(\mu _{s} = \frac{1}{n}\sum\nolimits _{i=1}^{n}s_i^s\) denote the mean values of the time series \(\varvec{S}^t\) and \(\varvec{S}^s\), respectively.

In Eq. 1, the numerator is the covariance, a measurement of how the series \(\varvec{S}^t\) and \(\varvec{S}^s\) vary together around their mean values. The denominator expresses the variances of \(\varvec{S}^t\) and \(\varvec{S}^s\). Correlation is a normalised version of covariance, scaled between \(-1\) and 1 [50]. When \(r = 1\), \(\varvec{S}^t\) and \(\varvec{S}^s\) are said to be completely positively correlated. When \(r = -1\), \(\varvec{S}^t\) and \(\varvec{S}^s\) are completely negatively correlated. Finally, when \(r = 0\), there is no obvious linear correlation between \(\varvec{S}^t\) and \(\varvec{S}^s\) [51].
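A small NumPy sketch of Eq. 1 illustrates these boundary cases; the series are hypothetical, and `pearson` is our own helper rather than a library function:

```python
import numpy as np

def pearson(st, ss):
    """Pearson's r between two equally long series, following Eq. 1."""
    st, ss = np.asarray(st, float), np.asarray(ss, float)
    dt, ds = st - st.mean(), ss - ss.mean()
    return (dt * ds).sum() / np.sqrt((dt ** 2).sum() * (ds ** 2).sum())

target    = np.array([10.0, 12.0, 15.0, 14.0, 18.0])
neighbour = np.array([11.0, 13.0, 16.0, 15.0, 19.0])  # target shifted up by 1

r = pearson(target, neighbour)  # close to 1: completely positively correlated
```

Because `neighbour` is just `target` plus a constant, the deviations from the mean are identical and `r` equals 1; negating one series flips the sign to \(-1\). The result agrees with `np.corrcoef`.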

### Data pre-processing

#### Spatial characteristics

As depicted in Fig. 2, there are two kinds of pre-processing phases conducted in this study, i.e. block number 2 and number 3. The primary purpose of the first pre-processing phase is to find the pollutant correlations. The pollutant correlation among monitoring stations is utilised to capture the spatial characteristic of air contaminants. For each pollutant, we identified which neighbouring stations have the closest spatial relationship with the station under investigation. In other words, we tried to take advantage of the existing monitoring stations to fill the missing values in the targeted monitoring station. Choosing different kinds of air pollutants will vary the correlation coefficients. Thus, the selected monitoring stations might also differ.

Let \(\varvec{S}^t = \begin{bmatrix} s_{1,1}^t &{} s_{1,2}^t &{} \ldots &{} s_{1,n}^t\\ s_{2,1}^t &{} s_{2,2}^t &{} \ldots &{} s_{2,n}^t\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ s_{m,1}^t &{} s_{m,2}^t &{} \ldots &{} s_{m,n}^t \end{bmatrix} = (s_{i,j}^t) \in \mathbb {R}^{m \times n}\) be a matrix containing *m* rows of measurement data and *n* different pollutants in monitoring station *t*, where *t* ranges from 1 to 10. Therefore, we have a pollutant data collection from all stations of \(\varvec{S}^1\), \(\varvec{S}^2\), \(\varvec{S}^3\),..., \(\varvec{S}^{10}\). In this case, each row in matrix \(\varvec{S}^t\) is hourly measurement data. Then, we create a matrix \(\varvec{J} = \begin{bmatrix} s_{1,p}^1 &{} s_{1,p}^2 &{} \ldots &{} s_{1,p}^{10}\\ s_{2,p}^1 &{} s_{2,p}^2 &{} \ldots &{} s_{2,p}^{10}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ s_{m,p}^1 &{} s_{m,p}^2 &{} \ldots &{} s_{m,p}^{10} \end{bmatrix}\) as a collection of the same pollutant *p* taken from all stations, where *p* is a single integer value chosen from 1 to *n*. The value of *p* represents the selected column in \(\varvec{S}^t\). In this scenario, we assume that all monitoring station data in the same city have the same column header. Then, we computed the pairwise correlation of columns in \(\varvec{J}\) using Eq. 1, excluding null/missing values. A graphical representation of this process is presented in Fig. 3.

As shown in Fig. 3, we collect the same pollutant for each monitoring station into a single data frame (or matrix) to achieve this goal. For example, when we calculate the correlation of \(\hbox {PM}_{10}\) among stations in the London dataset, we collected \(\hbox {PM}_{10}\) data from CT3, GN5, GR8, IS2, IS6, LB5, LW4, SK6, TH001 and TH002 monitoring stations into a single data frame. Only the targeted pollutant (i.e. \(\hbox {PM}_{10}\)) is selected, and other pollutants are ignored. Before joining the data, we must ensure that targeted contaminants from all monitoring stations have the same time frame. We implemented these procedures using Python programming with the help of pandas library [52].
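A hedged sketch of this procedure with pandas is shown below; the column names are taken from the London stations, but the series themselves are synthetic stand-ins sharing a common trend:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-01", periods=500, freq="h")

# Hypothetical PM10 series, one station per column, aligned on the timestamp.
base = rng.normal(40.0, 10.0, size=500)          # shared pollution trend
J = pd.DataFrame({
    "CT3": base + rng.normal(0, 2, 500),
    "GN5": base + rng.normal(0, 5, 500),
    "GR8": base + rng.normal(0, 9, 500),
    "IS2": rng.normal(40.0, 10.0, 500),          # unrelated station
}, index=idx)

# Pairwise Pearson correlation of columns, ignoring missing values.
corr = J.corr(method="pearson")

# The three neighbours most correlated with the target station CT3.
neighbours = corr["CT3"].drop("CT3").sort_values(ascending=False).head(3)
```

`DataFrame.corr` skips null entries pairwise, matching the "excluding null/missing values" requirement above.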

Once the target pollutants have been collected, the Pearson’s correlation calculation can be carried out using Eq. 1. For each station, we then sorted the correlation coefficients from the strongest to the weakest. The obtained coefficients indicate how strong the correlation of the same pollutant between two monitoring stations is. Based on this result, the number of monitoring stations involved as input to the proposed model is evaluated. Based on the conducted experiments, we decided to take three neighbouring stations along with the target station. Thus, the input sets of our proposed model have four columns. We explain the process of deciding the number of monitoring sites in Sect. 3.3.

The second phase of the pre-processing blocks (i.e. the block number 3 in Fig. 2) is dedicated to capturing the temporal characteristic of the pollutants, conducting the perturbation procedure and creating input sets suitable for the proposed deep learning model.

#### Temporal characteristics

Besides involving spatial characteristics, this study also tries to capture the temporal behaviour of the pollutant data. The temporal behaviour describes the dependency among pollutants at different times [53]. In this study, we calculate the autocorrelation coefficient of the contaminant under investigation using Pearson’s correlation. We computed this correlation between the series of targeted pollutants and its shifted self. Thus, instead of calculating the correlation between two different time series, the autocorrelation computes the relation between the same time series at current and lagged times. Given time-series of pollutant data at the target station \(\varvec{S}^t=[s_1^t,s_2^t,s_3^t,\ldots ,s_{n-1}^t,s_n^t]\), we can rewrite equation 1 to find the lag-*k* autocorrelation function as:

$$
r_{k} = \frac{\sum \nolimits _{i=k+1}^{n}\left( s_i^t-\mu _{t}\right) \left( s_{i-k}^t-\mu _{t}\right) }{\sum \nolimits _{i=1}^{n}\left( s_i^t-\mu _{t}\right) ^2} \qquad (2)
$$

where \(r_{k}\) denotes the lag-*k* autocorrelation function, *k* is the lag, \(s_i^t\) and \(s_{i-k}^t\) represent the *i*-th and lag-*k* samples of \(\varvec{S}^t\), and \(\mu _{t} = \frac{1}{n}\sum\nolimits _{i=1}^{n}s_i^t\) denotes the mean value of the time series \(\varvec{S}^t\).

In this study, we use an 8-hour window as the length of the pollutant data. This length is obtained by computing the lag-*k* autocorrelation, where the value of *k* determines the size of the input data. As discussed in Sect. 3.2.2, we determine \(k = 7\). Note that the time lag starts at 0 (i.e. \(k=0\)), so \(k=7\) means that we use eight data points in total. In other words, to find a single prediction, we use the current and seven previous observations as the input for our proposed model. To conclude, by involving the pollutant data from the targeted and three neighbouring stations (i.e. spatial consideration) and including the current and seven previous observations (temporal consideration), the final input for the proposed deep learning model has a size of \(8\times 4\).
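The lag-*k* autocorrelation described above can be sketched as follows; the series is a synthetic stand-in with a 24-hour cycle, not real pollutant data:

```python
import numpy as np

def autocorr(series, k):
    """Lag-k autocorrelation of a single series."""
    s = np.asarray(series, float)
    d = s - s.mean()
    # Pairs (s_i, s_{i-k}) for i = k .. n-1, normalised by the total variance.
    return (d[k:] * d[:len(s) - k]).sum() / (d ** 2).sum()

# Hypothetical pollutant series with a daily (24-hour) cycle plus noise.
rng = np.random.default_rng(2)
t = np.arange(24 * 60)
series = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 3, t.size)

lags = [autocorr(series, k) for k in range(8)]  # k = 0 .. 7
```

By construction the lag-0 value is exactly 1, and for an hourly series with a strong daily cycle the coefficients decay (and eventually change sign) as the lag grows, which is the behaviour used to choose the window size.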

#### Missing data and perturbation procedure

Another pre-processing step carried out in this study is to handle the initial missing data in the original datasets. Missing values occur both in discontinuous and consecutive missing patterns. As our proposed model is trained in a supervised manner, we have to provide input-target pairs, and the model fits on training data consisting of input and target sets. While deleting missing data is a straightforward procedure, we avoid it because it can break the data structure, and valuable information may be lost. To minimise damage to the original data structure, we carefully picked series of data with a minimum period of one week (168 hours). As our input sets comprise pollutant data from multiple monitoring stations, the minimum one-week selection is applied only to the target station. We let the other station periods comply with the target station period.

Figure 4 illustrates this idea. The shadowed areas indicate the period of the observed pollutant without missing values, whereas the white strips indicate the missing values. Based on the target station data, a minimum of 168 hours intervals without missing data were selected. The same selection periods were also applied to the neighbouring stations to maintain the consistency of the time frame between monitoring stations. After completing these steps, the target station will have no missing data. However, unlike the target station, there is a possibility that missing values exist in the neighbouring station parts. To overcome this issue, we filled the missing values with zeros.
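The selection of gap-free chunks can be sketched with a simple helper, under the assumption that missing values are encoded as NaN (the series below is hypothetical):

```python
import numpy as np

def unbroken_chunks(values, min_len=168):
    """Return (start, stop) index pairs of gap-free runs of at least min_len hours."""
    observed = ~np.isnan(values)
    chunks, start = [], None
    for i, ok in enumerate(observed):
        if ok and start is None:
            start = i                      # a new run begins
        elif not ok and start is not None:
            if i - start >= min_len:       # close the run, keep it if long enough
                chunks.append((start, i))
            start = None
    if start is not None and len(values) - start >= min_len:
        chunks.append((start, len(values)))
    return chunks

# Hypothetical target-station series: 400 valid hours, a gap, then 100 hours.
series = np.concatenate([np.full(400, 30.0), [np.nan] * 5, np.full(100, 28.0)])
chunks = unbroken_chunks(series)  # only the 400-hour run qualifies
```

The 100-hour tail is discarded because it is shorter than one week, matching the selection rule applied to the target station.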

To train the proposed model, we need pairs of input and target sets. Since the target station data contain no unknown values, the actual targets for all input sets can be provided. The perturbation procedure was carried out to reflect the missing values phenomena and train the proposed model. Some values in input sets were intentionally removed, and all deleted values were filled with zeros. In this scenario, the errors were generated in the correct dataset to evaluate the performance of the proposed imputation method [54]. Short-interval and long-interval consecutive missing patterns were applied to the input sets and let the model adjust its parameters to minimise the loss function.

For the short-interval perturbation procedure, different levels of missingness were applied to the input sets. Following the work conducted by Hadeed et al., four missing rates (i.e. \(20\%\), \(40\%\), \(60\%\) and \(80\%\) of missing rates) were set for the target station [25]. While the missing rate was varied for the target station, a fixed missing rate was applied to the neighbouring stations during the training and testing phases of the proposed model. The missing rate of \(20\%\) was considered as an error probability for the neighbouring stations [54]. Due to the initial zero imputation illustrated in Fig. 4, the neighbouring stations will have more than a \(20\%\) missing rate after the perturbation procedure.

For the long-interval perturbation procedure, a maximum of 500 hours of consecutive values was removed from some parts of the correct dataset. The successive missing periods were varied between 100 and 500 hours. This procedure was implemented only for the target station, and we let the neighbouring stations follow the short-interval process described previously. Figure 5 illustrates the perturbation patterns applied to the input sets. It can be seen that both short- and long-interval missing patterns were generated only for the data in the target station, and a minimum of \(20\%\) missing rate was applied to all neighbouring stations.
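The two perturbation schemes can be sketched together in NumPy; the missing rates follow the values quoted above, while the array sizes and gap position are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical clean inputs: 1000 hours x 4 stations (target is column 0).
clean = rng.uniform(10.0, 60.0, size=(1000, 4))
corrupted = clean.copy()

# Short-interval perturbation: a random missing rate per column
# (e.g. 40% for the target, 20% for each neighbour), filled with zeros.
for col, rate in enumerate([0.4, 0.2, 0.2, 0.2]):
    mask = rng.random(len(corrupted)) < rate
    corrupted[mask, col] = 0.0

# Long-interval perturbation: a consecutive gap in the target column only.
gap_start, gap_len = 300, 200            # e.g. a 200-hour outage
corrupted[gap_start:gap_start + gap_len, 0] = 0.0
```

The `(clean, corrupted)` pair then serves as the target and input sets for the denoising training described below.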

#### Model input construction

Input sets resulting from the perturbation process are ready to be normalised. Once the normalisation step is completed, the model input construction can be performed. The current missing value is predicted using the current initial imputation (i.e. we filled the current missing value with zero) along with the last 7 hours of data. As illustrated in Fig. 6, the dataset contains air pollutant data over sample rows from \(t = 1,\ldots , T\), and the rolling window size is *m*. The input sets for the model are obtained by shifting the pre-processed dataset: we take 8 hours of data and shift the features by one hour to get the next input set. This process is similar to a rolling-window scheme in which the increment between successive windows is one period.
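The rolling-window construction can be written compactly with NumPy's `sliding_window_view` (a sketch; the data are synthetic):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical pre-processed data: T hours x 4 stations, already normalised.
T = 48
data = np.random.default_rng(4).random((T, 4))

# 8-hour rolling windows shifted by one hour.
windows = sliding_window_view(data, window_shape=8, axis=0)  # (T-7, 4, 8)
inputs = np.moveaxis(windows, -1, 1)                         # (T-7, 8, 4)
```

Each `inputs[i]` is the \(8\times 4\) matrix covering hours `i` to `i + 7`, so consecutive windows overlap by seven hours, matching the one-period increment described above.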

The proposed model acts as a denoising tool as the missing values are intentionally generated from the complete dataset, and the given target is the complete dataset itself. Thus, our proposed model can be called a denoising autoencoder [31]. Given the noisy inputs, the autoencoder model will reconstruct these inputs. Based on this concept, the imputation of missing values that exist in the given data is performed.

### Proposed model

#### Convolutional autoencoder architecture

In this study, a convolutional autoencoder model is proposed to learn the missing patterns from the given corrupted input sets and the provided actual sets. The proposed model architecture is shown in Fig. 7. The autoencoder model accepts the collection of input sets in the form of \(8\times 4\) matrices. The individual input comprises four columns of pollutant data, a group of hourly targeted pollutant concentrations from four monitoring stations, and eight rows that indicate 8-hour of observed data. We purposely corrupted the input sets by deleting the actual values and filling them with zeros to train the model. The input columns represent spatial behaviour, and the rows capture temporal characteristics of air pollution features.

The autoencoder contains encoder and decoder parts, both based on one-dimensional convolution layers. The encoder is made up of convolution layers, while the decoder consists of transposed convolution layers. Because the model receives only eight values along the feature length, a small kernel is needed to extract more detailed input features. In this case, a kernel size of two is applied to all layers. The kernel size specifies the 1D convolution window operated in each layer and is used to extract the essential input features. No padding is applied in any layer.

As illustrated in Fig. 7, the size of the layers in the proposed model changes in both height and width. The width of each layer is controlled by the number of filters used in the previous layer. After various experiments, we determined the numbers of filters used in the proposed model, as presented in Table 5. The encoder's output filters decrease from 80 in the first layer to 10 in the fifth layer. From the latent space, the number of filters is expanded from 20 in the sixth layer to 80 in the ninth layer. Finally, the last layer is set with 8 output filters so that the reconstructed inputs match the original size (i.e. \(8\times 4\) matrices).
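The layer-length arithmetic implied by a kernel of two, stride one and no ("valid") padding can be checked with a few lines. Here we assume a ten-layer split of five convolution and five transposed-convolution layers, consistent with the description above:

```python
def conv1d_out(length, kernel=2, stride=1):
    # 'valid' (no padding) 1-D convolution output length
    return (length - kernel) // stride + 1

def conv1d_transpose_out(length, kernel=2, stride=1):
    # transposed convolution inverts the valid-conv length formula
    return (length - 1) * stride + kernel

enc_len = 8                         # 8 hourly rows per input set
for _ in range(5):                  # encoder: five convolution layers
    enc_len = conv1d_out(enc_len)

dec_len = enc_len
for _ in range(5):                  # decoder: five transposed-conv layers
    dec_len = conv1d_transpose_out(dec_len)

print(enc_len, dec_len)             # 3 8: code length 3, output length restored to 8
```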

#### Model configuration and training

The proposed autoencoder model was built using the TensorFlow CPU version [55] and written in Python. Several Python libraries were also utilised, such as Keras [56], pandas [52], NumPy [57], scikit-learn [58], Matplotlib [59] and seaborn [60]. In this work, we used a local machine powered by an \(\hbox {Intel}^{\textregistered }\) \(\hbox {Core}^{\mathrm{TM}}\) i7-8565U CPU (4 cores, 1.80 GHz), 8 GB of installed RAM and Windows 10 as the operating system.

After creating the model architecture, we configured the model for training. We selected Adam [61] as the optimiser with a learning rate of 0.001. The mean squared error (MSE) between the given target and the prediction was computed during training; MSE served both as the loss function to optimise the model and as the metric to judge its performance. About 75% of the data were used as training sets and the remainder as test sets. We used a batch size of 32 (i.e. the number of samples per gradient update) and implemented early stopping: training terminates when the loss does not improve for three consecutive epochs.

Figure 8 illustrates the training process of the denoising autoencoder. Define the encoder function \(f_{\theta }(\cdot )\) parameterised by \(\theta =\{\mathbf {W},\mathbf {b}\}\), and the decoder function \(g_{\phi }(\cdot )\) parameterised by \(\phi =\{\mathbf {W'},\mathbf {b'}\}\), where \(\mathbf {W}\), \(\mathbf {W'}\), \(\mathbf {b}\) and \(\mathbf {b'}\) are the weights and biases of the encoder and decoder, respectively. Thus, we define the encoder function as \(\mathbf {h}=f_{\theta }(\mathbf {x})\) and the decoder function as \(\mathbf {r}=g_{\phi }(\mathbf {h})\), where \(\mathbf {x}\) is the input, \(\mathbf {h}\) is the learned code representation, and \(\mathbf {r}\) is the reconstructed input. The perfect condition for model learning is \(g_{\phi }(f_{\theta }(\mathbf {x}))=\mathbf {x}\). However, the model cannot learn perfectly but instead tries to minimise the error between the actual input and the reconstructed input [62]. Then, over the training sets \(\mathbf {x}^{(i)}\), \(i=1,\ldots ,n\), the parameters \(\theta\) and \(\phi\) are optimised to minimise the average reconstruction error [31]:

$$\theta ^{*},\phi ^{*}=\mathop {\mathrm {arg\,min}}\limits _{\theta ,\phi }\frac{1}{n}\sum _{i=1}^{n}L\left( \mathbf {x}^{(i)},g_{\phi }\left( f_{\theta }\left( \mathbf {x}^{(i)}\right) \right) \right) ,$$

where *L* is the model loss function; a typical choice is the squared error \(L(\mathbf {x},\mathbf {r})=\Vert \mathbf {x}-\mathbf {r}\Vert ^2\). For the denoising autoencoder, we define \(\widetilde{\mathbf {x}}\) as the noisy version of \(\mathbf {x}\) [62], and the objective is rewritten as:

$$\theta ^{*},\phi ^{*}=\mathop {\mathrm {arg\,min}}\limits _{\theta ,\phi }\frac{1}{n}\sum _{i=1}^{n}L\left( \mathbf {x}^{(i)},g_{\phi }\left( f_{\theta }\left( \widetilde{\mathbf {x}}^{(i)}\right) \right) \right) ,$$

so the model is trained to reconstruct the clean input from its corrupted version.

#### Post-training outputs interpretation

After training is completed, the test sets are fed to the model. The model accepts \(8\times 4\) input sets and yields outputs of the same size. As the trained values are scaled into [0, 1], the outputs must be transformed back to their original scale.

After undoing the scaling of the model outputs, we determine a single prediction for each hour. As illustrated in Fig. 9, the autoencoder produces overlapping outputs for any given prediction period. We aggregate these by calculating the mean of all overlapping output sets to give a single point estimate. As the targeted results are located in the first column of each output set (the target station), the means need only be calculated over those first columns. These processes are presented systematically in Algorithm 1 in Appendix A.
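The overlap-averaging step can be sketched as follows, assuming `outputs[i]` holds the target-station column of the *i*-th output set (this is an illustration of the idea, not Algorithm 1 itself):

```python
import numpy as np

def aggregate_predictions(outputs, window=8):
    """Average overlapping autoencoder outputs into one series.
    outputs[i] covers hours i .. i+window-1; each hour's final estimate
    is the mean over all windows that cover it."""
    n_sets = len(outputs)
    T = n_sets + window - 1          # total hours covered
    sums = np.zeros(T)
    counts = np.zeros(T)
    for i, col in enumerate(outputs):
        sums[i:i + window] += col    # accumulate each window's values
        counts[i:i + window] += 1    # how many windows cover each hour
    return sums / counts

# three overlapping windows of a constant series -> constant estimates
outs = [np.full(8, 5.0), np.full(8, 5.0), np.full(8, 5.0)]
print(aggregate_predictions(outs))   # ten values, all 5.0
```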

### Model evaluation metrics

Several metrics are commonly used to evaluate model performance. In this study, the root mean square error (*RMSE*) and mean absolute error (*MAE*) are used, following [63]. Another widely used metric in machine learning studies is the coefficient of determination (\(R^2\) or *R*-squared). Chicco et al. [64] suggested using \(R^2\) for regression evaluation, as it is more informative for qualifying regression results. However, \(R^2\) has a limitation when the calculated score is negative: the model can then be arbitrarily bad, and the score no longer conveys how poorly the model performed [65].

Following the work of Ma et al. [19], we use the rate of improvement in *RMSE* (*RIR*) to measure the performance of our method against existing imputation techniques. The *RIR* is calculated as:

$$RIR=\frac{RMSE^{A}-RMSE^{B}}{RMSE^{A}}\times 100\%,$$

where \(RMSE^A\) denotes the *RMSE* value of the benchmarked method and \(RMSE^B\) is the *RMSE* value of our proposed method.
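Assuming the standard definition \(RIR=(RMSE^{A}-RMSE^{B})/RMSE^{A}\times 100\%\), the metric is a one-liner:

```python
def rir(rmse_benchmark, rmse_proposed):
    """Rate of improvement in RMSE: positive when the proposed method
    beats the benchmark, negative when the benchmark is better."""
    return (rmse_benchmark - rmse_proposed) / rmse_benchmark * 100.0

print(rir(10.0, 8.0))   # 20.0  -> proposed method is 20% better
print(rir(8.0, 10.0))   # -25.0 -> benchmark is better
```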

In addition to *RMSE*, *MAE*, \(R^2\) and *RIR*, visual comparisons of actual and imputed data in the form of line, bar or box plots are also presented to describe the model performance more intuitively.

## Results and discussion

### Distribution of missing periods

As we develop the proposed model for both short-interval and long-interval consecutive missing patterns, it is essential to know the nature of these patterns. We counted the duration of every missing period in the original dataset, with results shown in Fig. 10. The figure visualises the distribution of missing-data durations using continuous probability density curves. As shown in Fig. 10, most missing durations in the London dataset are less than 400 hours, while the Delhi and Beijing datasets exhibit shorter missing durations of less than 200 hours. The peaks of the probability density curves in all datasets occur within the first 100 hours. We conclude that the missing data in all datasets consist mainly of short-interval missing patterns whose durations are less than approximately one week.
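Counting the durations of consecutive missing periods can be done with a short pandas sketch (the run-labelling trick below is ours, not the paper's code):

```python
import numpy as np
import pandas as pd

def missing_durations(series):
    """Lengths (in samples) of each consecutive run of NaNs.
    For hourly data, a length of 168 corresponds to about one week."""
    isna = series.isna()
    # label each contiguous run: the label increments whenever isna flips
    run_id = (isna != isna.shift()).cumsum()
    runs = isna.groupby(run_id).agg(['sum', 'first'])
    # keep only the runs that are runs of NaNs; 'sum' is the run length
    return runs.loc[runs['first'], 'sum'].tolist()

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0, np.nan, np.nan, np.nan])
print(missing_durations(s))  # [2, 1, 3]
```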

### Evaluation of temporal characteristics

#### Autocorrelation coefficients of pollutant data

The temporal relationship is evaluated to determine the length of the input series to be fed to the model. Temporal behaviour at each monitoring station is assessed using Pearson autocorrelation coefficients: we computed the correlation between the actual time series and its lag-*k* shifted copy, where *k* is an integer between 1 and 11. For instance, \(k = 1\) means the time series is shifted backwards by 1 hour.
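The lag-*k* autocorrelation described above maps directly onto `pandas.Series.autocorr`; the synthetic daily-cycle series below is purely illustrative:

```python
import numpy as np
import pandas as pd

# Pearson autocorrelation between a series and its lag-k copy, k = 1..11
hours = np.arange(200)
series = pd.Series(np.sin(2 * np.pi * hours / 24))  # synthetic 24-hour cycle

coeffs = [series.autocorr(lag=k) for k in range(1, 12)]
print([round(c, 2) for c in coeffs])  # decays from near 1 towards -1
```

For a pure 24-hour cycle, the lag-*k* coefficient is close to \(\cos (2\pi k/24)\): near 1 at short lags, roughly 0 at lag 6, and negative beyond.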

The autocorrelation coefficients for each city and pollutant are reported in Fig. 11. For \(k \le 11\) hours, the autocorrelation coefficients vary between 1 (at \(k = 0\)) and 0. For the London dataset, all monitoring stations measuring \(\hbox {NO}_{2}\) have similar autocorrelation slopes, ranging from 1 to about 0.2. Monitoring station \(S^1\) has the flattest slope, indicating that \(S^1\) has the strongest relationship among the lagged hours of \(\hbox {NO}_{2}\) compared to the other stations. For \(\hbox {PM}_{10}\) in the same dataset, the autocorrelation coefficients of stations \(S^2\), \(S^3\) and \(S^6\) plunge to about 0.2 within the first six lagged hours, whereas the other stations' coefficients remain above 0.5. Among these, \(S^2\) appears to have the weakest temporal dependency.

For the Delhi air quality dataset, the \(\hbox {PM}_{2.5}\) autocorrelation slopes are relatively flat, ending between 0.55 and 0.65 at \(k=11\). The autocorrelation coefficients for \(\hbox {NO}_{2}\) between monitoring stations in Delhi decay more diversely, especially from \(k = 3\) to \(k=11\). Stations \(S^1\) and \(S^8\) have exceptional slopes, in which the coefficients tend to increase after \(k = 7\). Less varied autocorrelation slopes are shown in the Beijing dataset for both CO and \(\hbox {O}_{3}\); however, the \(\hbox {O}_{3}\) coefficients decrease more rapidly than those of CO.

#### Temporal window size determination

We determine the temporal input length (i.e. the number of rows of the input set) for the autoencoder model based on the coefficients shown in Fig. 11. A simple model is introduced as a base model and used to evaluate the temporal and spatial dependencies: temporal evaluation determines the number of input rows, whereas spatial evaluation defines the number of input columns. Other model architectures are then derived from the base model until we finally settled on the model whose properties are presented in Fig. 7 and Table 5.

Figure 12 shows the proposed base model. It is based on the autoencoder architecture, similar to the final model but with shallower hidden layers. For simplicity, the base architecture in this study is written as \(L^{1}40 - L^{2}30 - L^{3}20 - L^{4}30 - L^{5}40 - L^{6}x\), where the integer value of *x* depends on the intended number of output columns. For the temporal evaluation, we set \(x = 4\), meaning that the target station is used together with three neighbouring stations. The term \(L^{1}40\) means that the first layer has 40 output filters; this layer yields 40 columns placed in the second layer, as indicated in Fig. 12. The sixth layer (i.e. \(L^{6}x\) with \(x = 4\)) has four output filters and forms \(n\times 4\) output sets, where *n* depends on the input length, kernel and filter size. We set the kernel size to 2 and the stride to 1 for all layers, with no padding applied.

In this experiment, 60% of the total observations are used as training sets, applied for each station and pollutant. The test data are selected from an unbroken time-series segment with a minimum of 400 hours of consecutive observed values. The target station is corrupted with a missing rate of 40%, whereas the neighbouring data are lightly corrupted with a missing rate of 20%. To obtain less biased results, we implemented fivefold cross-validation. An example of the temporal evaluation is given in Table 6, which shows the results obtained from the London dataset with \(\hbox {NO}_{2}\) as the target pollutant.

As shown in Table 6, the lowest average *RMSE* values (in bold) are mostly obtained with input sets of lag-7 hours, meaning that the model accepts eight temporal samples as the input length. Setting \(k=7\), however, does not improve the *RMSE* values dramatically. For example, the *RMSE* at \(S^5\) equals 6.28 \(\upmu \hbox{g}/\hbox{m}^3\) at \(k=7\), only about 4% better than the worst result (at \(k=10\)). In general, adding more temporal data does not improve model performance because the temporal correlations weaken at longer lags, and weakly correlated lags contribute less essential features to the autoencoder. We therefore settled on a window size of 8 time steps for our model.

### Evaluation of spatial characteristics

#### Correlation coefficients of pollutant data

While autocorrelation defines the temporal relations, Pearson's correlation among stations determines the spatial characteristics. Unlike the autocorrelation coefficient, which correlates a series with its shifted self, Pearson's correlation coefficients for the spatial evaluation are computed between two stations. Pairs of monitoring stations are created per city and pollutant, and the correlation coefficients are assessed for all pairs. Unlike in the temporal evaluation, no time lag is applied to the station data.

For example, we report the obtained correlation coefficients for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) in the London air quality data in Tables 7 and 8, respectively. The same procedure was also applied to the Delhi and Beijing datasets. The correlation coefficients reflect the linear relationship between station pairs and can be calculated using Eq. 1. As reported in Table 7, the correlation coefficients among monitoring stations measuring \(\hbox {NO}_{2}\) fall between 0.49 and 1.00. Paired stations such as \(S^1-S^9\), \(S^4-S^6\), \(S^8-S^9\) and \(S^5-S^{10}\) have strong correlations for \(\hbox {NO}_{2}\), whereas the pairs \(S^3-S^5\), \(S^3-S^7\) and \(S^3-S^{10}\) have weaker correlations. The correlation coefficients between paired stations measuring \(\hbox {PM}_{10}\), presented in Table 8, are more diverse, ranging from 0.27 to 1.00. In all datasets, no negative coefficient was found.

We carefully selected the three neighbouring stations with the strongest correlation coefficients to the target station. Sorting from largest to smallest coefficients, we report the selected neighbouring stations for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) in Table 9.
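Selecting the three most strongly correlated neighbours can be sketched with `DataFrame.corr`; the station columns below are synthetic and the names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=500)
# synthetic stations: S1 is the target; S2..S4 share decreasing amounts
# of signal with it, S5 is independent
df = pd.DataFrame({
    'S1': base,
    'S2': base + rng.normal(scale=0.3, size=500),
    'S3': base + rng.normal(scale=1.0, size=500),
    'S4': base + rng.normal(scale=3.0, size=500),
    'S5': rng.normal(size=500),
})

# Pearson correlations of every station with the target, strongest first
neighbours = df.corr()['S1'].drop('S1').nlargest(3)
print(neighbours.index.tolist())
```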

#### Selecting the number of neighbouring stations

This section discusses the procedure for selecting the number of neighbouring stations used to form the input sets for the autoencoder model. In this study, the spatial evaluation determines the number of involved neighbouring stations. Varying this number affects the model input width (i.e. the number of columns): the columns of the input set consist of the target station data plus the neighbouring station data. As shown in Fig. 12, the width of the input set is evaluated from 3 to 6 monitoring stations. We demonstrate the results of neighbourhood selection in Delhi with \(\hbox {PM}_{2.5}\) as the target pollutant. Fivefold cross-validation is also implemented in this step. We keep the 8-time-step window determined in Sect. 3.2.2 and adjust only the number of input columns. Table 10 shows the effect of involving different numbers of neighbouring stations on the final predictions.

Using three neighbouring stations along with the target station may improve the performance significantly. For example, the average *RMSE* at \(S^1\) equals 56.67 \(\upmu \hbox{g}/\hbox{m}^3\), about 3.8 \(\upmu \hbox{g}/\hbox{m}^3\) better than the worst result (five neighbouring stations). At \(S^3\), five neighbouring stations produce a slightly better average *RMSE* than the other settings. In general, however, adding further neighbours does not improve performance and increases the computation load; increasing the number of columns does not help the model learn the essential features of the input sets. Similarly, decreasing the number of neighbours to two degrades performance. Thus, we use three neighbouring stations along with the target station in our model.

### Model architecture evaluation

This section verifies the final model architecture proposed in this study. From the base model, several alternative autoencoder architectures are derived by expanding the base model layers and adjusting the numbers of output filters. The kernel and stride are kept identical to the base model, and no padding is applied in any of the proposed models.

This section demonstrates the results obtained from the Beijing air quality data with CO as the target pollutant. As presented in Table 11, we provide six different autoencoder architectures, labelled *M*1, *M*2, ..., *M*6, with kernel size 2, stride 1 and no padding in all layers. The base model used to determine the spatiotemporal characteristics is identified as *M*1. This experiment applies 40% and 20% missing rates to the target and neighbouring stations, respectively, with fivefold cross-validation. Based on the spatiotemporal evaluation, the input sets are fixed at a size of \(8\times 4\) in this model selection step.

Based on the final prediction results obtained from each model, *M*6 yields the most accurate imputation results, as shown in Table 12. Out of 10 monitoring stations, *M*6 yields the best prediction at six stations. For example, *M*6 predicts the missing data for \(S^8\) with an *RMSE* of 240.88 \(\upmu \hbox{g}/\hbox{m}^3\), about 30% better than the base model's *RMSE* of 346.45 \(\upmu \hbox{g}/\hbox{m}^3\). In this study, deeper model architectures give better predictions: in most cases, the ten-layered models outperform the six- and eight-layered models. We avoid even deeper architectures because the length of the latent space (code) would become very small.

### Imputation performance

This study divides imputation performance into two categories: short-interval imputation with missing-rate variations and long-interval consecutive missing data. In this section, we cannot discuss the imputations for every station; instead, we highlight the essential issues related to our proposed imputation method.

#### Short-interval imputation

This study defines the term "short-interval" as missing periods generated by removing values from the actual data at a specific missing rate. Which values are removed is determined by an initial random state, which can be set in the program. We intentionally deleted actual data at four different missing rates (i.e. 20%, 40%, 60% and 80%). Figure 13 shows an example of the test-set missing-pattern variation at station \(S^3\) in the London dataset: 648 hourly samples of \(\hbox {NO}_{2}\), collected from 20-Feb-2020 13:00:00 to 20-Mar-2020 00:00:00. The white stripes indicate missing values, which we fill with zero. While the missing rate at the target station is varied, the missing rate at the neighbouring stations is fixed at 20%.

Table 13 shows the selected monitoring stations as representatives of short-interval imputation. As presented in the table, we selected two monitoring stations for each city and covered all pollutants in each dataset. Thus, there are 12 experiments in total. Table 14 shows the imputation results in correspondence to the experiment numbers presented in Table 13. The imputation performances are evaluated using three different error metrics, i.e. *RMSE*, *MAE* and \(R^2\).

Among the monitoring stations, our method is least effective at imputing the missing \(\hbox {NO}_{2}\) values in Delhi; we defer that discussion to Sect. 3.6. In general, lower missingness levels yield lower *RMSE*/*MAE* values and higher \(R^2\) scores. Due to the physical nature of each pollutant, the *RMSE*/*MAE* values can vary considerably between pollutants (for some pollutants they are considerably higher than for \(\hbox {PM}_{10}\)), so the \(R^2\) score is introduced to compare performance more intuitively. As shown in Table 14, a 20% missing rate results in \(R^2\) scores higher than 0.8 at all target stations, ranging from 0.80 to 0.95; our proposed method yields satisfying results at this level of missingness. At 40% and 60% missing levels, the model maintains its performance with \(R^2\) scores between 0.72 and 0.94. At 80%, the larger imputation errors lower the \(R^2\) scores, which range from 0.64 (experiment no. 1) to 0.93 (experiment no. 2). For the selected test period, predicting the missing \(\hbox {PM}_{2.5}\) values at \(S^2\) in Delhi (experiment no. 6) gives the best imputation, with \(R^2\) scores higher than 0.9 at all levels of missingness.

With the help of the existing neighbouring stations, and though prediction errors are inevitable, the proposed autoencoder can learn the input features and fill the missing values effectively. Even when the target station data are severely corrupted, the proposed model achieves the desired results, as shown for most monitoring stations in Table 14. We conclude that our method produces satisfactory accuracy for short-interval missing-data imputation.

#### Long-interval consecutive imputation

Unlike the short-interval scenario, which generates missing values based on a specific random state, the long-interval consecutive scenario removes all data at the target station for a specific period. To this end, we set 400 hours as the minimum missing period. Figure 14 shows a test-set pattern of long-interval missing values applied to \(S^8\) (Nongzhanguan) in the Beijing dataset. The set consists of 514 hourly samples, taken from 23-Sep-2016 05:00:00 to 14-Oct-2016 14:00:00. The values at the target station are entirely missing and filled with zeros, while a 20% missingness level is applied to all neighbouring stations. As no values can be obtained from the target station, the autoencoder predicts the missing values entirely from the existing adjacent data.
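A minimal sketch of the long-interval perturbation, assuming the gap is a single contiguous block filled with zeros as described above (the function and parameter names are ours):

```python
import numpy as np

def remove_interval(series, start, length):
    """Blank out `length` consecutive hours from `start`, emulating a
    long-interval consecutive gap at the target station; the paper
    fills such gaps with zeros before feeding the model."""
    out = series.copy()
    out[start:start + length] = 0.0
    mask = np.zeros(len(series), dtype=bool)
    mask[start:start + length] = True
    return out, mask

s = np.arange(1, 601, dtype=float)               # 600 hourly samples
corrupted, mask = remove_interval(s, start=100, length=400)
print(int(mask.sum()))                           # 400 consecutive hours removed
```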

There are six experiments conducted to represent the long-interval consecutive imputation scenarios, as shown in Table 15. We select one station in each city, and each station covers all pollutant types: stations \(S^5\), \(S^6\) and \(S^8\) represent the London, Delhi and Beijing datasets, respectively. Table 15 also shows the error metrics resulting from the long-interval imputation for the specific missing periods. For a minimum of 400 hours (about 17 days) of missing data, our model imputes the missing values with very satisfying results, some yielding \(R^2\) scores of 0.90 and higher. However, among the experiments, the \(S^6\) station in Delhi measuring \(\hbox {NO}_{2}\) produces the lowest \(R^2\) score. The same trend appears in the short-interval imputation, where predicting \(\hbox {NO}_{2}\) in Delhi consistently yielded the lowest performance. We observed that stations with low correlation coefficients may degrade imputation performance; we discuss this issue separately in Sect. 3.6.

Figure 15 shows the plots of actual against imputed values for the experiments presented in Table 15. Following [66, 67], the plots also show a 95% confidence interval, obtained by adding and subtracting two times the *RMSE* from the imputed values. Using the *RMSE* to form the confidence interval gives a better summary than the standard deviation and directly helps assess the uncertainty of the imputed values [65]. From Fig. 15, we observe that the imputed values track the dynamics of the actual values: the current neighbour values help the autoencoder recognise the missing values at the target station effectively. Only about 5% of the actual values are expected to fall outside the shaded interval areas.
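The ±2·*RMSE* band and its empirical coverage can be checked with a short sketch; the Gaussian toy data below stand in for the actual and imputed series and are not the paper's results:

```python
import numpy as np

def interval_coverage(actual, imputed, rmse):
    """Fraction of actual values inside imputed +/- 2*RMSE, the band
    plotted as an approximate 95% confidence interval."""
    lower, upper = imputed - 2 * rmse, imputed + 2 * rmse
    return np.mean((actual >= lower) & (actual <= upper))

rng = np.random.default_rng(0)
actual = rng.normal(size=10_000)
imputed = actual + rng.normal(scale=0.5, size=10_000)   # noisy estimates
rmse = np.sqrt(np.mean((actual - imputed) ** 2))
print(round(interval_coverage(actual, imputed, rmse), 2))  # close to 0.95
```

With Gaussian errors, the ±2·*RMSE* band covers about 95.4% of values, which is why the paper treats it as an approximate 95% interval.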

### Effect of correlation level

We observed that the level of the correlation coefficients among paired stations can affect the performance of our proposed method. We now focus on the stations measuring \(\hbox {NO}_{2}\) in Delhi, which contribute to poor estimations in both short- and long-interval imputations. The Delhi dataset's correlation coefficients for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\) are shown in Tables 16 and 17. The minimum coefficient for \(\hbox {NO}_{2}\) is 0.01, obtained from the pair \(S^3-S^6\); the correlation between \(S^3\) and \(S^{10}\) is even negative. Excluding self-pairings, the maximum correlation coefficient for \(\hbox {NO}_{2}\) is only 0.65, calculated from \(S^2-S^6\). In the same city, the monitoring stations measuring \(\hbox {PM}_{2.5}\) yield much stronger correlation coefficients: the minimum coefficient is 0.67, computed from the pairs \(S^3-S^8\) and \(S^3-S^{10}\), whereas the pair \(S^1-S^2\) contributes the maximum coefficient of 0.90.

Very low correlation coefficients among stations result in highly biased imputed values. We studied this phenomenon through various experiments, some of which are shown in Fig. 16. The figure depicts scatter plots of actual against imputed \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\) at \(S^5\) in the Delhi dataset. The experiments use a short-interval missing scenario with 20% and 40% missing rates. For \(\hbox {NO}_{2}\), the test period runs from 06-Apr-2020 04:00:00 to 29-Apr-2020 23:00:00; for \(\hbox {PM}_{2.5}\), from 22-Feb-2020 19:00:00 to 11-Mar-2020 14:00:00. As shown in the figure, the imputation results for the two pollutants differ considerably, even though the experiments are conducted at the same monitoring station: while the \(\hbox {PM}_{2.5}\) imputations lie relatively close to the diagonal line, the \(\hbox {NO}_{2}\) estimates are much more scattered.

Station \(S^5\) and three neighbouring stations (\(S^7\), \(S^8\) and \(S^2\)) form the input set for the \(\hbox {NO}_{2}\) pollutant. The \(S^5-S^7\), \(S^5-S^8\) and \(S^5-S^2\) pairs have low correlation coefficients of 0.38, 0.37 and 0.35, respectively. For \(\hbox {PM}_{2.5}\), the input sets are formed by stations \(S^5\), \(S^2\), \(S^{10}\) and \(S^1\), with much stronger correlation coefficients of 0.90, 0.88 and 0.86 for \(S^5-S^2\), \(S^5-S^{10}\) and \(S^5-S^1\), respectively. Low correlations make the input sets look more randomised, so the neighbouring station data do not contribute enough knowledge to the model. Figure 17 illustrates this issue more intuitively: it shows the first input set fed to the model with a 40% missing rate for both \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\). As we can see from the figure, the reconstructed \(\hbox {NO}_{2}\) input is less accurate than the reconstructed \(\hbox {PM}_{2.5}\) input: the weak correlations cause significant differences between the column values in the input set, making it difficult for the model to estimate the missing parts.

### Comparison with other methods

This section verifies the effectiveness of our proposed model in comparison with existing methods. Both univariate and multivariate imputation methods are considered. Univariate imputation estimates missing values based only on the existing values in the same feature dimension, whereas multivariate imputation utilises the non-missing data across all feature dimensions. The selected univariate imputations are most-frequent, median and mean imputation. Four estimators are used for multivariate imputation: Bayesian ridge, decision tree, extra-trees and *k*-nearest neighbours.
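The benchmarked univariate and multivariate methods have direct scikit-learn counterparts (`SimpleImputer`, `IterativeImputer`, `KNNImputer`); the sketch below runs them on synthetic correlated data, not the paper's datasets:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1:] += X[:, [0]]                            # correlated "stations"
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan    # 20% missing at random

# univariate: each column is imputed from its own values only
uni = SimpleImputer(strategy='mean').fit_transform(X_missing)
# multivariate: the other columns help estimate the missing entries
multi = IterativeImputer(estimator=BayesianRidge()).fit_transform(X_missing)
knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

for name, Xi in [('mean', uni), ('bayes-ridge', multi), ('kNN', knn)]:
    print(f'{name}: RMSE {np.sqrt(np.mean((Xi - X) ** 2)):.3f}')
```

On correlated columns the multivariate imputers should recover the missing entries far better than a column mean, mirroring the comparison in Fig. 18.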

We demonstrated the effectiveness of our proposed model against the other methods for all monitoring stations. In total, we conducted 60 experiments covering the different cities, stations and pollutants in our datasets. For the London dataset, the training data for \(\hbox {NO}_{2}\) and \(\hbox {PM}_{10}\) run from January 2018 to around October 2019, whereas the test sets are taken from several unbroken segments between around November 2019 and January 2021. We combined the short- and long-interval perturbation procedures for the training and test sets; the perturbation step removes about 45% of the target training set and 50% of the test set. To obtain less biased results, we implemented fivefold cross-validation.

For all pollutant data in the Delhi dataset (i.e. \(\hbox {NO}_{2}\) and \(\hbox {PM}_{2.5}\)), the training period is from February 2018 to mid-July 2019, whereas the test period starts in July 2019 and ends in July 2020. The same perturbation procedures as for the London dataset are applied to the Delhi data, resulting in missing rates of about 45% and 60% for the training and test sets at the target station. The CO and \(\hbox {O}_{3}\) pollutants at the Beijing monitoring stations are treated in the same way: the training data are selected from March 2013 to around September 2015, whereas the test data are chosen from September 2015 to February 2017. The missing rates at the target station for the training and test steps are maintained at 45% and 50%, respectively. Bar charts visualising the *RMSE* values obtained from each method are shown in Fig. 18.

Figure 18 presents the performance of our proposed model and seven commonly used imputation methods, with the following abbreviations: *Most* (most frequent imputation), *Med* (median imputation), *Mean* (mean imputation), *DecT* (decision tree regressor), *ExT* (extra-trees regressor), *KNN* (*k*-nearest neighbours regressor), *BaR* (Bayesian ridge regressor) and *Aut* (proposed autoencoder). The proposed autoencoder bars are indicated with black-filled areas.

As we can see from the figure, the univariate imputations based on statistical properties (most frequent, median and mean) yield the most inaccurate results. Compared to the univariate techniques, the multivariate imputation techniques return significantly lower imputation errors. Our proposed method outperforms the other methods for all stations and pollutants except the Delhi monitoring stations measuring \(\hbox {NO}_{2}\), where other methods yield slightly better performance at three of the ten stations (\(S^3\), \(S^5\) and \(S^9\)). As discussed in the previous section, weak correlations among stations lower the performance of our proposed method.

Figure 19 shows the rate of improvement in *RMSE* (*RIR*). Positive *RIR* values indicate that our proposed model outperforms another method; negative values imply the other model performs better. Compared to the most frequent, median and mean imputations, our autoencoder significantly improves the *RMSE* values, by 50 to 80 per cent in most cases. Our method also contributes positive *RIR* values against the Bayesian ridge, decision tree, extra-trees and *k*-nearest neighbours imputations, mostly improving between 10% and 50%. For the Delhi stations measuring \(\hbox {NO}_{2}\), our method yields six negative *RIR* values, half of which occur at station \(S^5\): there, the mean, median and kNN imputations perform marginally better than our model, by 6.46%, 0.87% and 1.15% *RIR*, respectively. Of the six negative *RIR* values, half are caused by median imputation, which contributes the lowest *RMSE* at station \(S^9\), about 17% better than our proposed model.

To obtain an overall picture, we calculated the average *RIR* values of each imputation method over all stations and pollutants; the results are summarised in Table 18. Our model outperforms the univariate imputations, improving the average *RIR* by around 50 to 65 per cent. Against the multivariate imputations, the average improvement ranges from about 20 to 40%.

## Conclusions

Missing values are common in real-world collected data. Every measurement system can face this issue for many reasons, and some systems may lose critical data. Missing data can bias study interpretations and affect the functioning of air quality-related public services, so a strategy to overcome it is needed in the form of an imputation method. Moreover, understanding the spatiotemporal characteristics of air pollutant data can improve the robustness of air quality missing data imputation.

This study addressed the challenge of designing a suitable method for imputing missing air quality data. Inspired by the ability of the denoising autoencoder to reconstruct corrupted data, we proposed an imputation method that exploits both temporal and spatial data to improve imputation accuracy. We determined an ideal temporal window size of 8 time steps and a spatial combination of 3 neighbouring stations, providing an \(8\times 4\) input set to the model. Overlapping input sets are aggregated to obtain a single prediction at a specific time. Two imputation scenarios were conducted: short-interval imputation, in which several levels of missingness were introduced (20%, 40%, 60% and 80%), and long-interval consecutive imputation, in which all data in a specific period were removed.
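The \(8\times 4\) input sets described above can be built by sliding a window over a (time, stations) matrix. A minimal NumPy sketch, where the target station is assumed to occupy column 0 and the readings are illustrative:

```python
import numpy as np

def build_windows(series: np.ndarray, window: int = 8) -> np.ndarray:
    """Slide a `window`-step window over a (time, stations) matrix.

    Column 0 is assumed to be the target station and columns 1..3 its
    three neighbours, giving (n_windows, 8, 4) input sets.
    """
    n = series.shape[0] - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

# Illustrative data: 100 hourly readings from a target and 3 neighbours.
readings = np.random.default_rng(1).normal(size=(100, 4))
X = build_windows(readings)
print(X.shape)  # (93, 8, 4)
```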

Results show that our proposed method gives satisfactory imputation results with \(R^2\ge 0.6\), even when the data in the target station are entirely missing. Imputation performance degrades when neighbouring stations are weakly correlated: low correlation coefficients produce more irregular input values, making our autoencoder model unable to recover the noisy inputs. Compared to univariate imputation techniques, our model improves the average *RIR* by up to 65%, and by 20-40% against the multivariate imputation techniques.

## Change history

### 29 June 2022

A Correction to this paper has been published: https://doi.org/10.1007/s00521-022-07545-2

## References

Ameer S et al (2019) Comparative analysis of machine learning techniques for predicting air quality in smart cities. IEEE Access 7:128325–128338. https://doi.org/10.1109/ACCESS.2019.2925082

Alsaber AR, Pan J, Al-Hurban A (2021) Handling complex missing data using random forest approach for an air quality monitoring dataset: a case study of Kuwait environmental data (2012 to 2018). Int J Environ Res Public Health 18(3):1333. https://doi.org/10.3390/ijerph18031333

Ma J et al (2020) Air quality prediction at new stations using spatially transferred bi-directional long short-term memory network. Sci Total Environ 705:135771. https://doi.org/10.1016/j.scitotenv.2019.135771

Zhang Z, Zhang G, Su B (2021) The spatial impacts of air pollution and socio-economic status on public health: empirical evidence from China. Socio-Econ Plan Sci 101167. https://doi.org/10.1016/j.seps.2021.101167

Guo Y et al (2016) The association between lung cancer incidence and ambient air pollution in China: a spatiotemporal analysis. Environ Res 144:60–65. https://doi.org/10.1016/j.envres.2015.11.004

Hamra GB et al (2014) Outdoor particulate matter exposure and lung cancer: a systematic review and meta-analysis. Environ Health Perspect 122(9):906–911. https://doi.org/10.1289/ehp/1408092

Chen Q et al (2021) Air pollution and cardiovascular mortality in Nanjing, China: evidence highlighting the roles of cumulative exposure and mortality displacement. Chemosphere 265. https://doi.org/10.1016/j.chemosphere.2020.129035

Saygin H, Mercan Y, Yorulmaz F (2021) The association between air pollution parameters and emergency department visits and hospitalizations due to cardiovascular and respiratory diseases: a time-series analysis. Int Arch Occup Environ Health. https://doi.org/10.1007/s00420-021-01769-w

Ma Y et al (2017) Short-term effects of air pollution on daily hospital admissions for cardiovascular diseases in western China. Environ Sci Pollut Res 24(16):14071–14079. https://doi.org/10.1007/s11356-017-8971-z

Delgado-Saborit JM et al (2021) A critical review of the epidemiological evidence of effects of air pollution on dementia, cognitive function and cognitive decline in adult population. Sci Total Environ 757:143734. https://doi.org/10.1016/j.scitotenv.2020.143734

Li C, Managi S (2022) Spatial variability of the relationship between air pollution and well-being. Sustain Cities Soc 76:103447. https://doi.org/10.1016/j.scs.2021.103447

Sivarethinamohan R et al. (2021) Impact of air pollution in health and socio-economic aspects: review on future approach. Mater. Today: Proceed 37: 2725–2729. https://doi.org/10.1016/j.matpr.2020.08.540, international Conference on Newer Trends and Innovation in Mechanical Engineering: Materials Science

Health Effects Institute (2019) State of global air 2019 special report. Health Effects Institute

Zhou X-H (2020) Challenges and strategies in analysis of missing data. Biostatistics & Epidemiol 4(1):15–23. https://doi.org/10.1080/24709360.2018.1469810

Yu Y, Yu JJQ, Li VOK, Lam JCK (2020) A novel interpolation-svt approach for recovering missing low-rank air quality data. IEEE Access 8:74291–74305. https://doi.org/10.1109/ACCESS.2020.2988684

Austin PC, White IR, Lee DS, van Buuren S (2021) Missing data in clinical research: a tutorial on multiple imputation. Can J Cardiol 37(9):1322–1331. https://doi.org/10.1016/j.cjca.2020.11.010

Ma J et al (2020) A bi-directional missing data imputation scheme based on LSTM and transfer learning for building energy data. Energy and Build 216. https://doi.org/10.1016/j.enbuild.2020.109941

Laña I, Olabarrieta II, Vélez M, Ser JD (2018) On the imputation of missing data for road traffic forecasting: new insights and novel techniques. Trans Res Part C: Emerg Technol 90:18–33. https://doi.org/10.1016/j.trc.2018.02.021

Ma J et al (2020) Transfer learning for long-interval consecutive missing values imputation without external features in air pollution time series. Adv Eng Inform 44:101092. https://doi.org/10.1016/j.aei.2020.101092

Pena M, Ortega P, Orellana M (2019) A novel imputation method for missing values in air pollutant time series data. In: IEEE latin American conference on computational intelligence (LA-CCI). https://doi.org/10.1109/LA-CCI47412.2019.9037053

Moshenberg S, Lerner U, Fishbain B (2015) Spectral methods for imputation of missing air quality data. Environ Syst Res 4(1):26. https://doi.org/10.1186/s40068-015-0052-z

Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592. https://doi.org/10.1093/biomet/63.3.581

Gómez-Carracedo M, Andrade J, López-Mahía P, Muniategui S, Prada D (2014) A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets. Chemom Intell Lab Syst 134:23–33. https://doi.org/10.1016/j.chemolab.2014.02.007

Junger W, Ponce de Leon A (2015) Imputation of missing data in time series for air pollutants. Atmos Environ 102:96–104. https://doi.org/10.1016/j.atmosenv.2014.11.049

Hadeed SJ, O’Rourke MK, Burgess JL, Harris RB, Canales RA (2020) Imputation methods for addressing missing data in short-term monitoring of air pollutants. Sci Total Environ 730:139140. https://doi.org/10.1016/j.scitotenv.2020.139140

Donders ART, van der Heijden GJ, Stijnen T, Moons KG (2006) Review: A gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091. https://doi.org/10.1016/j.jclinepi.2006.01.014

Graham JW (2009) Missing data analysis: Making it work in the real world. Annu Rev Psychol 60(1):549–576. https://doi.org/10.1146/annurev.psych.58.110405.085530

Plaia A, Bondì A (2006) Single imputation method of missing values in environmental pollution data sets. Atmos Environ 40(38):7316–7330. https://doi.org/10.1016/j.atmosenv.2006.06.040

Zhou X, Liu X, Lan G, Wu J (2021) Federated conditional generative adversarial nets imputation method for air quality missing data. Knowl-Based Syst 228:107261. https://doi.org/10.1016/j.knosys.2021.107261

Zhang Y-F, Thorburn PJ, Xiang W, Fitch P (2019) SSIM: a deep learning approach for recovering missing time series sensor data. IEEE Internet Things J 6(4):6618–6628. https://doi.org/10.1109/JIOT.2019.2909038

Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML'08). https://doi.org/10.1145/1390156.1390294

Saleh Ahmed A, El-Behaidy WH, Youssif AA (2021) Medical image denoising system based on stacked convolutional autoencoder for enhancing 2-dimensional gel electrophoresis noise reduction. Biomed Signal Process Control 69:102842. https://doi.org/10.1016/j.bspc.2021.102842

Juneja M et al (2021) Denoising of magnetic resonance imaging using bayes shrinkage based fused wavelet transform and autoencoder based deep learning approach. Biomed Signal Process Control 69:102844. https://doi.org/10.1016/j.bspc.2021.102844

Fang Z et al (2018) Laser stripe image denoising using convolutional autoencoder. Results in Phys 11:96–104. https://doi.org/10.1016/j.rinp.2018.08.023

Bajaj K, Singh DK, Ansari MA (2020) Autoencoders based deep learner for image denoising. Procedia Comput Sci 171: 1535–1541. https://doi.org/10.1016/j.procs.2020.04.164, third International Conference on Computing and Network Communications (CoCoNet’19)

Dasan E, Panneerselvam I (2021) A novel dimensionality reduction approach for ecg signal via convolutional denoising autoencoder with lstm. Biomed Signal Process Control 63:102225. https://doi.org/10.1016/j.bspc.2020.102225

Nagar S, Kumar A, Swamy M (2021) Orthogonal features-based eeg signal denoising using fractionally compressed autoencoder. Signal Process 188:108225. https://doi.org/10.1016/j.sigpro.2021.108225

Zhu H, Cheng J, Zhang C, Wu J, Shao X (2020) Stacked pruning sparse denoising autoencoder based intelligent fault diagnosis of rolling bearings. Appl Soft Comput 88:106060. https://doi.org/10.1016/j.asoc.2019.106060

Meng Z, Zhan X, Li J, Pan Z (2018) An enhancement denoising autoencoder for rolling bearing fault diagnosis. Measurement 130:448–454. https://doi.org/10.1016/j.measurement.2018.08.010

Gondara L, Wang K (2018) MIDA: Multiple imputation using denoising autoencoders. arXiv:1705.02737v3

Abiri N, Linse B, Edén P, Ohlsson M (2019) Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems. Neurocomputing 365:137–146. https://doi.org/10.1016/j.neucom.2019.07.065

Jiang B, Siddiqi MD, Asadi R, Regan A (2021) Imputation of missing traffic flow data using denoising autoencoders. Procedia Comput Sci 184: 84–91. https://doi.org/10.1016/j.procs.2021.03.122, the 12th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 4th International Conference on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops

Alamoodi A et al (2021) Machine learning-based imputation soft computing approach for large missing scale and non-reference data imputation. Chaos, Solitons & Fractals 151:111236. https://doi.org/10.1016/j.chaos.2021.111236

Abirami S, Chitra P (2021) Regional air quality forecasting using spatiotemporal deep learning. J Clean Prod 283:125341. https://doi.org/10.1016/j.jclepro.2020.125341

Castelli M, Clemente FM, Popovič A, Silva S, Vanneschi L (2020) A machine learning approach to predict air quality in California. Complexity 2020:1–23. https://doi.org/10.1155/2020/8049504

Carslaw DC, Ropkins K (2012) openair — an r package for air quality data analysis. Environ Modell Softw 27–28:52–61. https://doi.org/10.1016/j.envsoft.2011.09.008

Rao R (2021) Air quality data in india (2015 - 2020). https://www.kaggle.com/rohanrao/air-quality-data-in-india

Zhang S et al (2017) Cautionary tales on air-quality improvement in beijing. Proceed Royal Soc A: Math Phys Eng Sci 473(2205):20170457. https://doi.org/10.1098/rspa.2017.0457

Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

Carter N (ed) (2020) Data Science for Mathematicians. Chapman and Hall/CRC

Jebli I, Belouadha F-Z, Kabbaj MI, Tilioua A (2021) Prediction of solar energy guided by Pearson correlation using machine learning. Energy 224:120109. https://doi.org/10.1016/j.energy.2021.120109

The pandas development team (2020) pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134

Qi Y, Li Q, Karimian H, Liu D (2019) A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory. Sci Total Environ 664:1–10. https://doi.org/10.1016/j.scitotenv.2019.01.333

Silva-Ramírez E-L, Cabrera-Sánchez J-F (2021) Co-active neuro-fuzzy inference system model as single imputation approach for non-monotone pattern of missing data. Neural Comput Appl 33(15):8981–9004. https://doi.org/10.1007/s00521-020-05661-5

Abadi M et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Software available from tensorflow.org

Chollet F et al. (2015) Keras. https://keras.io

Harris CR et al (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2

Pedregosa F et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

Hunter JD (2007) Matplotlib: A 2d graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/mcse.2007.55

Waskom M (2021) seaborn: statistical data visualization. J Open Source Soft 6(60):3021. https://doi.org/10.21105/joss.03021

Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980

Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org

Wardana INK, Gardner JW, Fahmy SA (2021) Optimising deep learning at the edge for accurate hourly air quality prediction. Sensors 21(4):1064. https://doi.org/10.3390/s21041064

Chicco D, Warrens MJ, Jurman G (2021) The coefficient of determination r-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci 7. https://doi.org/10.7717/peerj-cs.623

National Research Council (1991) Improving information for social policy decisions: the uses of microsimulation modeling. National Academies Press, Washington

Noori R, Hoshyaripour G, Ashrafi K, Araabi BN (2010) Uncertainty analysis of developed ANN and ANFIS models in prediction of carbon monoxide daily concentration. Atmos Environ 44(4):476–482

Moazami S et al (2016) Reliable prediction of carbon monoxide using developed support vector machine. Atmos Pollut Res 7(3):412–418

## Acknowledgements

INK Wardana was supported with a studentship from Indonesia Endowment Fund for Education (LPDP).

## Funding

This work was supported in part by Indonesia Endowment Fund for Education (LPDP), Ministry of Finance, Republic of Indonesia under grant number Ref: S-1027/LPDP.4/2019.

## Author information

### Authors and Affiliations

### Corresponding author

## Ethics declarations

### Conflict of interest

The authors declare that they have no conflict of interest with respect to the research, authorship and/or publication of this article.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix A

### Algorithm for post-training model outputs interpretation

As shown in Algorithm 1, the test sets \(\mathbf {X}\) are fed to the model (row 1), resulting in a three-dimensional prediction set \(\mathbf {Y}\) of size (*n*, 8, 4), where *n* is the number of test sets fed to the model. The prediction set \(\mathbf {Y}\) is scaled back to the original value range, resulting in a matrix \(\mathbf {YY}\) (row 2). To reduce computation, we select only the target station predictions, obtained by extracting the first column of each output set (row 3); this yields a 2D matrix \(\mathbf {YY}\) of size (*n*, 8). Next, each row is right-shifted one step relative to the previous row (rows 8:10), using a sparse matrix \(\mathbf {A}\) constructed to handle this rolling scheme (rows 4:6). The sums of each column are then computed to give a single-row matrix \(\mathbf {S}\) (row 11). Because \(\mathbf {S}\) is obtained from different numbers of overlapped values (see Fig. 6), the divisor of each element in \(\mathbf {S}\) varies (rows 14:21): for the first seven elements, the divisors increase from 1 to 7; for the last seven elements, they decrease from 7 to 1; all values in between are divided by 8.
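This rolling-and-averaging scheme can be expressed compactly in NumPy without the explicit sparse matrix. The sketch below is equivalent to the divisor pattern described above; the constant test input is illustrative:

```python
import numpy as np

def aggregate_overlapping(yy: np.ndarray) -> np.ndarray:
    """Average overlapping window predictions into one time series.

    `yy` has shape (n, w): row i predicts time steps i..i+w-1 for the
    target station. Summing right-shifted rows and dividing each step by
    the number of windows covering it reproduces Algorithm 1's divisor
    pattern (1..w-1 at the edges, w in the middle).
    """
    n, w = yy.shape
    length = n + w - 1
    sums = np.zeros(length)
    counts = np.zeros(length)
    for i in range(n):
        sums[i:i + w] += yy[i]
        counts[i:i + w] += 1
    return sums / counts

# Constant predictions stay constant after aggregation.
yy = np.ones((5, 8))
out = aggregate_overlapping(yy)
print(out.shape)  # (12,)
```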

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Wardana, I.N.K., Gardner, J.W. & Fahmy, S.A. Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder.
*Neural Comput & Applic* (2022). https://doi.org/10.1007/s00521-022-07224-2

Received:

Accepted:

Published:

DOI: https://doi.org/10.1007/s00521-022-07224-2

### Keywords

- Missing data
- Air pollutant
- Spatiotemporal
- Autoencoder
- Convolutional layer