A reinforcement learning-based approach for imputing missing data

Missing data is a major problem in real-world datasets, which hinders the performance of data analytics. Conventional data imputation schemes such as univariate single imputation replace missing values in each column with the same approximated value. These univariate single imputation techniques underestimate the variance of the imputed values. On the other hand, multivariate imputation explores the relationships between different columns of data, to impute the missing values. Reinforcement Learning (RL) is a machine learning paradigm where the agent learns by taking actions and receiving rewards in response, to achieve its goal. In this work, we propose an RL-based approach to impute missing data by learning a policy to impute data through an action-reward-based experience. Our approach imputes missing values in a column by working only on the same column (similar to univariate single imputation) but imputes the missing values in the column with different values thus keeping the variance in the imputed values. We report superior performance of our approach, compared with other imputation techniques, on a number of datasets.


Introduction
Missing data is a common problem in real-life datasets. Missing data is caused by incomplete/no measurements due to human/system errors, data corruption, and privacy concerns of users filling data for surveys. Missing data hinders the data analysis because most of the analytical approaches cannot straightforwardly work with incomplete data [41]. Usually, the data are pre-processed to overcome this problem. As such, the goal of data pre-processing is to produce a high-quality dataset without missing values. Such preprocessing techniques include imputation, a term used for handling missing values by replacing missing data with substitute values. Given the relevance of missing data in real-life datasets, missing value imputation has received considerable attention and many imputation methods have been proposed in the literature [17,21].
Missingness in data can be categorised as [23]: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Data are classified as MCAR if the missingness of the data occurs entirely at random (with no dependency on other variables). An example of MCAR is data lost because of an equipment failure. This type of missingness is not biased towards any factor/variable and hence does not bias the data analysis. Data are MAR when the probability of the missing data depends on the set of observed data values (e.g. the observed values in other columns of tabular data). For instance, older patients might be more likely than younger patients to forget a data value (hence missing data). MNAR occurs when the missingness probability depends on the incomplete variables themselves (that is, the missingness cannot be explained from the observed variables). For example, people with higher income are less likely to reveal it. In this case, an income value is missing because it was too high (the reason for a missing column value is associated with the same column). Various strategies are employed to deal with each type of missing data [15]. In this paper, we introduce missingness in the data by randomly removing values across the data, thus our data are MCAR in this work.
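As an illustration of how MCAR missingness can be introduced by random removal, the following minimal sketch deletes a fixed fraction of cells uniformly at random. The helper name `make_mcar`, the use of `None` as the missing marker, and the synthetic 10 x 4 table are assumptions made for this example, not details taken from our implementation.

```python
import random

def make_mcar(rows, missing_rate, seed=0):
    """Return a copy of `rows` with about `missing_rate` of all cells set to None.

    Every cell is equally likely to be removed, independent of its value and
    of every other column, which is what makes the missingness MCAR.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(missing_rate * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked

# Illustrative 10 x 4 table (synthetic values, not the toy data of this paper).
data = [[0.1 * (i + j) for j in range(4)] for i in range(10)]
masked = make_mcar(data, missing_rate=0.10)
n_removed = sum(v is None for row in masked for v in row)
print(n_removed, "cells removed out of", 10 * 4)
```

Because the cells to delete are sampled without looking at any value, the probability of a value being missing does not depend on observed or unobserved data.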
A popular approach to deal with missing data is the complete case analysis [13]. This approach considers only those data observations which have no missing values and deletes all the other data. This approach may lead to a substantial loss of important information since in many cases: (1) the data may be missing from only a few attributes (data from other attributes/columns can be useful, while it is deleted in the complete case analysis), and (2) the data may be missing from a large number of samples (a large amount of (row) data are deleted in this case). Another commonly used approach, called hot deck imputation, fills the missing values with random values picked from similar non-missing samples [2]. The main drawback of this approach is its lack of ability to preserve the covariance structure in the imputed data [16].
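The two strategies above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, missing entries are represented by `None`, and the hot deck variant draws donors from the whole column rather than from "similar" samples, which a full hot deck procedure would require.

```python
import random

def complete_cases(rows):
    """Keep only the rows that have no missing (None) entries."""
    return [row for row in rows if all(v is not None for v in row)]

def random_hot_deck(rows, seed=0):
    """Fill each missing cell with a random observed value from the same column.

    Simplification (assumption): donors come from the whole column, not from
    a set of similar samples.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    donors = [[row[j] for row in rows if row[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else rng.choice(donors[j]) for j, v in enumerate(row)]
        for row in rows
    ]

rows = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, None],
    [10.0, 11.0, 12.0],
]
kept = complete_cases(rows)
filled = random_hot_deck(rows)
print(len(kept), "of", len(rows), "rows survive complete case analysis")
```

In this small table only two cells are missing, yet complete case analysis discards half of the rows, which illustrates the loss of information discussed above.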
A number of techniques have been applied to solve the missing data imputation problem. There are two main types of imputation techniques: single imputation and multiple imputation [8]. Single imputation approaches estimate the missing values in the data only once, while multiple imputation approaches produce multiple datasets each with an approximation/estimate of the missing values, and the results from all the imputations are consolidated in the final stage to infer the missing values. The single imputation approaches can broadly be categorized as [13]: (1) univariate single imputation approaches such as ad-hoc imputation, nonresponse weighting, and likelihood-based methods; and (2) multivariate single imputation approaches such as k-Nearest Neighbours (kNN), and Random Forests (RF)-based imputation. The univariate imputation approaches replace missing values in a column (of a tabular data) by using the observed values in the same column, whereas the multivariate imputation approaches use the observed values in other columns of the data to estimate the missing values of a column of the data.
Univariate single imputation approaches, in general, impute the missing values in a column of data by using only non-missing values from the same column. The ad hoc imputation aims at maintaining the full data sample by filling the missing values with estimated values. The missing values (in a column of tabular data) are estimated with a single value, in this approach, such as mean or median of the corresponding data feature [20]. The nonresponse weighting approach also estimates a single value. However, the imputed value, in this case, is a weighted estimate of the population mean or median. The weight is determined by the ratio of the number of samples in each group of data. This approach is suitable only when the data population contains majority samples from one group and a few samples from the other groups. The single imputation approaches fill all the missing values (in a column of tabular data) using only one value, which generally underestimates the errors of data imputation. The likelihood-based methods aim at modelling the missing data mechanism by maximizing the likelihood function of the data [14]. Once the parameters of the likelihood function are estimated, the missing values are produced based on these parameters.
Multivariate single imputation approaches use all the available data across the columns to estimate the missing values. Machine Learning (ML) techniques such as k-NN and RF have been used to address the missing data problem by learning the hidden patterns in the data [19,33]. Besides the added time complexity of the ML-based approaches, in general, the kNN approach is known to be sensitive to outliers, requires a careful selection of the parameter 'k', and is imprecise in imputing variables which have no dependencies in the dataset [4]. The RF method is known to have biased results at the extreme values of the continuous variables [26].
Multiple imputation replaces the missing values with a set of plausible values by predicting the missing values using the existing data from the other variables [29]. This approach maintains the natural variability and uncertainty in the predicted values. In multiple imputation techniques, the imputation process is iterated several times, each time creating a completed dataset. The completed dataset is then analysed using statistical analysis to generate results. Subsequently, the averaged results are reported. Examples of multiple imputation include techniques based on joint modelling [25], and fully conditional specifications [35]. The former approach assumes a normal distribution of incomplete variables for imputation, while the latter imputes missing values based on univariate conditional distributions for each incomplete variable given other variables. Despite its sophistication, multiple imputation at times underperforms compared to the simpler (single) imputation approaches [10,31]. This observation motivated us to focus on single imputation in this work.
Reinforcement Learning (RL) is a type of ML that is well suited to optimization problems, which it handles by exploring an environment formed from the problem and the data. RL enables an agent to learn the best sequence of decisions, through a series of actions and rewards, to achieve an ultimate goal in an environment. We believe that the missing data can be imputed using an RL agent, capable of performing the most suitable action at the right time, to best achieve the goal of approximating the missing data. Therefore, in this work, we propose an RL-based approach to impute missing data in real-world datasets. Our proposed approach is a univariate single imputation approach. The key aspect of our approach is its ability to estimate missing values without neglecting the variance of the imputed variable, unlike conventional univariate single imputation approaches.

Related work
A brief categorization of missing data imputation techniques is shown in Fig. 1. Missing data imputation approaches are broadly classified as single imputation approaches, and multiple imputation approaches. A detailed description of these approaches is presented as follows.

Single imputation approaches
Single imputation approaches estimate each missing value in data with only one value. There are two broad categories of single imputation approaches: univariate single imputation and multivariate single imputation. The detail of these approaches is given as follows.

Univariate single imputation
The most common approach to missing data imputation is the univariate single imputation. Univariate single imputation approaches estimate the missing values, in a column of data, by using the available values from only the same column. Therefore, all the missing values in a column of data are replaced with exactly the same value. Figure 1 shows a number of univariate single imputation approaches, which impute the same value for each missing value in a column (of tabular data), including the mean, median, most frequent value, and the last observation carried forward imputations [20]. The mean- and median-based imputation approaches impute the missing values in a column with the mean and median of the available values in that column, respectively. The most frequent value-based imputation replaces missing values in a column with the most common value in that column of data. The last observation carried forward imputation replaces missing values with the last observed values. While these approaches are used frequently, they discard the variance of the imputed values, since all the missing values in a column of data are replaced with the same value. These approaches are rigid and are likely to distort the distribution of the imputed variables [18].
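The loss of variance can be seen in a small sketch. The helper `impute_constant` and the example column are hypothetical and use only the Python standard library; missing entries are represented by `None`.

```python
import statistics

def impute_constant(column, strategy="mean"):
    """Replace every None in `column` with one statistic of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    imputed_only = [fill for v in column if v is None]
    completed = [fill if v is None else v for v in column]
    return completed, imputed_only

column = [0.2, None, 0.4, 0.4, None, 0.9]
for strategy in ("mean", "median", "most_frequent"):
    completed, imputed_only = impute_constant(column, strategy)
    # All imputed entries are identical, so their variance is zero.
    print(strategy, imputed_only, statistics.pvariance(imputed_only))
```

Whatever statistic is chosen, both missing entries receive the same value, so the variance of the imputed values is zero.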

Our approach
Our proposed univariate single imputation approach uses a single column during imputation, but estimates dynamic values for each missing value in a column (of tabular data). To the best of our knowledge, this is the first attempt towards a univariate single imputation approach, which replaces missing values in a column while maintaining the variance in the estimated values. The policy learnt by our RL agent guides the imputation process to impute a missing value by using the available values in only the same column of data. A detailed description of our approach is presented in Sect. 3.2.

Multivariate single imputation
Multivariate single imputation approaches estimate missing values in a column of data, by using the available data in the other columns. These approaches estimate the missing values in a variable by using the relationship among the available data of the other variables. Figure 1 shows a number of multivariate single imputation techniques. One such technique poses the missing data imputation task as a matrix completion problem [6]. This technique comprises a first-order algorithm to fill in the missing entries in low-rank matrices with a minimum nuclear norm. The authors of [5] proposed iterative imputation of the missing values of each feature by regressing on the values of the remaining features. All these methods are linear in nature, and so may not be able to capture the nonlinear relationships between the observed and missing values.
ML techniques such as kNN have also been used to impute missing data [19,20]. In kNN-based imputation, each missing value is replaced with a value obtained from the related observations of the available dataset. Although this approach is considered an efficient method to fill in the missing data, it tends to distort the true distribution of the data [4].
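A minimal sketch of kNN-based imputation is given below. It is a simplified illustration rather than the procedure of [19,20]: the function name `knn_impute`, the distance (root mean squared difference over the columns both rows have observed), and the use of the neighbours' mean are assumptions made for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows.

    Distance is computed over the columns both rows have observed. Rows that
    share no observed column, or lack the target column, are not candidates.
    A cell with no usable neighbour is left as None.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if d < math.inf]
            if neighbours:
                completed[i][j] = sum(neighbours) / len(neighbours)
    return completed

rows = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 11.0],
    [5.0, 5.0, 50.0],
    [1.0, 1.1, None],
]
completed = knn_impute(rows, k=2)
print(completed[3][2])
```

The last row is closest to the first two rows, so its missing entry is filled from their values and the distant third row is ignored. The result depends on the choice of k, as noted in the Introduction.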
A proximity matrix is also used to impute missing data using RF [33]. In this technique, the data are first imputed using the median (for continuous variables) and the most frequently occurring value (for categorical variables). Then, an RF is generated using the filled data and a proximity matrix of size n × n is created, where n is the sample size (number of rows in tabular data). This proximity matrix is then used to impute the originally missing data. The updated data are used to grow another RF and the process is repeated.
Some works have used autoencoders to impute missing data [11,34]. Gondara et al. proposed a multivariate imputation technique based on deep denoising autoencoders [11]. However, this approach assumed that there is enough complete data to train a model, which might not be the case in real-world datasets. Tran et al. cascaded a series of residual autoencoders to learn the complex relationship from data of different modalities to impute the missing data [34]. This approach combined the strengths of residual learning and autoencoders. Although the autoencoders are empirically effective, these imputation approaches based on autoencoders are heuristic based and it is unclear what mathematical objective is defined for the missing values.
Instead of generating candidate values for the missing data, Smieja et al. presented a general approach to make neural networks process the incomplete data by building a probabilistic model of the incomplete data [28]. Their approach replaced the typical neuron's response in the first hidden layer of a neural network with its expected value to achieve more generalized and accurate activations of the neurons and improve the imputation performance.
A modified Radial Basis Function (RBF) was proposed to generalize the standard Gaussian RBF kernel of Support Vector Machines (SVM) to suit incomplete data [27]. This approach uses the characteristics of the data distribution to model the uncertainty of the missing data for the purpose of data imputation.
A modified Generative Adversarial Network (GAN) was proposed by Yeh et al. to fill in the missing regions in natural images (known as inpainting) [38]. Their approach was able to learn the representations from the training data and predict the missing patches by using meaningful context. This approach, however, requires complete data in the training phase, which is not common in real-world datasets. Since an image is represented as a matrix (or a table) of values, where a value might represent the intensity of a pixel, it is structurally similar to tabular data that does not represent images. Hence, these approaches can also be applied to tabular data.
A Denoising Auto-Encoder-based approach was proposed to impute missing values [24]. Their approach deleted some new missing values in those samples which already have missing values. This extra deletion allowed to better reconstruct the incomplete data by training autoencoders. Their work also introduced a compensation strategy, by adding a balancing parameter in the loss function, to minimize the imbalance in data which was created by the deletion step. Their method achieved similar imputation performance compared with the Multiple Imputation by Chained Equations (MICE), a popular multiple imputation approach.
Popular Generative Adversarial Networks (GANs) [12] have also been used to impute missing data. GANs are a type of machine learning algorithm with generator and discriminator parts, both working in an adversarial manner. Feedback from the discriminator is passed to the generator to allow it to produce more real-like examples, in an effort to deceive the discriminator. Yoon et al. proposed a GAN-based method, named GAIN, to impute missing data. Their generator completes the missing values given the observed ones, and the discriminator aims to distinguish between true and imputed values [39]. Recently, Awan et al. proposed a class-specific distribution by adapting the popular conditional generative adversarial networks to impute the missing data. Their approach learns class-specific probability distributions in the training phase, which allows it to impute the missing values more precisely than the GAIN approach [3].

Multiple imputation approaches
Multiple imputation tries to restore the natural variability in the imputed values. This approach first produces n copies of data [n is typically in the range 5-10 [30]] by imputing missing values in the data n times using a multivariate single imputation approach. Then, each copy of the data is analysed using a standard method (e.g. regressor or a classifier) for complete data. Finally, the results from the analytical method are combined to achieve statistical inference reflecting the uncertainty due to the missing values [22]. MICE is a commonly used approach to generate imputations based on a set of imputation models, one for each variable with missing values [37].
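The impute, analyse, and pool steps can be sketched as follows. This is a deliberately simplified illustration: each imputation draws replacements at random from the observed values of the same column (instead of a multivariate single imputation model), and the "analysis" is just the column mean. The function name and these choices are assumptions made for the example.

```python
import random
import statistics

def multiple_imputation_mean(column, n_copies=5, seed=0):
    """Impute `column` n_copies times, analyse each copy, and pool the results.

    Assumptions for illustration only: replacements are random draws from the
    observed values, and the analysis applied to each copy is the column mean.
    """
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    estimates = []
    for _ in range(n_copies):
        completed = [v if v is not None else rng.choice(observed) for v in column]
        estimates.append(statistics.mean(completed))  # analyse one completed copy
    pooled = statistics.mean(estimates)                # combine the n analyses
    # Spread of the estimates across copies reflects imputation uncertainty.
    between = statistics.variance(estimates) if n_copies > 1 else 0.0
    return pooled, between

column = [0.2, None, 0.4, 0.5, None, 0.9]
pooled, between_var = multiple_imputation_mean(column, n_copies=5)
print(pooled, between_var)
```

The between-copy variance is what a single imputation cannot provide: it quantifies how much the final estimate depends on the particular values that were imputed.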

The proposed method
The goal of an RL approach is to train an agent to take decisions at any stage in an environment, so as to achieve a goal using rewards and punishments. In our work, we aim to train an agent, using RL, to estimate a separate value for each missing value in a column of data. Our agent learns to take a series of decisions to make the best estimate of the missing values. A detailed description of RL and our proposed approach is given in the following sections.

Concept of Reinforcement learning
RL is a machine learning method which is concerned with how an agent should react in an environment. The goal of RL is to train an agent to take a sequence of decisions, using a system of rewards and penalties, to solve a problem by itself. RL achieves its purpose by emulating a scenario and noting the corresponding response of the agent. The agent is rewarded if the response is the desired one and penalized otherwise [32]. Therefore, the next time the agent faces the same situation, it executes a similar action with even more confidence to collect more rewards. Hence, the agent learns "what to do" from good experiences, and "what not to do" from bad experiences.
RL is widely used in robots nowadays, which play a vital role in various applications, such as agriculture, manufacturing, customer service, and health care. Robots in health care provide patients support and assistance in critical situations. These robots are trained by RL, which allows them to learn according to the patients' needs [1].
The core features of an RL paradigm (see Fig. 2) are as follows:
- Observation of the environment: an agent is exposed to the environment.
- Finding yourself in the environment: the situation of the environment that the agent faces, called a state.
- How to act using some strategy: the agent reacts by performing an action to evolve from one state to another.
- Receiving a reward or penalty: after the transition, the agent may receive a reward or penalty in return.
- Learning from experiences: to create a policy, which is the strategy of choosing an action given a state to achieve better outcomes.

Our proposed approach
Our proposed RL-based approach for missing data imputation is based on Q-learning (quality learning) [36]. In our RL approach, an agent learns an optimal action-selection policy, from its interaction with the environment, using a Q function [36]. An episode of environment interaction is recorded as $(s, a, r, s')$ using the initial state of the agent ($s$), the action taken by the agent ($a$), the reward offered for this action ($r$), and the resultant state of the agent ($s'$). Our agent maintains a Q-table of Q-values, which is updated as

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a} Q_t(s', a) - Q_t(s, a) \right] \quad (1)$$

where $t$ represents the current time step, and $t + 1$ is the next time step, $\alpha$ is the learning rate $(0 < \alpha \le 1)$ which determines the amount of update to be made in Q-values in each iteration, and $\gamma$ is the discount factor $(0 \le \gamma \le 1)$ which controls the importance given to future rewards. $R(s, a)$ is the current reward for performing an action in the current state. The term $\max_{a} Q_t(s', a)$ is the current Q-value estimate of the next best action to be picked. Equation (1) updates the Q-value of the agent's current state and action, by adding the learned value. The learned value is a weighted combination of the reward for taking an action in the current state, and the discounted maximum reward from the next state. This approach motivates the agent to collect maximum rewards and in doing so, learn the best actions to take in a state. The objective of Eq. (1) is to learn a policy to reach the state of lowest error $(s_0)$ from any other state $(s_1$ to $s_9)$. The Q-values are repeatedly updated using Eq. (1) until a policy is learnt (1000 repetitions in our work). We initialize the Q-table with zeros, which represents the learning of a policy from scratch. Next, an action is chosen from the Q-table and performed using an epsilon-greedy strategy. Initially, the value of epsilon is large and the agent explores the environment by choosing actions randomly. The epsilon value gradually decreases and the agent starts to exploit the environment with its experience.
In our work, the agent had two actions to choose from: increase the estimated value, or decrease the estimated value. The main steps of the training loop are:
Observe the current state s
Repeat
    Select and perform an action
    Update the imputed value based on the action
    Observe the reward R(s, a) and the new state s'
In the training phase, the agent knows nothing about the environment initially (i.e. where to look for the best estimate of the missing value). Gradually, the agent learns the manoeuvring and saves it as a policy in the Q-table. Once the Q-table is ready, the agent can start to exploit the environment by taking better actions in each state.
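The update step and the epsilon-greedy action selection described in this section can be sketched as below. This is the standard tabular Q-learning update, written as a minimal illustration; the function names and the default values of `alpha`, `gamma`, and `epsilon` are assumptions for the example and are not the tuned values used in our experiments.

```python
import random

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for the transition (s, a, r, s')."""
    best_next = max(Q[s_next])             # best Q-value available in the next state
    learned = reward + gamma * best_next   # reward plus discounted future value
    Q[s][a] += alpha * (learned - Q[s][a])
    return Q[s][a]

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

n_states, n_actions = 10, 2                # two actions: increase or decrease
Q = [[0.0] * n_actions for _ in range(n_states)]   # Q-table initialised with zeros
rng = random.Random(0)

action = epsilon_greedy(Q, s=3, epsilon=1.0, rng=rng)      # epsilon = 1: pure exploration
new_value = q_update(Q, s=1, a=0, reward=100.0, s_next=0)  # a rewarded transition
greedy = epsilon_greedy(Q, s=1, epsilon=0.0, rng=rng)      # epsilon = 0: pure exploitation
print(action, new_value, greedy)
```

Decreasing `epsilon` over the course of training moves the agent from the random exploration of the early phase to exploitation of the learnt Q-table.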
Our proposed RL-based approach for imputing missing data is shown in Fig. 3. The process starts with imputing a single initial value for each missing entry (for example, the mean value of its column). This approximation determines the state of the imputation, based on the error (how close/far the imputed value is from the ground-truth value). The RL model guides the next imputation value such that the imputed value is pushed towards the state of lower error. Each transition between the states updates the imputation value. At the end of this process, we achieve an imputation value which is very close to the ground truth.

A Markovian formulation of our approach
A Markov Decision Process (MDP) consists of states, actions, rewards, and transitions between the states. In our approach, the set of environment states $S$ is defined as a finite set $\{s_0, s_1, \ldots, s_N\}$, where $N$ is the size of the state space $S$. The size of the state space $S$ is a hyper-parameter, empirically chosen as 10. A state, in our work, is a measure of how far an estimated value is from the actual value. The set of actions $A$ is a finite set $\{a_1, a_2, \ldots, a_K\}$ where $K$ is the size of the action space. An action $a \in A$ applicable to a state $s \in S$ is denoted as $A(s)$, where $A(s) \in A$. Each action is used to control the environment's state. In this work, our agent picks one out of two actions, i.e. increase the estimated value or decrease it. By applying an action $a \in A$ in a state $s \in S$, the environment transitions from state $s$ to a new state $s' \in S$. The reward function specifies rewards for being in a state, or doing some action in a state. Our reward function is formally defined as $R : S \times A \times S \to \mathbb{R}$ and represented by the Q-matrix.
An MDP is a sequence of tuples $(s, a, r, s')$. These sequences of transitions define the model of the MDP. A pictorial depiction of MDPs is shown in Fig. 3, where the nodes correspond to states and directed edges represent the transitions. Given the MDP, a policy function $\pi$ outputs for each state $s \in S$ an action $a \in A$. The training begins with a start state, e.g. $s_0$, then the policy $\pi$ suggests an action $a_0$, which is performed. A new state $s_1$ is achieved with this transition and a reward $r_0$ is collected. This process continues producing $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$, and ends when a goal state, in our case $s_0$, is achieved. The same process is then repeated with a new start state. The learnt policy becomes part of the agent and helps it to control the environment modelled as an MDP.
Fig. 3 A Markovian depiction of our RL approach to impute missing values. The training aims at reaching the minimum error state, $s_0$, to minimize the imputation error

A toy example
We present a toy example to demonstrate our proposed approach for imputing missing data. We use the data given in Table 1 as our reference data. The data contain 10 instances, each having 4 columns.
We randomly delete 10% of the total data to create missing values (see Table 2). The classical univariate single imputation approaches such as mean, median, and the most frequent value estimate the missing values using statistical measures. A comparison of the statistical-based estimated values and our proposed approach, for the missing values in each column of the toy example data, is given in Table 6 (discussed at the end of this section) (Table 3).
Our RL-based approach starts with learning the policy matrix (Q-matrix) for imputation. For this purpose, we initialize an n × n matrix of zeros, where n represents the number of states. In this example, we empirically select n to be 10. Each state is based on the error of the imputed value compared with the ground truth value. The rewards matrix R is of the same size as Q. The R-matrix contains zero if the path between the corresponding states is viable, and -1 otherwise (the paths are shown in Fig. 3; see Table 4 for the R matrix). The error decreases going from state nine (s9) towards state zero (s0) and increases in the opposite direction, and our goal is to reach the state with minimum error (s0). Therefore, the reward for a path leading into the goal state is set to 100.
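The construction of the R-matrix and the training of the Q-matrix can be sketched as below. Several details are assumptions made for this illustration and are not stated in the text: only neighbouring states are connected, remaining in the goal state is also rewarded with 100, the learning rate is 1, the discount factor is 0.9, and each of the 1000 iterations is one episode of random exploration that ends at the goal state.

```python
import random

N_STATES, GOAL, GAMMA = 10, 0, 0.9

def build_rewards(n=N_STATES, goal=GOAL):
    """R[s][s2] is 0 for a viable move, -1 for no path, 100 for a move into the goal."""
    R = [[-1] * n for _ in range(n)]
    for s in range(n):
        for s2 in (s - 1, s + 1):        # assumed: only neighbouring states connect
            if 0 <= s2 < n:
                R[s][s2] = 100 if s2 == goal else 0
    R[goal][goal] = 100                   # assumed: staying in the goal is rewarded
    return R

def train_q(R, episodes=1000, seed=0):
    """Learn Q[s][s2] by random exploration over the viable transitions."""
    n = len(R)
    rng = random.Random(seed)
    Q = [[0.0] * n for _ in range(n)]     # Q-matrix initialised with zeros
    for _ in range(episodes):
        s = rng.randrange(n)              # random start state
        while True:
            viable = [s2 for s2 in range(n) if R[s][s2] >= 0]
            s2 = rng.choice(viable)
            Q[s][s2] = R[s][s2] + GAMMA * max(Q[s2])   # Eq. (1) with learning rate 1
            s = s2
            if s == GOAL:
                break
    return Q

R = build_rewards()
Q = train_q(R)
# For each current state (row), the column with the maximum Q-value is the next state.
policy = [max(range(N_STATES), key=lambda s2: Q[s][s2]) for s in range(N_STATES)]
print(policy)
```

Under these assumptions the learnt policy moves each state one step closer to s0, which is the behaviour described for the trained Q-matrix.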
We obtained our trained Q-matrix (shown in Table 4) after 1000 iterations. This matrix contains the policy in the form of a sequence of steps going from a state of higher error to a state of lower error. For each current state (row of the Q-matrix), the column which contains the maximum value is the policy for the next state. Once the Q-matrix is ready, the missing values can be estimated by following the policy given by the Q-matrix and updating the estimated value accordingly. The policy is derived from the current state of the agent, followed by the sequence of steps to reach the state zero (s0). This process is presented in Table 6 for the missing values in column 2 of our toy example data. The example shows that our proposed approach imputes the two missing values in column 2 of our toy example with two different values. This is a key advantage of our approach since the conventional univariate single imputation techniques lack variance in the imputed values. A brief description of the process is as follows: The imputation process starts with an initial estimate of the missing value (column mean in this example). Then, we calculate the error between the estimated value and the ground truth value. The new state of the agent is calculated based on this new imputation error. Then, the Q-matrix is used to get the next move, and the imputation value is updated accordingly. The new value is a weighted update of the current value based on a weight parameter (r), i.e. $value_{new} = value_{old} (1 \pm r)$. The sign in this equation is governed by the policy learnt during the training phase. We keep r at 0.01 for our toy example. The update in the estimated value is repeated until we reach the state with the minimum error, i.e. s0. The estimated value at that point is taken as the imputation value based on our approach. Table 5 presents a detailed calculation of the imputed values of "Col 2" of the toy example, based on our proposed approach.
Table 6 compares the imputation based on mean, median, the most frequent value, kNN, RF, and our proposed approach, on the missing values in each column of the toy example. The first missing value in "Col 2" of the toy example (original value of 0.39) is estimated as 0.44, 0.397, 0.260, 0.463, and 0.330 using mean, median, the most frequent value, kNN, and RF, respectively. Our proposed RL-based approach imputes this missing value with 0.396. The same mean, median, and most frequent values are imputed to the second missing value in "Col 2" (original value of 0.460). The kNN and RF-based imputations impute 0.353 and 0.398, while our proposed approach imputes it with 0.452, which is a better approximation of the original value. Table 6 shows that our proposed RL-based approach outperforms the other imputation approaches on the toy example data.
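The iterative update can be sketched as follows. Two parts are assumptions made for this illustration: the rule that maps an error to a state (equal-width bins of 0.01, with s0 as the lowest-error bin), and the greedy choice between increasing and decreasing the value, which stands in for the move suggested by the trained Q-matrix. The resulting numbers therefore need not match those reported in Table 5.

```python
def error_state(estimate, truth, n_states=10, bin_width=0.01):
    """Map the absolute error to a state index; s0 is the lowest-error state.

    The binning rule is an assumption: the text only says that a state
    measures how far the estimate is from the actual value.
    """
    return min(int(abs(estimate - truth) / bin_width), n_states - 1)

def impute_value(initial, truth, r=0.01, max_steps=1000):
    """Update the estimate by a factor (1 + r) or (1 - r) until state s0 is reached."""
    value = initial
    for _ in range(max_steps):
        if error_state(value, truth) == 0:
            break
        # Stand-in for the learnt policy: pick the action (increase or
        # decrease) that leads towards the state of lower error.
        up, down = value * (1 + r), value * (1 - r)
        value = up if abs(up - truth) < abs(down - truth) else down
    return value

column_mean = 0.44    # initial estimate for Col 2 of the toy example
first = impute_value(column_mean, truth=0.390)
second = impute_value(column_mean, truth=0.460)
print(round(first, 3), round(second, 3))
```

Both missing values start from the same column mean but end at different estimates, which is how the variance of the imputed values is retained.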

Datasets
We used eight publicly available datasets from the UCI Machine Learning Repository [9]. These datasets have been previously used in the literature, e.g. [40]. The details of these datasets are given in Table 7. The Breast Cancer dataset contains features, from digitized images, representing characteristics of the cell nuclei such as radius, texture, perimeter, and others. The Vehicle dataset is a classification dataset having features extracted from the silhouettes of vehicles. These features include variance, skewness, and kurtosis among others. The Travel dataset contains features that represent the feedback of customers of the Trip Advisor company. The Spambase dataset is a classification dataset whose features come from a collection of emails. The features mostly contain information such as the percentage of occurrence of a specific word in an email, and the length of sequences of consecutive capital letters. The Parkinson dataset is composed of features representing voice measurements of healthy people and patients with Parkinson's disease. The Letter recognition dataset contains features from rectangular images representing the 26 capital letters in the English alphabet. The Default credit card dataset is also a classification dataset representing the possibility of default of a customer. The default of a customer is approximated with age, amount of given credit, history of past payments, and other features. The News popularity dataset contains statistics of online news articles. These statistics include the number of words in the title, number of hyperlinks in the article, the average length of words, and others.

Performance metrics
The performance metrics used in this work to compare our proposed approach for missing data imputation with other available approaches are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). These are the most commonly used metrics to estimate the performance of the missing data imputation approaches [7]. MAE is the mean of all the absolute errors between the imputed and ground truth values, as given in Eq. (2).

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right| \quad (2)$$

where $x_i$ is the ground truth value, $\hat{x}_i$ is the predicted value, and $N$ is the total number of errors. RMSE, given in Eq. (3), represents the square root of the average of the squared differences between the imputed and the ground truth values.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2} \quad (3)$$

While the MAE represents a generic estimate of how far off our imputed values are from the ground truth values, the RMSE penalizes large errors more heavily. This suits us since we want our imputed values to come into the closest-possible vicinity of the ground truth values.
Table 5 Imputation of missing value using our proposed approach on Col 2 of the toy example data in Tables 1 and 2 (ground truth = 0.390, r = 0.01)
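Both metrics are short to compute. The sketch below uses the standard definitions of MAE and RMSE and evaluates them on the two Col 2 values of the toy example quoted earlier in the text; the function names are chosen for this illustration.

```python
import math

def mae(truth, imputed):
    """Mean of the absolute errors between ground truth and imputed values."""
    return sum(abs(x - y) for x, y in zip(truth, imputed)) / len(truth)

def rmse(truth, imputed):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(truth, imputed)) / len(truth))

truth = [0.390, 0.460]          # original Col 2 values of the toy example
rl_imputed = [0.396, 0.452]     # values imputed by the proposed approach
mean_imputed = [0.44, 0.44]     # column mean imputed to both entries

print("RL  :", mae(truth, rl_imputed), rmse(truth, rl_imputed))
print("mean:", mae(truth, mean_imputed), rmse(truth, mean_imputed))
```

Because the errors are squared before averaging, the single larger error of the mean imputation (0.05) dominates its RMSE more than its MAE.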

Experimental setup and results
All the experiments in this work were implemented using Python 3.5 and Scikit-learn 0.22.1. The data were divided into 70% and 30% portions for training and testing, respectively, for each experiment. We created randomly missing data with 5%, 10%, 15%, and 20% proportions across all data in the datasets. All the missing values were replaced with 'nan' during the process. The hyperparameters of Q-learning, such as alpha and the discount factor, were selected based on a grid search. The search spaces of alpha and the discount factor were empirically selected as {0, 0.001, 0.002, ..., 0.5} and {0.9, 0.91, 0.92, ..., 0.99}, respectively. For our approach, Q-learning was performed over 10,000 iterations to learn the policy for missing data imputation. The Policy matrix (Q-matrix) was initialized with all zeros. Moreover, missing data imputation using our proposed approach was repeated 100 times to check the generalizability of the method. The results were found to be similar to the performance over a single iteration (presented later in Table 12). For each experiment, we calculated the MAE and RMSE between the imputed and the ground truth values in the test dataset. To compare the performance of our proposed approach, we used other data imputation techniques: imputation by mean, median, and the most frequent value, nearest neighbour-based imputation, random forest-based imputation, multiple imputation by MICE, GAIN, and CGAIN. The performance of our proposed approach is presented in Tables 8, 9, 10, 11, compared with other imputation methods, for varying amounts of missing data in all eight UCI datasets used in this work. Our proposed approach outperformed the other imputation methods on six datasets and remained in the top three for the other two datasets.
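The setup described above can be sketched as follows. This is an illustrative sketch only: the helper names, the toy data, and the state and action counts are assumptions, since the state and action definitions of the proposed method are not given in this section. The update shown is the standard tabular Q-learning rule, and only the search spaces, the 'nan' masking, and the zero initialization follow the text.

```python
import math
import random


def mask_at_random(rows, proportion, seed=0):
    """Return a copy of `rows` with `proportion` of all cells replaced by NaN."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(proportion * len(cells))
    masked = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = math.nan
    return masked


# Search spaces as listed in the text.
alphas = [round(k * 0.001, 3) for k in range(501)]       # 0, 0.001, ..., 0.5
gammas = [round(0.90 + k * 0.01, 2) for k in range(10)]  # 0.9, 0.91, ..., 0.99


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one standard tabular Q-learning update to `q` in place."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


# Hypothetical sizes: the real state and action spaces are not specified here.
n_states, n_actions = 4, 3
q_matrix = [[0.0] * n_actions for _ in range(n_states)]  # all-zero initialization
q_update(q_matrix, state=0, action=1, reward=1.0, next_state=2,
         alpha=0.1, gamma=0.9)

# Toy 10 x 10 table with 5% of its cells masked completely at random.
toy = [[float(i + j) for j in range(10)] for i in range(10)]
masked = mask_at_random(toy, 0.05)
n_nan = sum(math.isnan(v) for row in masked for v in row)
print(len(alphas), len(gammas), n_nan, q_matrix[0][1])
```

A grid search would then train one policy per (alpha, discount factor) pair from these two lists and keep the pair with the lowest error on held-out data.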
As can be seen in Table 8, our RL-based approach performs well compared to other univariate single imputation and ML-based imputation approaches. Our approach produces a MAE of 0.0183 compared to 0.0271, 0.0201, and 0.0208 for mean, median, and the most frequent value-based univariate single imputations, respectively, for the Spambase dataset with 5% missing data. For the same settings, the RMSE of our approach is 0.0485 compared to 0.0544, 0.0591, and 0.0615 for mean, median, and the most frequent value-based univariate single imputations. The machine learning-based imputation methods produce a MAE of 0.0309, 0.0309, 0.0286, 0.0501, and 0.0447; and an RMSE of 0.0719, 0.0696, 0.0588, 0.0723, and 0.0611, for kNN-based imputation, RF-based imputation, multiple imputation using chained equations, GAIN, and CGAIN, respectively (see Table 8). Our proposed approach also outperforms the other approaches at higher proportions of missing data (see Tables 9, 10, 11). The overall imputation performance degrades for all the methods (i.e., MAE and RMSE increase) as the amount of missing data increases from 5 to 20% (see Tables 8, 9, 10, 11), since less data are available to estimate the missing values. Our proposed approach gives a MAE of 0.0198 compared to 0.0278, 0.0210, 0.0217, 0.0319, 0.0321, 0.0290, 0.0595, and 0.0430, for mean, median, most frequent value-based, kNN-based, RF-based, multiple imputation using chained equations, GAIN, and CGAIN, respectively, for the Spambase dataset with 20% missing data (see Table 11). The RMSE of our approach, for the same settings, is 0.0527 compared to 0.0593, 0.0635, 0.0667, 0.0750, 0.0715, 0.0620, 0.0764, and 0.0601 for mean, median, most frequent value-based, kNN-based, RF-based, multiple imputation using chained equations, GAIN, and CGAIN, respectively.
Our proposed approach ranked first on six datasets, and second-best and third-best on the Letter recognition dataset and the Breast cancer dataset, respectively.

Discussions
The univariate single imputation techniques such as imputation with mean, median, or most frequent value do not account for the variations in the imputed values because they impute the same value for each missing value of a column/feature in the dataset. In this work, we have used a reinforcement learning-based approach to account for variations in imputed values and improve the overall estimation of the missing data. Our approach learns a policy, from the training dataset, on how to vary the imputed values to bring them closer to the ground truth value. The learnt policy is used in the testing phase to vary the imputed value to better estimate the missing values. Our approach has worked well compared to the other imputation methods (see Tables 8, 9, 10, 11). The performance of univariate single imputation techniques deteriorates when the proportion of missing data increases. This is because the single value (such as mean, median, or the most frequent value) is estimated with fewer data samples and the estimate is likely to be less representative of the entire population. The same trend is observed in other imputation approaches as well as our approach (see Figs. 4 and 5). This trend is reasonable since the ML-based approaches are known to perform well given more data for training. In our approach, we can argue that as the percentage of missing data increases, the algorithm is less able to learn the best imputation policy, which worsens the overall performance.
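The claim that univariate single imputation discards variability can be checked with a small standard-library example. The column values and the positions of the missing entries below are hypothetical, and the sketch covers only mean imputation, not the proposed RL-based method.

```python
import statistics


def mean_impute(column):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical column; entries 1, 4, and 6 are treated as missing.
original = [0.10, 0.25, 0.30, 0.45, 0.60, 0.75, 0.90, 0.15]
missing_idx = {1, 4, 6}
with_gaps = [None if i in missing_idx else v for i, v in enumerate(original)]

imputed = mean_impute(with_gaps)
observed = [v for v in with_gaps if v is not None]

std_original = statistics.pstdev(original)
std_observed = statistics.pstdev(observed)
std_imputed = statistics.pstdev(imputed)
print(f"original: {std_original:.4f}  observed: {std_observed:.4f}  "
      f"mean-imputed: {std_imputed:.4f}")
```

Every imputed entry sits exactly at the observed mean, so the filled-in values add no spread and the standard deviation of the completed column falls below that of the observed entries. An imputer that assigns different values to different missing entries avoids this shrinkage.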
The performance of our imputation approach is followed by the mean- and median-based imputation techniques in three datasets (Spambase, Default credit card, and News popularity). This trend is reasonable since the mean and median imputations estimate the missing values with average values. These average values present a reasonable guess provided the distribution of the data is normal. As seen in other studies, the mean and median imputation approaches yielded better results than the multiple imputation approach in our study, likely due to the small amount of missing data in our datasets [10]. The multiple imputation approaches have been shown to produce more dispersed imputed values, which affects their performance when used with a small amount of missing data [10]. Multiple imputation, a popular imputation approach from statistics, has not performed well compared to the ML-based approaches. This might be because the multiple imputation approach creates several imputed values for each missing value, where each estimate is regressed from the observed features. The models used to predict an estimate of the missing value, in the case of multiple imputation, cannot exploit the complex relationships among the observed data. This leads to the inadequate performance of this imputation approach.
It should be noted that although GAIN [39] is a supervised approach, the proposed RL approach consistently outperforms GAIN. In addition, compared to the recently proposed CGAIN [3], the proposed RL approach produced superior results on six datasets and slightly inferior results on two datasets. It should further be noted that CGAIN uses a supervised learning approach, which requires a large amount of training data, while the proposed approach is RL-based. Table 12 shows the performance of our proposed approach compared with other imputation approaches, at different proportions of missing data, over 100 iterations of data imputation to check the generalizability of our approach. Our proposed approach produces an average MAE (mean ± standard deviation) of 0.01781 ± 0.00091, 0.01859 ± 0.00093, 0.0198 ± 0.00083, and 0.02017 ± 0.00054 for 5, 10, 15, and 20% missing data, respectively, in the Spambase dataset when the imputation is repeated 100 times. The RMSE, in the same experiment, is recorded as 0.04936 ± 0.00127, 0.05012 ± 0.00114, 0.051 ± 0.0006, and 0.05287 ± 0.00058 for 5, 10, 15, and 20% missing data, respectively. Table 13 presents the mean and standard deviation of the original Spambase data, original data with missing values, and data with missing values imputed using our proposed approach. Our proposed RL-based imputation ... This characteristic of our approach allows it to impute accurate values for missing data, which ultimately improves imputation performance. These results show a similar distribution of the data imputed using our approach compared with the original data distribution, at lower rates of missing data (5 and 10%). Understandably, the gap between the original and the imputed data distribution increases when the percentage of missing data increases. The limitations of our approach include the use of numeric data variables only. Future work will focus on the inclusion of categorical variables in our approach.
An extension of this work will focus on the use of additional environment information to guide the agent during the policy learning phase.

Conclusion
Missing data imputation has previously been addressed either by univariate single imputation, which discards the variability in the imputed data, or by approximating the missing values using ML models, which impute data by exploiting the inherent relationship between the observed features. We proposed an RL approach to learn a good imputation strategy from experimental trials and the feedback received in response to these trials. Our approach learns the best policy to impute missing data using a trial and reward mechanism for a better approximation of the missing data. The proposed approach has shown superior performance with lower RMSE compared to other data imputation techniques on publicly available datasets. Another advantage of our approach is its ability to maintain the original distribution of data during the process, i.e. the distributions of the imputed data and the original data are similar.

Declaration
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.