Introduction

Missing data, or missing values, are a frequent issue in real-world data analysis. Missing values in datasets may appear as "?", "nan", "N/A", or blank cells. In most studies, missing data is a common and challenging problem because, when mishandled, it can lead to biased, inaccurate, and unreasonable conclusions [1,2,3,4,5,6]. Current analytical methods require a complete dataset to operate, as shown by [7, 8], with related missing-variable issues serving as opportunities to find the correct problem-solving technique [9]. Missing data is also a common problem in classification tasks, rendering the prediction system ineffective [10]. Ignoring this issue affects analytical [1, 11, 12], learning, and predictive outcomes for problems involving collaborative prediction [13]. Furthermore, it can undermine the validity of results and conclusions [3, 12]. In predictive models, improper selection of missing data methods often affects model performance [4, 14] as well as the accuracy and efficiency of classifiers [15].

The pattern of missing values can be univariate or multivariate, monotone or non-monotone, connected or unconnected, and planned or random, as shown in Fig. 1.

Fig. 1 The pattern of missing values

One of the strategies to deal with missing data is data imputation, defined as the process of replacing the missing values in a dataset with certain estimated values so that a complete dataset is produced. Currently, most models dealing with missing data use an imputation strategy [16]. Class center missing value imputation (CCMVI) was developed by [17]; based on experiments carried out on categorical datasets, its classification accuracy values are not better than those of mode imputation. Class center-based imputation can obtain accurate data values when correlation is considered; however, it does not work effectively on datasets with high standard deviation attributes [18]. An adaptive search is an alternative for estimating missing data when considering correlations [19]. It can also estimate missing values [20] in any search problem by maximizing the objective function.

To implement such an adaptive search, this study uses the Firefly algorithm, created by Xin-She Yang at the end of 2007 and the beginning of 2008 [21]. This nature-inspired algorithm has progressed significantly since its inception over a decade ago [22]. The firefly algorithm (FA) is a heuristic optimization algorithm based on the luminescence and attraction behavior of fireflies [23]. The FA is used for a number of reasons, including its effectiveness in solving continuous optimization problems [24], its simplicity and effectiveness as a swarm intelligence algorithm that has garnered significant scholarly attention [25], and its widespread application to complex engineering optimization problems [26]. However, the FA's effectiveness in missing data estimation tasks has not been studied [19]. For missing data imputation, the firefly behavior in which a bright firefly attracts one with weaker brightness can be used: the closest predicted value to the known variable is obtained and then substituted for the missing data [27].

In several previous studies by the authors, a class center-based firefly algorithm was developed for missing data imputation [28] through the consideration of correlation [18], otherwise known as the C3FA algorithm. The overall architecture framework of C3FA can be seen in Fig. 2.

Fig. 2 Architecture framework of C3FA [28]

In further research by the authors, a standardization and outlier identification strategy was employed at the beginning of the imputation method. The results showed that combining normalization and outlier removal in C3FA was an efficient technique for recovering actual data when handling missing values [29]. However, the class center-based firefly algorithm has not previously been analyzed on categorical datasets.

Data with categorical variables must be coded into appropriate vectors using feature engineering [30]. In the preprocessing stage, handling categorical variables is necessary because most machine learning models only consider numerical values; the categorical variables must be numerically converted for the model to recognize and retrieve important information [31]. There are numerous approaches for encoding categorical variables for modeling, one commonly utilized method being target encoding (TE). This method encodes categorical data [32] and is a variant of the continuity scheme based on the value difference metric [33]. Each category is coded based on its effect on the target variable [30]: the average of the target variable for each category is calculated, and the categorical value is replaced with this mean [31]. However, target encoding carries an overfitting risk, where the machine learning model performs well on training data but poorly on test data. One frequent smoothing strategy is to blend the per-category target mean with the global target mean for each data point (smoothing target encoding).

The contribution of this study is the combination of smoothing target encoding (STE) with missing data imputation by the class center-based firefly algorithm. Another contribution is the combination of the previously generated imputation with the standard deviation (STD) of each attribute (C3FA ± STD), which gives different results for each missing rate and has not been done in previous research. In this study, each method selects, from the imputation results, the value with the closest distance to the class center of each attribute after the imputation process. The best results are also compared with the existing methods, mode imputation and decision tree imputation.

Target encoding

Encoding techniques are often needed on machine learning platforms with complex datasets containing features of high cardinality. When the number of levels is too large for indicator encoding to be reasonable, levels are often simply mapped to integer values in an ordered fashion. Another general strategy reduces the number of levels by methods such as hierarchical clustering based on statistics of the target variable. However, these are rarely described in scientific publications [34].

Target Encoding (TE) is often used to encode categorical data [32], where each group is encoded based on its effect on the target variable [30]: the mean of the target variable for each category is calculated and replaces the categorical value [31]. However, target encoding has an overfitting risk, where the machine learning model performs well on training data but poorly on test data. The statistics calculated for a category are also likely to be wildly inaccurate when that category occurs infrequently. The solution to this problem is the addition of smoothing, combining the categorical and overall averages as follows,

$${\text{Smoothing}}\,{\text{Target Encoding}}\;(STE) = w \times TE_{i} + \left( {1 - w} \right) \times \overline{TE}$$
(1)

The weight (w) is a value between 0 and 1, calculated from the category frequency with the following formula,

$$w = n/(n + m)$$
(2)

where n is the number of occurrences of the category in the data and m is the smoothing factor; a larger m-value gives more weight to the overall estimate. TEi is the average within category i, and \(\overline{TE}\) is the overall average across all categories. To examine the difference between the TE and STE results for several m-values, a trial was carried out using the tic tac toe dataset (Fig. 3), which was used as a categorical dataset in a previous study [17].

Fig. 3 Flow map of target encoding and smoothing target encoding on tic tac toe dataset
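To make Eqs. (1) and (2) concrete, the following is a minimal R sketch of TE and STE on a single categorical column. The column names and example values are hypothetical illustrations, not taken from the paper's experiments.

```r
# Smoothing target encoding (STE) of one categorical column x against a
# numeric (0/1) target y, following Eqs. (1) and (2).
smooth_target_encode <- function(x, y, m = 20) {
  global_mean <- mean(y)                       # overall average (TE-bar)
  cat_mean <- tapply(y, x, mean)               # per-category average (TE_i)
  n <- table(x)                                # category frequencies
  w <- n / (n + m)                             # Eq. (2): per-category weight
  ste <- w * cat_mean + (1 - w) * global_mean  # Eq. (1): smoothed encoding
  as.numeric(ste[as.character(x)])             # map each row to its encoding
}

# Hypothetical example: encode one board cell against a binary 'win' target.
df <- data.frame(cell1 = c("x", "o", "b", "x", "x", "o"),
                 win   = c(1, 0, 1, 1, 0, 1))
df$cell1_te  <- ave(df$win, df$cell1)                       # plain TE per row
df$cell1_ste <- smooth_target_encode(df$cell1, df$win, m = 10)
print(df)
```

Note how a rare category ("b" above, with a single occurrence) is pulled strongly toward the global mean, which is exactly the overfitting protection the smoothing provides.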

The evaluation comparison between the two methods (TE and STE) on this dataset is shown in Fig. 4. For classification problems, the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is used to check or visualize performance. During the analysis, we evaluated the TE and STE methods using four further metrics: classification accuracy (CA), precision, recall, and F1-score. Smoothing balances category values with the overall average to reduce the influence of small groups; the smoothing factor was varied from 10 to 100, with larger values pushing the encoding toward the global average. Based on Fig. 4, the STE method had better AUC, classification accuracy, F1-score, precision, and recall values than the TE method. Categorical data were therefore numerically transformed in the next stage using the STE method. A higher AUC value indicates that the model performs better at differentiating between the positive and negative classes.

Fig. 4 Comparison of target encoding accuracy and smoothing target encoding on the tic tac toe dataset

Imputation method

Imputation methods fill in the missing data to produce a complete data matrix that can be analyzed using standard techniques. The imputation methods used in this study are mode imputation, decision tree imputation, and the class center-based firefly algorithm. The missing data mechanism used in the experiments is Missing Completely at Random (MCAR), which is considered the simplest type of missing data to understand [35]. This type of missing data has no pattern among the missing values: the missingness is assumed to be unrelated to any of the observed or missing variables [36,37,38,39,40]. The probability (P) that a data value is missing depends neither on the observed data values nor on the missing data values (P(missing|complete data) = P(missing)).
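As an illustration of the MCAR mechanism, below is a minimal R sketch using the ampute() function of the mice package. This is not the paper's Algorithm 1, and the data frame is synthetic.

```r
# Minimal sketch of MCAR amputation with the mice package: the probability
# that a cell goes missing is independent of observed and unobserved values.
library(mice)

set.seed(42)
complete_data <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

# ampute() introduces missingness into complete data; mech = "MCAR" makes
# the missingness probability independent of the data, prop sets the share
# of incomplete rows.
amp <- ampute(complete_data, prop = 0.4, mech = "MCAR")
incomplete_data <- amp$amp        # the amputed (incomplete) dataset
colMeans(is.na(incomplete_data))  # fraction of missing cells per column
```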

Mode imputation

Mode imputation is one of the most naive and straightforward methods for filling in missing values of categorical variables [41]: the mode of the non-missing values of each variable is used to fill in the missing data [41,42,43]. The imputed value therefore never falls outside the observed minimum or maximum. However, the underlying data distribution is distorted, biasing any estimate other than the mean [44]. It also does not correctly address the uncertainty of the dataset, leading to biased imputation [45]. Additionally, mode imputation in [17] was superior to other methods, including the class center method proposed by Tsai, on MCAR missing data.
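A minimal R sketch of mode imputation for one categorical vector, assuming missing entries are coded as NA:

```r
# Replace each missing entry with the column's most frequent observed value.
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))  # table() ignores NA by default
  x[is.na(x)] <- mode_val
  x
}

v <- c("x", "o", "x", NA, "x", NA, "o")
impute_mode(v)  # the NAs become "x", the most frequent category
```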

Decision tree imputation

This technique was introduced by Shapiro (1987) and Quinlan (1987), where the missing values of each attribute are determined using a decision tree. For each attribute, a separate tree is constructed from the instances with known values, and the unknown values of that attribute are then determined using the appropriate tree [46].

According to Creel and Krotki [47], decision tree nodes are used to define imputation classes, and different imputation methods are then applied within each class [47]. The algorithm handles both numeric and categorical variables and can identify the most relevant attributes while discarding the remainder. However, constructing decision trees is complex and time-consuming, although the resulting imputations have low bias [48]. Other previous studies using this algorithm are [5, 49,50,51]; the stages of imputing missing data [51] are shown in Fig. 5 and listed below, followed by a code sketch.

1. Split the complete dataset (DF) into two sub-datasets: DC (records without missing values) and Di (records with missing values).

2. Construct a set of decision trees on DC, using the attributes that have missing values in Di as class features.

3. Assign each record from Di to the leaf corresponding to its attribute with a missing value; a record with several missing attributes is assigned to more than one leaf.

4. Calculate each categorical missing value using the majority class variable in the leaf.

5. Merge the records to form a complete dataset (D'F).

Fig. 5 The overall block diagram of decision tree imputation [51]
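Following the steps above, here is a hedged R sketch of decision tree imputation using the rpart package. The helper names and the assumption that categorical columns are stored as character vectors are ours, not the cited authors':

```r
# Decision tree imputation: train a tree on complete records (D_C) to
# predict the attribute that is missing, then fill it in the incomplete
# records (D_i). Assumes categorical columns are character vectors.
library(rpart)

impute_with_tree <- function(df, target) {
  miss <- is.na(df[[target]])
  if (!any(miss)) return(df)
  complete_part <- df[!miss, ]                       # D_C for this attribute
  complete_part[[target]] <- factor(complete_part[[target]])
  fit <- rpart(as.formula(paste(target, "~ .")),     # target as class feature
               data = complete_part, method = "class")
  pred <- predict(fit, newdata = df[miss, , drop = FALSE], type = "class")
  df[[target]][miss] <- as.character(pred)           # majority class in leaf
  df
}

# Impute every attribute that has missing values, one tree per attribute,
# then the merged data frame is the completed dataset (D'_F).
impute_all <- function(df) {
  for (col in names(df)[colSums(is.na(df)) > 0]) df <- impute_with_tree(df, col)
  df
}
```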

Class center-based firefly algorithm

The two key components of the firefly algorithm are the variation in light intensity I(x) and the calculation of attractiveness β. In the imputation analogy, a firefly with lower light intensity moves toward a brighter one: records with missing data correspond to the lower-intensity fireflies, while complete records are analogous to those with brighter beam intensity. For imputation, the class center is used as the objective function f(x), meaning that its value is the prefix in determining I(x). For class center-based missing data, the main steps of the firefly algorithm are summarized as follows, with an R sketch of the movement update after the steps.

1. Split the incomplete dataset into complete and incomplete subsets.

2. Calculate the class center and standard deviation of the complete subset for each class i.

3. Use the Euclidean distance to calculate the distance between the class center cent(Di) and the remaining data samples in class i.

$$Dis(cent(D_{i}), j) = \sqrt{\left( x_{i} - cent(D_{i}) \right)^{2}}$$
(3)

4. Calculate the attribute correlation (R) on the complete subset.

$$R_{x_{1} x_{2}} = \frac{n\sum x_{1} x_{2} - \left( \sum x_{1} \right)\left( \sum x_{2} \right)}{\sqrt{\left( n\sum x_{1}^{2} - \left( \sum x_{1} \right)^{2} \right)\left( n\sum x_{2}^{2} - \left( \sum x_{2} \right)^{2} \right)}}$$
(4)

5. The class center, as the objective function f(x), is used to calculate I(x) for each attribute in the incomplete dataset.

$$I(x) = \frac{1}{cent(D_{i})}$$
(5)

6. Determine whether \(I(x) = \frac{1}{x_{i}}\) is greater than \(I(x) = \frac{1}{cent(D_{i})}\). When the data with the highest I(x) is found, revise the movement \(x_{i\_new}^{k}\) using the following movement equations, assuming \(\beta_{0} = 1\), \(r = Dis(cent(D_{i}), j)\), and \(\alpha \in [0,1]\), where β0 is the attractiveness at r = 0, r is the distance between two fireflies, γ is the light absorption coefficient, and α = 0.1 is the step factor.

a. When the class center value (centDi) of the missing data feature is similar to that of the correlated attribute, use:

$$x_{i\_new}^{k} = x_{i\_old}^{k} + \beta_{0} e^{-\gamma r^{2}} \left| centD_{i} - x_{i\_old} \right| + \alpha \left( rand - \frac{1}{2} \right), \text{ with } \gamma = centD_{i}$$
(6)

b. When the centDi value of the missing data feature is less than that of the correlated attribute, use:

$$x_{i\_new}^{k} = x_{i\_old}^{k} + \beta_{0} e^{-\gamma r^{2}} \left| centD_{i} - x_{i\_old} \right| + \alpha \left( rand - \frac{1}{2} \right), \text{ with } \gamma = \frac{centD_{i}}{R} + \left| diff\;of\;centD_{i} \right|$$
(7)

c. When the centDi value of the missing data feature is greater than that of the correlated attribute, use:

$$x_{i\_new}^{k} = x_{i\_old}^{k} + \beta_{0} e^{-\gamma r^{2}} \left| centD_{i} - x_{i\_old} \right| + \alpha \left( rand - \frac{1}{2} \right), \text{ with } \gamma = \left( centD_{i} \times R \right) - \left| diff\;of\;centD_{i} \right|$$
(8)

7. Compare the distance of the data to the class center generated from the previous imputation value ± standard deviation to analyze the imputed results; the closest distance decides the final result.
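The following R sketch illustrates the building blocks above. It is our interpretation of Eqs. (3)-(8), not the authors' reference implementation; in particular, it treats "diff of centDi" as the absolute difference between the two class centers, which is an assumption.

```r
# Illustrative sketch of the C3FA building blocks and one movement update.

euclid_dist <- function(x, cent) sqrt((x - cent)^2)           # Eq. (3)

pearson_R <- function(x1, x2) {                               # Eq. (4)
  n <- length(x1)
  num <- n * sum(x1 * x2) - sum(x1) * sum(x2)
  den <- sqrt((n * sum(x1^2) - sum(x1)^2) * (n * sum(x2^2) - sum(x2)^2))
  num / den
}

# One firefly movement update for a candidate imputation value x_old.
# cent: class center of the feature with missing data; cent_corr: class
# center of the correlated attribute; R: their correlation; r: distance.
c3fa_update <- function(x_old, cent, cent_corr, R, r,
                        beta0 = 1, alpha = 0.1) {
  diff_cent <- abs(cent - cent_corr)      # assumed meaning of "diff of centDi"
  gamma <- if (cent == cent_corr) cent                        # case a, Eq. (6)
           else if (cent < cent_corr) cent / R + diff_cent    # case b, Eq. (7)
           else cent * R - diff_cent                          # case c, Eq. (8)
  x_old + beta0 * exp(-gamma * r^2) * abs(cent - x_old) +
    alpha * (runif(1) - 0.5)              # rand - 1/2 term
}

# Example call with hypothetical values:
set.seed(1)
cent <- 0.61                    # class center of the feature with a gap
r <- euclid_dist(0.40, cent)    # distance of the candidate to the center
c3fa_update(x_old = 0.40, cent = cent, cent_corr = 0.72, R = 0.8, r = r)
```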

Performance evaluation

Imputing missing values is followed by an evaluation of the imputation results. The most popular strategy, known as direct evaluation, compares the actual values in the complete dataset to the estimated values in the incomplete dataset, typically using RMSE. Another method for evaluating imputation quality is to examine the classification performance of classifiers trained on the imputed datasets, using classification accuracy (CA). Different imputation methods applied to the same incomplete dataset are likely to yield different imputation results; higher classification accuracy indicates higher imputation quality of the training dataset, so the most effective imputation methods can be identified [52]. Machine learning evaluation metrics were used to determine the influence of smoothing target encoding on the imputation strategies: AUC, precision, recall, and F1-score, all based on the confusion matrix popularly used in classification problems (Table 1), applied to binary and multiclass classification [53].

Table 1 Confusion matrix for binary classification

The confusion matrix represents the predictions of a machine learning model against the actual conditions of the data. Precision is the ratio of correct positive predictions to all positive predictions.

$${\text{Precision}} = \frac{TP}{TP + FP}$$
(9)

Recall (or sensitivity) is defined as the ratio of true positive predictions to all actual positive cases.

$${\text{Recall}} = \frac{TP}{{TP + FN}}$$
(10)

The F1-score is the harmonic mean of precision and recall.

$${\text{F1 Score}} = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{\text{Precision + Recall}}}$$
(11)

AUC is the area under the ROC (Receiver Operating Characteristic) curve, which describes the trade-off between sensitivity and specificity, with values bounded between 0 and 1. Since it is a common approach for determining the quality of predictions, it offers an overview of the model's overall appropriateness [54]. Furthermore, the classification accuracy of the completed dataset was analyzed using the k-Nearest Neighbors (kNN) algorithm. This is in line with previous studies [52], which showed that kNN is the classifier most widely used to evaluate imputation accuracy. In addition to the evaluation described above, the method proposed in this study is also tested on the RMSE and R-squared values.
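To illustrate the evaluation pipeline, here is a minimal R sketch that trains a kNN classifier (class package) on an imputed dataset and computes Eqs. (9)-(11) from the confusion matrix. The "pos"/"neg" labels and variable names are hypothetical.

```r
# Evaluate an imputed dataset: kNN predictions, then confusion-matrix metrics.
library(class)  # provides knn()

evaluate_knn <- function(train_x, train_y, test_x, test_y, k = 5) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  tp <- sum(pred == "pos" & test_y == "pos")   # true positives
  fp <- sum(pred == "pos" & test_y == "neg")   # false positives
  fn <- sum(pred == "neg" & test_y == "pos")   # false negatives
  precision <- tp / (tp + fp)                                  # Eq. (9)
  recall    <- tp / (tp + fn)                                  # Eq. (10)
  f1        <- 2 * precision * recall / (precision + recall)   # Eq. (11)
  accuracy  <- mean(pred == test_y)            # classification accuracy (CA)
  # AUC can be computed from class probabilities, e.g. with the pROC package.
  c(CA = accuracy, Precision = precision, Recall = recall, F1 = f1)
}
```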

Results

The first stage of this study was the selection of the tic-tac-toe dataset, accessed via the UC Irvine Machine Learning Repository. Dataset information can be seen in Table 2.

Table 2 Dataset Information

The dataset encodes tic tac toe endgame board configurations, where the first nine attributes represent the nine fields on the board and the tenth attribute is the class feature containing the winning status of player x. The stages of the research are shown in Fig. 6.

Fig. 6 Research stages

1. The tic tac toe dataset was encoded using the STE method in RapidMiner (Fig. 7). RapidMiner is a system that facilitates the design and documentation of a comprehensive data mining procedure; it provides not only a nearly exhaustive set of operators but also structures that express the control flow of the process [55].

Fig. 7 Smoothing target encoding using RapidMiner

Algorithm 1 Example amputation for the tic tac toe dataset with missing rate 40% and missing mechanism MCAR

2. The tic tac toe dataset was amputed with missing rates of 10-60% under the MCAR mechanism. The generation of missing values is a crucial step in assessing a missing data methodology; amputation is the procedure whereby missingness is introduced into complete data [56]. The amputation procedure, implemented in the R programming language, is shown in Algorithm 1.

3. The imputation process used mode imputation (MdI), decision tree imputation (DTI), and the proposed method developed in the authors' previous study, the class center-based firefly algorithm (C3FA) [28].

4. The imputation results of the class center-based firefly algorithm (C3FA) were combined in two ways:

   a. combination with the standard deviation (± STD) of each attribute (C3FA ± STD);

   b. comparison of the distance between each data record, the smallest distance being used as a reference for the previously obtained imputation results (C3FA + Dist).

5. Evaluation of the performance of each missing data method.

Imputation was then carried out by replacing the missing values of the categorical variables using the MdI, DTI, and C3FA methods. This was conducted after obtaining tic tac toe datasets with 10-60% missing values under the MCAR mechanism in the previous stage (MCAR_10 to MCAR_60). The following are the analytical results based on the AUC, classification accuracy (CA), F1-score, precision, and recall values.

Table 3 AUC, CA, F1-score, precision, and recall results with mode imputation
Table 4 AUC, CA, F1-score, precision, and recall results with decision tree imputation
Table 5 AUC, CA, F1-score, precision, and recall results with C3FA imputation

Tables 3, 4, and 5 show that the MdI, DTI, and C3FA methods produced AUC, classification accuracy (CA), F1-score, precision, and recall values that decreased as the percentage of missing data under the MCAR mechanism increased. The following is a comparison of the performance of the proposed method with mode imputation and decision tree imputation as state-of-the-art techniques, based on the average value of each metric.

Table 6 Comparison of Average Performance Evaluation

Table 6 shows that the proposed method produces higher average AUC, classification accuracy (CA), F1-score, precision, and recall values than the mode imputation and decision tree imputation methods. Theoretical and empirical work demonstrates that, when comparing two learning algorithms, AUC is superior to classification accuracy based on formal criteria [57]. For the AUC value, our proposed method is 2.6% better than mode imputation and 1.4% better than decision tree imputation. Mode imputation is easy to implement, but it fails to account for relationships between variables and thus underestimates variance. The decision tree imputation (DTI) method and the class center-based firefly algorithm (C3FA), which incorporates attribute correlations into the imputation process, perform better than mode imputation (MdI) because they consider attribute correlation. This result is in line with the fact that the performance of missing data imputation algorithms is significantly affected by the correlations in the data [1, 11, 58,59,60]. Another advantage of the proposed method over the others is the use of the firefly algorithm in the data imputation process to produce an optimal imputation value.

Another contribution of this research is the use of the standard deviation of each attribute in combining the imputation results; this combination has not been used in previous studies. Using the proposed C3FA method, imputation was also carried out by combining the previous results with the standard deviation (± STD) of each attribute. Based on the imputation results of C3FA, C3FA + STD, and C3FA - STD, the distance from each imputed value to the class center was compared, and the smallest distance was used as the reference for the imputation result (C3FA + Dist). Figures 8, 9, 10, 11, 12 present the performance evaluation of each combination.

Fig. 8 Comparison of AUC results with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Fig. 9 Comparison of CA results with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Fig. 10 Comparison of F1-Score results with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Fig. 11 Comparison of Precision results with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Fig. 12 Comparison of Recall results with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Figures 8, 9, 10, 11, 12 show that each combination of the imputed results with the standard deviation and the resulting distance exhibits a different pattern in each evaluation result. Previous studies on data imputation have not used the standard deviation as a consideration for the imputation results; the findings of this study can therefore be tested against the standard imputation methods widely used in previous studies.

The method suggested in this study (C3FA ± STD) is also tested on the RMSE and coefficient of determination (R2) values, in addition to the evaluation outlined in the previous section. Root mean square error (RMSE) is widely recognized as the primary metric for comparing the performance of forecasting methods, as it measures the difference between the imputed value and the original value of a given feature; a value closer to zero indicates superior imputation [61, 62]. The correlation coefficient (r) is one of the most common ways to measure imputation ability, and its square is the coefficient of determination (R2), the amount of variation that can be explained, which lies between 0 and 1. An efficient imputation technique must have an R2 value close to 1 [61,62,63,64].
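A minimal R sketch of this direct evaluation, computing RMSE and R2 between the original (pre-amputation) values and the imputed values; the example numbers are hypothetical:

```r
# Direct evaluation of imputation quality at the amputed positions.
rmse <- function(actual, imputed) sqrt(mean((actual - imputed)^2))  # -> 0 is best
r_squared <- function(actual, imputed) cor(actual, imputed)^2       # -> 1 is best

actual  <- c(0.62, 0.35, 0.71, 0.48)  # hypothetical true values
imputed <- c(0.59, 0.40, 0.69, 0.52)  # hypothetical imputed values
rmse(actual, imputed)
r_squared(actual, imputed)
```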

Fig. 13 Comparison of RMSE with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Fig. 14 Comparison of R Square with C3FA imputation in the tic tac toe dataset MR:10-60%, MCAR

Based on the RMSE and R2 results, the C3FA method is in general the more efficient imputation technique, with RMSE values closer to 0 and R2 values closer to 1 compared to the C3FA ± STD combinations, as can be seen in Figs. 13 and 14. Predictive accuracy (PAC) relates to the ability of an imputation technique to recover the true values in the data, as measured by R2 and root mean squared error (RMSE).

Analysis and discussion

At the preprocessing stage, the categorical variables were converted to numeric data so that the model could understand and retrieve useful information [31]. One method for numerically converting categorical data is target encoding, whose limitations include overfitting risk and inaccuracy on infrequent categories. The simulation results showed that the smoothing target encoding technique produced better classification accuracy values than TE. Using the tic tac toe dataset, differences were observed in the AUC, classification accuracy, precision, F1-score, and recall values of the C3FA method under several imputation combinations, namely C3FA + STD, C3FA - STD, and C3FA + Dist, as can be seen in Figs. 8, 9, 10, 11, 12. The C3FA + Dist method produced the best evaluation values at 10% missing data; the C3FA method produced the best values at 20%, 40%, and 50%; and C3FA - STD performed best at 30% and 60%. According to the average AUC, classification accuracy, precision, F1-score, and recall values, the C3FA and C3FA - STD methods had advantages over the other combinations, as shown in Fig. 15.

Fig. 15 Comparison of the performance of the C3FA imputation method in the tic tac toe dataset MR:10-60%, MCAR

The smoothing target encoding method was used in the preprocessing stage before the imputation process within the C3FA method, which produced different patterns at different missing rates for each model. A new result was that, when comparing the C3FA + Dist, C3FA + STD, and C3FA - STD methods, no single combination consistently produced better performance values. C3FA + Dist had better results at low missing rates (< 20%). The C3FA method performed well at 40% missing data under the MCAR mechanism, in line with previous research [28, 29]. Other experimental results showed that the C3FA - STD method produced the best performance evaluation when the dataset had a reasonably high amount of missing data (60%). However, based on the evaluation using RMSE and R-squared values, the C3FA method showed better performance than C3FA ± STD.

Conclusion

In the preprocessing stage, categorical variables are significant because machine learning models mostly consider numerical values; the categorical variables were therefore numerically converted for the model to understand and retrieve helpful information. Using the tic tac toe dataset and three imputation methods, the proposed C3FA-STD method produced AUC, CA, F1-score, precision, and recall values of 0.939, 0.882, 0.881, 0.881, and 0.882, respectively. These values outperformed the MdI and DTI methods when using the kNN classifier: both mode imputation, the best method for categorical data in previous studies [17], and decision tree imputation.

Standard deviation is a statistical measure that, alongside the class center and the correlation used in previous studies, can be considered in producing missing data imputation results. In the class center-based method, the correlation of attributes is utilized in the imputation process because the attributes are related to one another, and the imputation process is expected to produce a value that is optimal or closest to the actual one. To account for this correlation in the imputation process, a firefly algorithm (FA) was used. FA variants developed in other optimization studies, such as [23,24,25,26], can be tried in further research to handle missing data with a class center approach.

In the class center-based method, statistical measures such as the standard deviation were used to combine the imputed results, because the standard deviation indicates how close the data samples are to the mean. In the simulation results, differences were observed in the performance evaluation of the proposed method across the combinations of the imputation outputs with the standard deviation.

To validate the performance of a missing data imputation technique and arrive at a definitive conclusion, evaluation is an extremely important step. An imputation technique can be evaluated in several ways, the most important being direct evaluation, the classification accuracy of the classifiers, and the computational time; however, all three evaluation approaches have not been used together in any of the related studies [52]. One future challenge of this research is therefore evaluation based on computational time. In addition, most imputation methods in previous studies were tested on only one missing data mechanism (MCAR, MAR, or MNAR). Further research will test datasets grouped by missing rate percentage under not only the MCAR mechanism, but also MAR and MNAR.