Introduction

Missing data is a general weakness capable of making the prediction system ineffective [1,2,3]. Therefore, ignoring it adversely affects the analysis [4,5,6,7,8,9], learning outcomes, prediction[10] and potentially weakens the validity of the results and conclusions [8, 9], thereby leading to the estimation of biased parameters [7, 11,12,13,14]. Prediction and classification are the principle obligations needed in many areas and spaces to obtain accurate data [15].

Data mining is the festivity commonly found among the most emerging fields of the current epoch. It is associated with extensive data generation, thereby leading to the need to mine interesting trends and patterns [16]. Data mining methods only work with complete data [1, 17,18,19], however this is dependent on the amount of missing data/rate and the domain. The Do Not Impute (DNI) method is used to ensure all missing data remains unreplaced. Therefore, the networks need to use their default missing values strategies [20]. However, in practice, most data analysis techniques are not robust to missing values, hence, they are filled in advance [21]. The issues related to the missing data is the research opportunities used to obtain the correct procedures [22].

Class center missing value imputation (CCMVI) is a hybrid process that uses the statistical and machine learning approaches with the baseline imputation method using k-NN to combine the imputation method [23]. Many techniques do not correlate the data attributes due to their categorization suitability [24]. In previous studies, most of the data estimation methods were missing, with the inputs dependent on the attributes [25]. The performance of the imputation algorithm of the missing data is significantly influenced by the correlation structure in the data [1, 4, 26], as well as its missing distribution [27, 28], and the proportion mechanisms [1].

The adaptive search procedure is possible for the new techniques to estimate missing values in accordance with the correlation between variables [29]. This procedure helps in determining the estimation of missing data by optimizing objective functions in each search problem given [30]. In late 2007 and early 2008, X.S Yang developed the Firefly Algorithm (FA), which implemented an adaptive search procedure [31]. Firefly Algorithm is an optimization algorithm inspired by nature which is considered matured due to its inception approximately ten years ago [32]. The behavior of fireflies during which the brighter ones comprises weaker flashes is applied within the imputation of missing data. It is carried out by determining the estimated value closest to the known and replacing the missing ones [9].

This research proposed the class center-based imputation method of missing data using the Firefly Algorithm in the imputation process. Furthermore, the research is developed from preliminary studies carried out by the author [33]. This research contributed to the imputation method based on the class center by considering the correlation of attributes with the imputation stage combined with the Firefly algorithm (C3-FA). In previous studies, the firefly algorithm was considered in the imputation process based on the class center that has never been used. Therefore, the advantage of this method over others is that it is able to reproduce the actual values and maintain adequate distribution without knowing the original data. The proposed technique is tested on the iris, wine, ecoli, and sonar datasets. In addition, the imputation performance is measured based on the predictive, distributional, and classification accuracies.

The remaining section of this research is organized as follows: In ‘Related work' section, a brief review of some related work on missing data imputation was analyzed. The authors' detailed approaches for handling missing data are proposed in section 'The proposed methods.' Also, the research experiments are presented in section 'Experimental results.' Discussion, future work, and conclusion are presented in section 'Discussion and conclusions.’

Related work

Methods for handling missing data

The methods used to handle missing data are dependent on the type of data and needs. Therefore, the imputation techniques are classified into two types, namely statistical-based and machine learning [34]. The statistical methods widely used in previous studies were Expectation maximization (EM), Linear/logistic regression (LR), Least squares (LS), and mean/mode, while machine learning includes Decision Tree (DT), clustering, k-Nearest Neighbor (k-NN), and Random Forest (RF) [35]. In 2018, Tsai proposed a class center-based missing data imputation method, which is a combination of statistical and machine learning techniques. Figure 1 shows a method for handling missing data.

Fig. 1
figure 1

Methods for Handling Missing data

Class center based missing value imputation (CCMVI) algorithm

The CCMVI is a hybrid method that combines imputations through statistical approaches and machine learning [23]. This study comprises of two modules, namely threshold identification (Module A) and imputation (Module B). The algorithm of each stage is shown in algorithms 1 and 2.

figure a
figure b

Some of the issues associated with the CCMVI include the proposed method only uses a simple distance function, namely Euclid, and does not compare performance based on the MAR and MNAR mechanisms. In addition, the simulation results showed that the imputation performance of the proposed CCMVI is not better than the statistical approach for categorical data.

Effect of attribute correlation on missing data algorithm

A lot of techniques ignore the correlation between attributes, despite their suitability for categorical data [24]. The missing data imputation algorithm's performance is significantly influenced by factors, such as the correlation structure, missing mechanism, entries distribution, and the values percentage [1].

In a study entitled, "Missing data imputation for the analysis of incomplete traffic accident data," Deb and Liew proposed an algorithm (see in algorithm 3) using a decision tree to determine the correlated attributes. The measure used was the IS and the weighted similarity. Therefore, the algorithm outperforms several popular imputation methods on the traffic accident dataset with the condition that a large number of attributes are categorical [5].

figure c

For high dimensional cases, the nearest neighbor approach's imputation method proposes a new distance that explicitly uses the correlation between variables. This method automatically selects the relevant variables contributing to distance, while the simulation results showed that performance depends on their correlation [26]. In addition, the accuracy of the imputation is based on the dependency level of the missing data with other variables in the dataset. This can be ignored for completely uncorrelated attributes [4].

Firefly algorithm—adaptive search procedure

The Firefly algorithm applies the adaptive search procedure developed by X.S Yang in late 2007 and early 2008 [31]. It is an optimization algorithm inspired by nature significantly developed for approximately 10 years [32]. Therefore, to properly design the Firefly algorithm, two important issues need to be defined: attraction and light intensity variation. The objective function influences this intensity, and the level for a firefly minimizing problem x is expressed as \(I(x) = \frac{1}{f(x)}\). The value of (x) is the level of light intensity on the firefly x, which is inversely proportional to the solution of the problem's objective function to be searched f(x).

The Attractiveness β is of relative value because the light intensity needs to be determined and judged by other fireflies. However, the assessment results tend to differ depending on the distance between one firefly and another rij. Meanwhile, the intensity decreases from the source because it is absorbed by the media, such as air. The Attractiveness (β) with a distance r is calculated in Eq. (1).

$$\beta = \beta_{0} e^{{ - \lambda r^{2} }}$$
(1)

where β0 denotes the attraction when there is no distance between the fireflies i.e. (r = 0) and γ ∈ [0, ∞) is the light absorption coefficient. The distance between the two fireflies i and j at positions xi and xj is the Euclidean distance calculated in Eq. (2).

$$r_{ij} = \left\| {x_{i} - x_{j} } \right\| = \sqrt {\sum\limits_{k = 1}^{n} {(x_{i}^{k} - x_{j}^{k} } )}$$
(2)

where xik is the k-th component of xi in firefly.

The movement carried out by firefly i, due to its attraction to j with brighter light intensity, changes its position according to Eq. (3).

$$x_{i\_new}^{k} = x_{i\_old}^{k} + \beta_{0} e^{{ - \lambda r^{2} }} (x_{j\_old}^{k} - x_{i\_old}^{k} ) + \alpha (rand - \frac{1}{2})$$
(3)

The first term is the old position of the firefly, the second occurs due to attraction, while the third term is a random firefly movement where α is the parameter coefficient and rand is a real number in the interval [0,1]. Most of the Firefly Algorithm implementations use β0 = 1, α ∈ [0,1] and γ ∈ [0, ∞).

The proposed methods

Basically, there is no generic imputation method used in determining data loss. Therefore, this research is carried out to obtain the missing data by considering the attribute relationship/correlation, which is a development of class center-based imputation. Furthermore, the consideration was carried out in accordance with the weaknesses of the previous method in the category data. Preliminary studies with categorical and attribute correlation yielded very good accuracy, which showed that the class center-based imputation method is an efficient imputation technique used to obtain the data's actual value.

An adaptive search procedure can be used to estimate the missing data that correlates with the associated variables. The Firefly algorithm implements the search procedure in such a way that the bright fireflies attract the weak ones, with the behavior applied in the imputation of missing data. The bright fireflies represent the non-missing data, while the weaker ones denote the missing dataset. The equations related to light intensity, attractiveness, distance, and firefly movement were applied in the imputation process using the class center approach and considering the correlation between the attributes. The result was therefore analyzed by comparing the distance obtained from the imputation ± std result. The overall architecture novel framework for missing data imputation (C3-FA) can be seen in Fig. 2.

Fig. 2
figure 2

The Overall Architecture Novel Framework for Missing data Imputation (C3-FA)

In general, this research is divided into three parts, data collection, data imputation, and performance imputation, as shown in Fig. 3.

Fig. 3
figure 3

The research step

Data collection

The beginning of this work consists of accessible datasets like iris, wine, Ecoli, and sonar. All datasets were gathered from the Kaggle, UCI Machine Learning Repository, and Knowledge Extraction based on Evolutionary Learning. Table 1 shows the characteristics of the dataset to be used in the simulation, with the relationships between variables and the probability of missing data indicated by MCAR missing mechanisms.

Table 1 Summary of dataset characteristics

The steps associated with data collection are summarized in the following pseudocode:

  • The dataset is divided into two parts, namely complete (Di_complete) and incomplete subsets (Di_incomplete)

  • For the ith class, calculate its class center (cent(Di), and variance/standard deviation (stdi) of Di_complete. The cent(Di) is used to determine each data attribute's average value for class i of the complete subset.

  • Calculate the distances between cent(Di) and other data samples in class i using Euclidean distance formula (4)

    $$Dis(cent(D_{i} ),i) = \sqrt {\left( {x_{i} - cent(D_{i} )} \right)^{2} }$$
    (4)
  • Calculate attribute correlations (R) of complete subset using formula (5).

    $$R_{{x_{1} x_{2} }} = \frac{{n\sum {x_{1} x_{2} } - \left( {\sum {x_{1} } } \right)\left( {\sum {x_{2} } } \right)}}{{\sqrt {\left( {n\sum {x_{1}^{2} } - \left( {\sum {x_{1} } } \right)^{2} } \right)\left( {n\sum {x_{2}^{2} } - \left( {\sum {x_{2} } } \right)^{2} } \right)} }}$$
    (5)

Data imputation

In the firefly algorithm, the intensity (I) of light is influenced by the objective function. Furthermore, the firefly pattern in which those with dimmer light intensity approach the brighter group is determined in the missing data's imputation. Dim and brighter light fireflies are analogous to the missing and complete data attributes.

The class center as the basis of imputation is used as the objective function f(x). Therefore, the class center's value is used as the first step in determining I(x) where x is the attribute in the data. The steps of data imputation are summarized as follows:

  • For each attribute in the complete subset, calculate I(x) based on the value of the objective function f(x), which is the class center, where

    $$I(x) = \frac{1}{{CentD_{i} }}$$
    (6)
  • Find the value \(I(x) = \frac{1}{{x_{i} }}\) that is greater than \(I(x) = \frac{1}{{CentD_{i} }}\). When a data greater than I(x) is obtained, the data movement \(x_{i\_new}^{k}\) is updated using Eq. (7) with \(\beta_{0} = 1\) based on previous studies, \(r = Dis(cent(D_{i} ),j)\), \(\alpha \in \left[ {0,1} \right]\), and rand is random numbers whose range is between [0,1].

    $$x_{i\_new}^{k} = x_{i\_old}^{k} + \beta_{0} e^{{ - \gamma r^{2} }} \left| {centD_{i} - x_{i\_old} } \right| + \alpha \left( {rand - \frac{1}{2}} \right)$$
    (7)
  • If the cent(Di) value of the attribute that contains missing data is the same as the cent(Di) of correlated attribute data, use \(\gamma = centD_{i}\)

  • If the cent(Di) value of the attribute that contains missing data is smaller than the cent(Di) of correlated attribute data, use \(\gamma = \left( {{\raise0.7ex\hbox{${centD_{i} }$} \!\mathord{\left/ {\vphantom {{centD_{i} } {R_{{x_{1} x_{2} }} }}}\right.\kern-\nulldelimiterspace} \!\lower0.7ex\hbox{${R_{{x_{1} x_{2} }} }$}}} \right) + \left| {diff\;of\;centD_{i} } \right|\)

  • If the cent(Di) value of the attribute that contains missing data is greater than the cent(Di) of attribute data that is correlated, using \(\gamma = \left( {centD_{i} \times R_{{x_{1} x_{2} }} } \right) - \left| {diff\;of\;centD_{i} } \right|\)

Analyse the imputation results by comparing the distance of the data with the class center generated from the previous imputation value +—stdi. The imputed values with the shortest distance from the class center are used to replace the missing data j.

Imputation performance

Evaluation of imputation performance is based on predictive accuracy (PAC), distributional accuracy (DAC), and Classification Accuracy. The efficiency of imputation techniques is PAC, which aims at obtaining the real value in the data. Pearson Correlation Coefficient (r) and Root Mean-Squared Error (RMSE) are two measures for evaluation of PAC [27, 28]. Furthermore, the Pearson correlation coefficient provides a measure of the correlation between the value of the imputation results with the actual. An imputation technique is efficient when the correlation value close to 1 [27, 28]. Therefore, when \(x\) and \(\hat{x}\) are the attribute values in the complete and incomplete data, then the correlation coefficient is calculated using the following formula (8)

$$r = \frac{{\sum\limits_{i = 1}^{n} {(x_{i} - \overline{x}_{i} )(\hat{x}_{i} - \overline{\hat{x}}_{i} )} }}{{\sqrt {\sum\limits_{i = 1}^{n} {(x_{i} - \overline{x}_{i} )^{2} \sum\limits_{i = 1}^{n} {(\hat{x}_{i} - \overline{\hat{x}}_{i} )^{2} } } } }}$$
(8)

The Root Mean Squared Error (RMSE) is the most benchmark for comparison and performance of prediction strategies by measuring the distinction between the imputation and real values. In this case, a value closer to 0 results in a better imputation [27, 28], with the fit to model calculated using the RMSE, which describes the close relationship of the predicted value to the true value. When the RMSE value is lower (error value), the predictions become better. Therefore, the RMSE is calculated using the following formula (9).

$$\sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {(x_{i} - \hat{x}_{i} )^{2} } }$$
(9)

where \(\hat{x}_{i}\) is the actual value, \(\hat{x}_{i}\) is the predicted value, and n is the total number of missing data.

DAC represents the technical ability to maintain the true distribution of data values. It was assessed using the Kolmogorov–Smirnov distance (DKS). Therefore, when \(F_{x}\) and \(F_{{\hat{x}}}\) are the empirical cumulative distribution functions of \(x\) and \(\hat{x}\) then DKS is calculated using formula (10). Smaller distance values represent better imputation results [27, 28].

$$D_{KS} = \left\| {F_{x} - F_{{\hat{x}}} } \right\|$$
(10)

Another strategy used to evaluate the performance imputation is to look at the classification accuracy of some chosen classifiers trained by the imputed datasets. Decision Tree (DT), k-nearest neighbors (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM) are the top classifiers constructed for evaluating the imputation results [35].

Experiment result

Evaluation of Predictive Accuracy, Distributional Accuracy, and classification accuracy on class center-based firefly algorithm for missing data imputation is determined based on each dataset's experiments.

Predictive accuracy (PAC)

The correlation coefficient provides a measure of the correlation between the worth of the imputation results with the value. Therefore, an increase in the value of r indicates that the imputation method used is efficient. Root Mean Square Error (RMSE) is the most benchmark for comparing the performance of methods by measuring the difference between the imputation and original values. Table 2 lists the result for correlation coefficient and RMSE with different datasets.

Table 2 Correlation Coefficient and RMSE over datasets

The average of r in the iris, wine, Ecoli, and sonar dataset is close to 1, indicating a correlation between the value of the imputation results and the missing data. In addition, the average value of RMSE is closer to 0, which means that there is no difference between imputation and real values. Furthermore, there is an increase in the value of RMSE in the wine dataset because it comprises three attributes with a standard deviation. The comparison of RMSE values with another imputation method such as SVM based on RBF kernel, KNNI, weighted voting random forests (WRF), feature weighted grey KNN (FKKNI), and class center missing data imputation (CCMVI) in previous research [23] are shown in Table 3.

Table 3 RMSE for different imputation Methods Vs Proposed Method

Table 3 shows that the proposed method has a smaller RMSE value.

Distributional accuracy (DAC)

Imputation methods need to be able to maintain the distribution of these values through the evaluation of distributional accuracy (DAC). Therefore, this shows that the distribution of data after imputation does not change with the original data distribution using The Kolmogorov–Smirnov test. This statistic test quantifies a distance between the empirical distribution performed by a dataset once imputation and, therefore, the original dataset's cumulative distribution function as a reference distribution. Table 4 shows the simulation results of Kolmogorov–Smirnov distance (Dks).

Table 4 The value of Dks over datasets

Based on the result in Table 4, the class center-based firefly algorithm imputation can maintain the missing data distribution.

Classification accuracy

Classification accuracy is additionally examined to measure the variations between the initial values and those imputed by the proposed method. This accuracy testing uses several classification algorithms, Decision Tree (DT), Support Vector Machine (SVM), and k-Nearest Neighbor (KNN). Table 5 lists the classification performance for various datasets. The results showed that, on average, the proposed technique makes the SVM classifier offer the best classification accuracy rate.

Table 5 Classification accuracy test result

Discussion and conclusions

The three problems considered in the experimental procedure for missing data imputation include the selection of dataset, imputation methods, and evaluation results (Lin and Tsai, 2020). The selection of dataset for experiment relates to the problem domain (general or specific), the completeness of trial data (complete or incomplete), the type of test (numeric, categorical, or mixed), the scenario of data loss (MCAR, MAR, MNAR), and its percentage (missing rate). Meanwhile, regarding the imputation method, there is no comprehensive study that compares missing data imputation techniques in different dataset domains, with various rates and mechanisms [35]. These findings make it possible to understand the most suitable technique for an incomplete dataset type.

The adaptive search procedure performed in the Firefly algorithm is used to overcome the missing data in a dataset. In addition, the use of the class center as an initial objective function helps to determine the most optimal imputation value. Therefore, based on the simulation results from the datasets used, the general result showed that the class center-based firefly algorithm is an efficient technique for determining the actual value in handling the missing data. This is indicated by the values of the Pearson Correlation Coefficient (r) and Root Mean Squared Error (RMSE), which are closer to 1 and 0, respectively. In addition, the proposed method tends to maintain the true distribution of data values. This is indicated by the Kolmogorov–Smirnov test, which stated that the value of DKS for most of the attributes in the dataset is generally closer to 0. Both results are in line with the fact that the imputation method can ideally reproduce actual values in data or Predictive Accuracy (PAC) and maintain the distribution of those values or Distributional Accuracy (DAC) [36]. However, some findings were obtained from the simulation results of wine datasets. Therefore, when the value of the standard deviation is high (more than 1), RMSE tends to increase.

Furthermore, previous studies' imputation methods were only tested on one missing data mechanism (MCAR/MAR/MNAR). Therefore, further studies need to conduct tests by classifying the dataset based on the missing rate, starting from 10%, 20%, 30%, 40%, and 50% with the three mechanisms. Also, this follow-up study is expected to produce some rules from the different classes in the dataset with varying number of samples (S), Attribute (A), SA ratio, type of dataset, and percentage of missing data (missing rate). In addition, studies need to be conducted to determine the best imputation method for missing data on attributes that the value of standard deviations is high.