An application of machine learning regression to feature selection: a study of logistics performance and economic attribute

This study demonstrates how up-to-date, dynamic economic big data can be used to select the economic attributes that indicate logistics performance as reflected by the Logistics Performance Index (LPI). The analytical technique applies machine learning (ML) regression to an adequate set of economic features. The goal of this research is to determine the set of economic attributes that best characterizes a country's logistics performance as the predicted variable. In addition, several candidate ML regression algorithms are compared to optimize prediction accuracy. Feature selection uses the filter techniques of correlation and principal component analysis (PCA) as well as the embedded techniques of LASSO and Elastic-net regression. Then, based on the selected features, the ML regression approaches artificial neural network (ANN), multi-layer perceptron (MLP), support vector regression (SVR), random forest regression (RFR), and Ridge regression are used to train and validate the data set. The findings demonstrate that the PCA and Elastic-net feature sets give the closest to adequate performance based on the error measurement criteria. A feature union and intersection procedure is then applied to these acceptable feature sets to make a more precise decision, and the union of feature sets yields the best results. The findings suggest that ML algorithms are capable of assisting in the selection of a proper set of economic factors that indicate a country's logistics performance. Furthermore, the ANN was shown to be the most effective prediction model in this investigation.


Introduction
The World Bank's Logistics Performance Index (LPI) is well-known practical information available to policymakers for judging a country's logistics performance [1]. The LPI, which has been assessing nations' logistics performance on a biennial basis by analyzing survey data, is perhaps the most important tool to emerge from trade facilitation. It delivers a macroeconomic overview of how policymakers may favorably affect global supply chain capabilities and the performance of relevant businesses, representing the efficiency of the clearance process, trade quality, and transportation-related infrastructure [2]. It would be advantageous to have constant and up-to-date information reflecting a country's logistical efficiency. Such information could be used to monitor changes in variable data and to continuously evaluate trends or predict logistics efficiency, giving policymakers rapid access to projected logistics performance so that they can improve the country's logistics and supply chain capabilities. Prior research has shown that variables such as institutional reforms and resource enhancements significantly accelerate logistics performance. According to [3], countries with a low level of corruption and a stable political environment are more likely to have a high level of logistics performance, and improvements in resource supply such as infrastructure, technology, labor, and education related to the country's competitiveness have a significant positive effect on performance.
Similarly, [1] emphasizes that governance weaknesses and societal instability might reduce performance. However, the aforementioned study variables are static data, and certain components were also collected through the survey.
Among these components, the economic one in particular has shown a significant link between a nation's logistics performance (expressed by LPI scores) and economic elements such as the country's economic development indicators [1,4]. Examples include GDP per capita [1] and export and import volume; LPI components have a substantial positive influence on expanding international trade in both imports and exports [5]. Nevertheless, substantial economic aspects connected with logistics efficiency remain unstudied.
This study demonstrates how up-to-date, dynamic economic big data can be used to select the economic attributes that indicate logistics performance as reflected by the LPI. The analytical technique applies machine learning (ML) prediction, or regression, to an adequate subset of economic features. The choice of features matters because the accuracy of ML prediction outcomes depends not only on the model structure and the associated training algorithm but also on the feature space constructed from the initial feature set and the feature selection algorithm [6]. In ML applications, feature selection is often employed as part of the pre-processing phase to obtain a subset of features by removing elements with minimal predictive information [7].
There are two aims in this study: (1) to determine the ideal subset of economic features that best represents a particular anticipated variable for predicting a country's logistics performance and (2) to improve prediction accuracy by employing a set of alternative ML regression algorithms. This research looks into two major research questions: to begin, can ML algorithms assist in selecting the proper subset of economic features that reflect the country's logistics performance? Second, what is the appropriate ML regression approach for predicting logistics performance based on certain economic attributes? This paper's structure is as follows. Section 2 includes a review of the literature on feature selection and regression machine learning. Section 3 contains methodology, which includes the ML feature selection procedure, data sources and data preparation, data analysis, and parameter setup. Sections 4 and 5 provide the results and discussion, as well as the concluding remark and future work.
Literature review

Feature selection
The process of selecting a subset of important features, especially variables, for model building is known as feature selection. It is well recognized that a subset of relevant features may be beneficial in improving model performance. With minimum information loss, the feature selection process aims to eliminate duplicate or superfluous features and other features that are closely connected in the data. It is often used to make the model more comprehensible and to improve generality by decreasing variance [8]. The three forms of feature selection strategies are filter, wrapper, and embedded.
To rank all features, filter techniques rely on statistical properties. The majority of filter techniques treat feature selection as a ranking problem. These techniques are independent of the learning algorithm that will be used with the selected subset [9]. This category comprises mutual information-based, correlation-based, Chi-square test-based, and principal component analysis-based techniques. Filter techniques are often used on high-dimensional datasets because of their processing efficiency [8].
Wrapper techniques select feature subsets based on how useful they are to a certain predictor or classifier. Selection is viewed as a search problem in these techniques, with various feature combinations created, evaluated, and compared to other combinations. The search is driven by heuristic intelligent optimization techniques. Simplified methods, such as sequential search, or evolutionary algorithms, such as particle swarm optimization (PSO) or genetic algorithm (GA), which generate locally optimal results and are computationally viable, are used to achieve good results [6]. As a selection criterion, these approaches use the performance of the inductive algorithm. They wrap the learning algorithm with feature selection and estimate the benefit of adding or deleting a feature using cross-validation [9]. Wrapper techniques tend to outperform filter methods because, in each cycle, the prediction model or learning algorithm itself evaluates a different feature subset [10].
By embedding feature selection within the model learning, embedded methods provide a trade-off solution between filter and wrapper methods. They return both the learned model and the selected features at the same time [11]. The learning and feature selection components of embedded techniques cannot be separated [12]. Embedded techniques build feature selection into the model training process; applying regularization while the model is being trained is a frequent example [6]. Regularization models such as sparse linear discriminant analysis, regularized support vector machine (SVM), and LASSO are the most commonly used embedded approaches [11]. As an instance of an embedded method, the LASSO technique regularizes the parameters of a linear model using an L1-norm penalty, shrinking the coefficients of the less correlated features to zero [13]. Many new sparse learning approaches for multi-class classification have been suggested, including L2,1-norm regularized regression models [11].
Since there are several methods for selecting features, in our work we used filter techniques to rank the predictors under consideration. Then, using well-known ML regression methods, we evaluated several candidate subsets of selected features with a particular learning algorithm and ranked those that performed best. Furthermore, this work compares the embedded techniques of penalized linear regression, which are helpful because they allow for simultaneous feature selection and prediction [14].

ML of regression
ML regression approaches are rapidly being used. The following fundamental techniques are utilized in the literature: artificial neural network (ANN), SVM, and random forest (RF) [15]. These are effective data-driven techniques. These models can also give regression results in a variety of fields, including energy, environmental, waste and pollution, medical, information technology, finance, and business and economics.
ANN is one of the most significant artificial intelligence approaches [16], and extensions of MLP-ANN models are possibly the most popular and frequently used in ML prediction. These models serve as a reliable tool for making precise predictions [17]. In the environmental field, [18] use nonlinear regression techniques to predict ozone levels from the main pollutants and meteorological variables; the ANN results were satisfactory and showed its robustness as a tool for evaluating and forecasting air quality. In medicine, where improving simulation for medical operations requires a versatile and reliable predictor of body size, shape, and ligament thickness, an ANN using clinical data can predict patient conditions and produced more accurate results than traditional regression analysis methods [19]. In waste management, MLP-ANN has been validated for predicting the yearly generation rates of household, commercial, and construction and demolition wastes. The MLP-ANN models demonstrated high prediction accuracy, making them useful for forecasting waste generation rates from various sources and potentially a cost-effective basis for developing integrated municipal solid waste management systems [20]. In business and economics, researchers have proposed ANN regression models to cope with the challenge of predicting GDP growth, demonstrating that the ANN model can predict GDP growth rates significantly more accurately than a corresponding linear model [21]. Moreover, ANN models are appropriate tools for analyzing economic data such as GDP and GDP per capita since they allow for a trade-off between the capacity to predict these features and model size [21,22]. MLP-ANN was used by [23] to predict customer quality in e-commerce social networks, where it produced a strong model for predicting which referrers in word-of-mouth marketing will attract high-quality referrals (in terms of transaction volume). The MLP-ANN technique also performed better in estimating building costs [16].
The use of SVM for classification and regression problems has grown significantly. Support vector regression (SVR) is a subset of SVM [24]. SVR techniques are typically used for prediction, and the results are usually satisfactory. One example is the prediction of a large office building's energy consumption, in which the summer hourly cooling load statistics are used as energy consumption data. In terms of accuracy, robustness, and generalization ability, the findings show that the suggested SVR-based technique outperforms commonly used methods [25]. In the information technology area, [26] used ε-SVR and ν-SVR with linear, polynomial, radial basis function, and sigmoid kernels to predict software enhancement effort. The prediction accuracies of both types of SVR were statistically better than those of statistical regressions. In banking, SVR and enhanced versions of SVR are used to forecast corporate bond losses in the event of default [27]. Overall, these empirical findings indicate that SVR techniques are a promising method for banks to predict loss given default.
Random forest regression (RFR) is a widely used method for analyzing high-dimensional data. Because of poor predictors, its advantages may be reduced in sparse environments, necessitating a pre-estimation dimension reduction (targeting) phase. Nonetheless, this approach is usually useful for prediction and frequently produces an adequate outcome. For example, [28] suggested RFR-based techniques for battery capacity estimation, with experimental findings demonstrating that the proposed technique can evaluate the health status of various batteries and is promising for online battery capacity estimation. RFR is also regarded as a particularly promising approach for large-scale modeling of groundwater nitrate contamination; when compared with multiple linear regression (MLR), RFR had a significantly better predictive potential than a typical linear regression model [29]. In economics, [30] used RFR to estimate GDP at the town scale and compared it with MLR, with the RFR model achieving considerably greater accuracy.

Methodology
ML feature selection process
Figure 1 depicts the feature selection process, which merges two major approaches. First, the filter techniques of correlation and principal component analysis (PCA) are used to identify candidate feature sets. Second, the embedded technique employs penalized linear regression, namely LASSO and Elastic-net (E-net) regression. In the case of LASSO and E-net, the selected features can be validated immediately after the model has been trained. For the candidate feature sets from the filter methods, the data associated with each feature set are used to train ANN, MLP-ANN, SVR, RFR, and Ridge regression. LASSO and E-net select subsets with high accuracy but lack optimal prediction rates. Therefore, the features selected by both regression algorithms are also used to train the suggested ML regression methods. The models based on each set of features are then validated using the test dataset. Finally, the models' performance is evaluated.

LPI of World Bank
The World Bank's Logistics Performance Index (LPI) is a biennially announced indicator for judging a country's logistics performance. For this research period from 2010 to 2018 (5 periods), there are 134 countries with LPI information available (see Table 11 of the 'Appendix').

The economic statistics data of S&P global market intelligence
The economic features are derived from the macroeconomic data of S&P Global Market Intelligence's economic statistics. The economic and demographic statistics data provide the macroeconomic attribute, which includes 52 features divided into five categories: (1) market size and growth (total of 10 features), (2) macroeconomic stability (total of 17 features), (3) personal income and labor (total of 9 features), (4) external sector (total of 13 features), and (5) tax rates (total of 3 features).

Data preparation
For data preparation, the study covers 5 periods from 2009 to 2018 (biennial averages matched to the LPI data). An initial 26 features were selected, which are available for 100 countries when mapped to the LPI data (see Table 11 of the 'Appendix'). This choice is motivated by the trade-off between the number of instances and the missing values in the country economic data. The number of instances should be maximized because accuracy is typically high with big datasets [31], while missing values should be minimized because they can cause bias and reduce analytical efficiency [32].

Fig. 1 The ML feature selection process
For the dataset of 500 instances (5 periods of 100 countries), 70% of the data is utilized as a training set for ML methods, while the remainder is used to test or validate model performance. Furthermore, just one missing item in the dataset is replaced with a mean value based on the country's economic attribute.
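The preparation steps above can be sketched as follows. This is a minimal illustration only: the records are hypothetical, the field names `country`, `period`, and `gdp_c` are invented for the example, and the text does not say whether the 70/30 split was random, so a seeded random shuffle is assumed here.

```python
import random

def impute_by_country_mean(records, key):
    """Fill missing `key` values with the mean of that country's observed values."""
    observed = {}
    for rec in records:
        if rec[key] is not None:
            observed.setdefault(rec["country"], []).append(rec[key])
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input list is left untouched
        if rec[key] is None:
            values = observed[rec["country"]]
            rec[key] = sum(values) / len(values)
        filled.append(rec)
    return filled

def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle reproducibly, then cut into training and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 100 countries x 5 periods = 500 instances.
records = [
    {"country": f"C{c}", "period": p, "gdp_c": float(c + p)}
    for c in range(100)
    for p in range(5)
]
records[7]["gdp_c"] = None  # country C1, period 2 is missing

clean = impute_by_country_mean(records, "gdp_c")
train, test = split_train_test(clean)
print(clean[7]["gdp_c"], len(train), len(test))  # 3.0 350 150
```

With 500 instances, the 70% training share corresponds to 350 rows and the remaining 150 rows form the test set.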

Data analysis and parameter setting
Feature selection

Correlation
The Pearson correlation coefficient measures the linear correlation between two variables. It takes a value between -1 and +1, where +1 represents a completely positive linear correlation, 0 represents no linear correlation, and -1 represents a completely negative linear correlation [33]. The Pearson correlation coefficient r is defined as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1} $$

where n is the sample size; $x_i$ and $y_i$ are the individual sample points indexed with i; $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean, and analogously for $\bar{y}$. The correlation matrix is used to select the subset of relevant features. In this study, the inputs that are highly correlated with the output attribute are considered. According to [34], the correlation is weak positive if r is between 0 and 0.25, fair positive if r is between 0.25 and 0.5, good if r is between 0.5 and 0.75, and excellent if r is more than 0.75. The same ranking applies to negative correlations, i.e., for r in [-1, 0]. The correlation tool of the Microsoft Excel data analysis package is used to obtain the result. The Excel tool is a cost-effective alternative to costly software, and it is simple to use for basic data analysis.
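As an illustration of the definition and the strength bands above, the following sketch computes r in plain Python and labels its strength. The study itself used Microsoft Excel, so this is not the authors' tooling; the function names are hypothetical, and how values that fall exactly on a band boundary are assigned is an assumption of the sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| using the bands quoted from [34] (boundary handling assumed)."""
    magnitude = abs(r)
    if magnitude > 0.75:
        return "excellent"
    if magnitude >= 0.5:
        return "good"
    if magnitude >= 0.25:
        return "fair"
    return "weak"

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # perfectly positive linear relation
print(pearson_r(x, [5, 4, 3, 2, 1]))   # perfectly negative linear relation
print(strength(0.52), strength(-0.6), strength(0.1))
```

The sign of r gives the direction of the relationship, while the band is assigned from its absolute value, which is how negative correlations receive the same ranking as positive ones.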

PCA
PCA is arguably the most common multivariate statistical technique for reducing data with multiple dimensions, and it is frequently used to reduce well-being indicators to a single index of well-being. PCA was performed in this study using RStudio (version 4.1.1) to extract the major components of the input and output variables after the input and output datasets were centered and scaled. On the one hand, the factor loading criterion is set to 0.3, which means that only variables with an absolute factor loading equal to or greater than 0.3 are considered, because a loading of less than 0.3 is regarded as insignificant [35]. The varimax factor rotation approach was used to minimize the number of variables with excessive loading on a factor and so assure factor interpretability. PCA-biplots, on the other hand, were created for feature selection [36]. PC1 is designated as dimension one, whereas PC2 is designated as dimension two. A biplot-based PCA technique is employed for display and validation; it is a statistical graph that depicts the relationships between multiple parameters [37]. The vectors in a PCA-based biplot are the projected variables.
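The loading criterion can be sketched with NumPy as below. Several points are assumptions rather than the authors' procedure: the study used R, loadings are taken here as eigenvectors of the correlation matrix scaled by the square root of their eigenvalues, the varimax rotation is omitted for brevity, and the three-variable dataset is a hypothetical toy design chosen so the result can be checked by hand.

```python
import numpy as np

def pca_loadings(X):
    """Return explained-variance ratios and loadings of centered, scaled data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)            # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals / eigvals.sum(), loadings

def select_by_loading(loadings, names, n_components, threshold=0.3):
    """Keep variables whose |loading| reaches the threshold on any retained PC."""
    keep = (np.abs(loadings[:, :n_components]) >= threshold).any(axis=1)
    return [name for name, flag in zip(names, keep) if flag]

# Toy design: x1 and x2 share a common factor f, x3 is uncorrelated with both.
f = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
g = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
h = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
X = np.column_stack([f + 0.1 * g, f - 0.1 * g, h])
names = ["x1", "x2", "x3"]

ratios, loadings = pca_loadings(X)
print(np.round(ratios, 4))
print(select_by_loading(loadings, names, n_components=1))  # ['x1', 'x2']
print(select_by_loading(loadings, names, n_components=2))  # ['x1', 'x2', 'x3']
```

Retaining more components admits more variables, which mirrors how the feature sets grow as more principal components are taken into account.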

Penalized linear regression
Penalized regression models, for example the least absolute shrinkage and selection operator (LASSO) or Elastic-net (E-net), perform feature selection and prediction simultaneously. The high performance of LASSO and E-net arises because these models avoid overfitting and reduce model complexity by penalizing the size of the coefficients [38]. [39] introduced LASSO, a well-known penalized technique for choosing individual variables, which is based on the following model:

$$ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \tag{2} $$

The feature selection problem is expressed as finding the elements of $\beta$ that equal zero. Estimates are selected by

$$ (\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t \tag{3} $$

where $y_i$ is the dependent variable, $x_{ij}$ are the predictor variables, $\beta_0$ is the intercept, $\beta_j$ are the unknown parameters of the regression equation, $\varepsilon_i$ is the error term, and p is the number of predictors. Furthermore, $t > 0$ is a tuning parameter that governs the degree of shrinkage applied to the estimates. This is equivalent to minimizing

$$ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{4} $$

which is ordinary least squares with a penalty on large coefficient estimates set by $\lambda$. The term $\sum_j |\beta_j|$ is a constraint on the coefficient vector that yields a sparse solution vector $\hat{\beta}_\lambda$; as $\lambda$ increases, more elements of $\hat{\beta}_\lambda$ become zero. LASSO is able to choose at most $N - 1$ features, where N is the sample size [40]. This might be an issue when performing a regression with a limited number of samples but a large number of features [41]. The Elastic-net approach attempts to overcome the constraints of the LASSO technique; it is especially effective when there are numerous correlated features [42]. This model is equivalent to minimizing

$$ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \Big[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \Big] \tag{5} $$

The Elastic-net parameter $\alpha$ lies in $[0, 1]$; if $\alpha = 1$, the formulation reduces to the LASSO algorithm.
In this paper, the penalized linear regression analysis was carried out using RStudio in which the glmnet package is employed. In addition, the datasets were centered and scaled, and tenfold cross-validation was performed to produce internally valid performance metrics.
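To make the selection mechanism concrete, here is a small coordinate-descent sketch of the elastic-net penalty in NumPy. It is not the glmnet implementation used in the study; it follows the objective scaling documented for glmnet, (1/2N)·RSS + λ[α·Σ|β_j| + (1 − α)/2·Σβ_j²], and the orthogonal toy design, the values of λ and α, and the function names are all hypothetical choices that let the result be checked by hand.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent on centered, scaled data; alpha=1 is the LASSO."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that leaves out feature j's current contribution.
            r_j = yc - Xs @ beta + Xs[:, j] * beta[j]
            z = Xs[:, j] @ r_j / n
            beta[j] = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
    return beta

# Toy design with orthogonal columns; y depends on the first two features only.
f = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
g = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
h = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
X = np.column_stack([f, g, h])
y = 4.0 + 2.0 * f - 1.5 * g

beta_lasso = elastic_net(X, y, lam=0.3, alpha=1.0)
beta_enet = elastic_net(X, y, lam=0.3, alpha=0.5)
selected = [j for j in range(X.shape[1]) if beta_lasso[j] != 0]
print(np.round(beta_lasso, 4), selected)
print(np.round(beta_enet, 4))
```

In this design the LASSO coefficients are the least-squares values 2 and -1.5 shrunk by λ = 0.3, and the third coefficient is exactly zero, so only the first two features are selected. In practice λ would be chosen by cross-validation rather than fixed by hand.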

Regression and validation
ANN
Three critical aspects influence ANN: the unit's input and activation functions, the network architecture, and the weight of each input connection [43]. An ANN is composed of three levels of nodes (neurons), namely the input, hidden, and output layers (Fig. 2a). The data sample is accepted by the input layer, and the target category is returned by the output layer [44]. The neuron, the fundamental unit of these networks, mimics its human counterpart, having dendrites for taking input variables and emitting an output value that may be used as input for other neurons [45]. The neural network's layers of fundamental processing units are interconnected, with weights assigned to each connection [46], which are changed during the network's learning process. This step improves not only the interconnections between the layers of neurons but also the parameters of the transfer functions between one layer and another, reducing errors. Finally, the neural network's final layer is in charge of combining all of the signals from the preceding layer into a single output signal, the network's response to specific input data [15].
A basic ANN structure is depicted in Fig. 2b, which includes neuron connections, biases assigned to neurons, and weights assigned to connections. Two equations can be used to identify a neuron k [47]:

$$ u_k = \sum_{j=1}^{n} w_{kj} x_j $$

and

$$ y_k = f(u_k + b_k) $$

where $x_1, x_2, \ldots, x_n$ are the inputs, $w_{k1}, w_{k2}, \ldots, w_{kn}$ are the neuron weights, $u_k$ is the result of the weighted input calculation, $b_k$ is the bias term, $f(\cdot)$ is the activation function, and $y_k$ is the output. There are numerous algorithms that may be used to train a network [43].
The MATLAB 2020b Neural network toolbox was used in this investigation. The ANN was built using the default network and a hidden layer size of 1 × 10 (one hidden layer with ten nodes). For prediction tasks, a feed-forward ANN with backpropagation learning is built by default. Backpropagation is the most commonly used supervised learning algorithm [48]. TRAINLM is a network training function that uses the Levenberg-Marquardt optimization technique to update the weight and bias variables. TRAINLM is a fast algorithm, although it takes up more memory than other algorithms. LEARNGDM (gradient descent with momentum weight and bias learning function) is used for error minimization. This function computes the weight change for a specific neuron from the neuron's input and error terms, weight and bias, learning rate, and momentum term, and is equivalent to gradient descent with momentum backpropagation. The tangent sigmoid function (TANSIG) is used as the transfer function, given by the following equation for the input variable x [49]:

$$ \mathrm{tansig}(x) = \frac{2}{1 + e^{-2x}} - 1 $$

TANSIG is employed in both the hidden and output layers. It calculates a layer's output from its net input. The values returned by this activation function range from -1 to +1.
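The forward pass of such a network can be sketched in a few lines of plain Python. The sketch covers only prediction, not TRAINLM or LEARNGDM training, and every weight, bias, and input value below is a hypothetical placeholder rather than a parameter from the study.

```python
import math

def tansig(x):
    """Tangent sigmoid transfer function; mathematically equal to tanh(x)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    u = sum(w * x for w, x in zip(weights, inputs))
    return tansig(u + bias)

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden layer followed by a single output neuron, both using tansig."""
    hidden = [neuron(inputs, w, b) for w, b in zip(hidden_w, hidden_b)]
    return neuron(hidden, out_w, out_b)

# Hypothetical parameters: 2 inputs, 10 hidden nodes (the 1 x 10 layout), 1 output.
hidden_w = [[0.1 * (k + 1), -0.05 * (k + 1)] for k in range(10)]
hidden_b = [0.01 * k for k in range(10)]
out_w = [0.2] * 10
out_b = -0.1

prediction = forward([0.5, -1.0], hidden_w, hidden_b, out_w, out_b)
print(prediction)
```

Because the output neuron also uses the tangent sigmoid, predictions always lie strictly between -1 and +1, so a target such as the LPI score has to be rescaled into that range before training and mapped back afterwards.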

MLP-ANN
The network architecture refers to the structure of connectivity between distinct neurons in ANN. One of the most frequent and useful ANN architectures is the multi-layer perceptron (MLP) network. Each neuron in MLP-ANN is linked to many of its neighbors, with variable weights indicating the relative importance of the individual neuron inputs to the other neurons. MLP is a type of network that belongs to the feed-forward ANN family, and its learning method is backpropagation [50].
In this study, the multilayer perceptron ANN is designed with hidden layer sizes of 10 × 10 (ten hidden layers with ten nodes each). The MATLAB 2020b Neural network toolbox was used, with parameter settings similar to those of the abovementioned ANN.

SVR
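SVR is an analytical technique used to explore the relationship between one or more predictor variables and a real-valued (continuous) dependent variable [51]. When addressing nonlinear problems, SVR uses a kernel function to map the nonlinear regression problem into a higher-dimensional space, where it determines the best hyperplane for the sample points. In its standard dual form, the problem is

$$ \max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(X_i, X_j) - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) $$

$$ \text{subject to} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C $$

where $X_i$ is the sample data and $y_i$ its target value; l is the sample size; C is the penalty coefficient; $\varepsilon$ is the error size beyond which a sample is penalized; $K(X_i, X_j)$ is the kernel function; and $\alpha$, $\alpha^*$ are the multipliers whose optimal values are sought.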
In RStudio, the SVR is set up with the caret library for classification and regression training and the e1071 library for the support vector machine. The epsilon-regression type is applied, the radial basis kernel is used for prediction, and the cost of constraint violation is left at its default (= 1). In the case of a probabilistic regression model, the sigma parameter of the fitted model is the scale parameter of the hypothesized (zero-mean) Laplace distribution estimated by maximum likelihood.
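The two ingredients named above, the radial basis kernel and the epsilon-insensitive error of epsilon-regression, can be illustrated with NumPy. This sketch does not fit an SVR model. The values gamma = 1/(number of features) and epsilon = 0.1 are the documented e1071 defaults and are assumed here, since the text only states the cost setting; the data points are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Radial basis kernel K(a, b) = exp(-gamma * ||a - b||^2) for all row pairs."""
    sq_dist = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Errors inside the epsilon tube cost nothing; larger ones cost the excess."""
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

# Hypothetical samples with two features each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 1.0 / X.shape[1]   # assumed default: 1 / (data dimension)
K = rbf_kernel(X, X, gamma)
print(np.round(K, 4))

loss = eps_insensitive_loss(np.array([3.0, 3.5]), np.array([3.05, 3.9]), epsilon=0.1)
print(np.round(loss, 4))   # first error is inside the tube, second exceeds it by 0.3
```

The kernel matrix has ones on the diagonal and decays with squared distance, while the loss shows why small prediction errors do not influence the fitted function.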

RFR
The RF technique is a tree-based ensemble approach that was created to overcome the limitations of the classic classification and regression tree (CART) method [52]. RFR is an ensemble learning technique that employs regression algorithms and decision trees [53]. The RF regression technique employs regression trees as base learners. N bootstrapped sample sets are taken from the source dataset to train the RF [52]. Following the selection of the forest's number of trees (C), each regression tree is built on a different bootstrap sample. As split candidates, only a limited and fixed number K of randomly picked predictors are chosen. The procedures are then repeated until C such trees are formed, and new data are predicted by aggregating the C trees' predictions. An RF regression predictor is denoted as [52]:

$$ \hat{f}_{RF}^{C}(x) = \frac{1}{C} \sum_{i=1}^{C} T_i(x) $$

where x is the vectored input variable, C is the number of trees, and $T_i(x)$ is a single regression tree created from a subset of input parameters and the bootstrapped samples.
In RStudio, the RFR uses the caret library for classification and regression training and the randomForest library, which implements Breiman's random forest method for classification and regression. The number of trees to grow is ntree = 500; this should not be set too low, to guarantee that every input row is predicted at least a few times [54]. The remaining settings are the regression default for mtry, the number of variables randomly selected as candidates at each split, and importance = TRUE.

Penalized linear regression
One kind of penalized linear regression is Ridge regression. This approach can reduce the magnitude of the regression coefficients, resulting in improved generalizability for predicting unseen data [53]. The ridge coefficients are calculated using the following equation:

$$ \hat{\beta}^{ridge} = \arg\min_{\beta_0, \beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\} $$

Ridge regression has one apparent drawback: it includes all predictors in the final model. It shrinks all of the coefficients toward zero, but not exactly to zero [55]. The LASSO and E-net regression are newer alternatives to Ridge regression that help to address this limitation. As previously stated, this study used LASSO and E-net regression for both feature selection and prediction, whereas Ridge regression is employed only to carry out the regression.
As with LASSO and E-net regression, the glmnet package in RStudio is employed for Ridge regression, with the datasets centered and scaled in pre-processing and tenfold cross-validation applied.
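The shrinkage behavior of Ridge regression can be seen in a closed-form sketch with NumPy. This is not the glmnet routine used in the study (glmnet scales the penalty differently); it solves the unscaled ridge problem on centered, scaled data, and the toy design and the value of λ are hypothetical.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^(-1) X'y on centered, scaled data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

# Toy design with orthogonal columns; the third feature has only a weak effect.
f = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
g = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
h = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
X = np.column_stack([f, g, h])
y = 4.0 + 2.0 * f - 1.5 * g + 0.1 * h

beta_ols = ridge_coefficients(X, y, lam=0.0)
beta_ridge = ridge_coefficients(X, y, lam=8.0)
print(np.round(beta_ols, 4))    # unpenalized least-squares coefficients
print(np.round(beta_ridge, 4))  # every coefficient halved, none exactly zero
```

With λ = 8 in this design every coefficient is halved, including the weak third one, which stays nonzero. That is the drawback noted above: Ridge keeps all predictors in the model, so it cannot by itself produce a feature subset.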

Performance evaluation
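The following criteria will be used to evaluate the performance of the models in the prediction validation [56][57][58][59]. The mean absolute error (MAE):

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$

the mean absolute percentage error (MAPE):

$$ \mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| $$

the root-mean-square error (RMSE):

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $$

the Nash-Sutcliffe efficiency coefficient (NSE):

$$ \mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} $$

and the determination coefficient ($R^2$):

$$ R^2 = \frac{\left[ \sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) \right]^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2 \, \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2} $$

where N is the number of validation data points, $y_i$ is the real LPI score with average $\bar{y}$, and $\hat{y}_i$ is the predicted LPI score with average $\bar{\hat{y}}$.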
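The five criteria can be computed directly from paired real and predicted values, as in the plain-Python sketch below. The function names and the four sample scores are hypothetical, and the determination coefficient is taken as the squared Pearson correlation between real and predicted values, which is an assumption based on the description of the averages of both series.

```python
import math

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def nse(y, yhat):
    """Nash-Sutcliffe efficiency: 1 minus residual sum over total sum of squares."""
    mean_y = sum(y) / len(y)
    residual = sum((a - b) ** 2 for a, b in zip(y, yhat))
    total = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - residual / total

def r_squared(y, yhat):
    """Squared Pearson correlation between real and predicted values."""
    mean_y, mean_p = sum(y) / len(y), sum(yhat) / len(yhat)
    cross = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, yhat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in yhat)
    return cross ** 2 / (var_y * var_p)

# Hypothetical real and predicted scores.
real = [3.0, 2.5, 4.0, 3.5]
pred = [2.8, 2.7, 3.9, 3.6]
for name, fn in [("MAE", mae), ("MAPE", mape), ("RMSE", rmse),
                 ("NSE", nse), ("R2", r_squared)]:
    print(name, round(fn(real, pred), 4))
```

For these four points MAE is 0.15 and NSE is 0.92. NSE and the squared correlation differ slightly because NSE penalizes any systematic offset or scale error in the predictions, while the correlation does not.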

Analysis time
Using big O-notation analysis, we determine the theoretical computational time complexity of the ML models. The O-notation presents the asymptotic computing characteristics by estimating the worst-case computational time.

The result of correlation method
Figure 3 depicts the result of the correlation study performed using the Microsoft Excel data analysis tool. Because the task is a regression-type prediction, the correlation analysis takes both the dependent and the predictor variables as input. We begin by constructing a feature set (set A) of predictor variables that have a direct good or excellent correlation with the dependent variable LPI (r ≥ 0.5, as shown by the red border in Fig. 3). The predictor variables in set A are X4 (GDP_C; r = 0.75), X18 (Exp; r = 0.52), and X19 (Imp; r = 0.5), for a total of three features. Furthermore, the predictor variables that have a good or excellent correlation with a member of set A (|r| ≥ 0.5, as shown by the yellow border in Fig. 3) are taken into account and added to form set B. The predictor variables added to set A to form set B are X2 (N_GDP; r = 0.83 with Exp and r = 0.93 with Imp), X16 (LF; r = 0.62 with Exp and r = 0.57 with Imp), X21 (FER; r = 0.6 with Exp and r = 0.52 with Imp), X22 (Iwd_DI; r = 0.54 with Exp and r = 0.59 with Imp), and X23 (Owd_DI; r = -0.56 with Exp and r = -0.6 with Imp). The total number of features in set B is 3 + 5 = 8. Table 4 shows a summary of the subset of features selected using the correlation approach.

Fig. 4 The percentage of variation based on principal component
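The two-step construction of sets A and B described above can be sketched in plain Python. The dictionaries contain only the correlation values quoted in the text; the variable names and the use of the absolute value of r for the threshold test are assumptions of the sketch, the latter because Owd_DI is included with negative coefficients.

```python
THRESHOLD = 0.5

# Correlations with LPI quoted in the text.
corr_with_lpi = {"GDP_C": 0.75, "Exp": 0.52, "Imp": 0.50}

# Correlations with set A members quoted in the text.
corr_with_predictors = {
    "N_GDP": {"Exp": 0.83, "Imp": 0.93},
    "LF": {"Exp": 0.62, "Imp": 0.57},
    "FER": {"Exp": 0.60, "Imp": 0.52},
    "Iwd_DI": {"Exp": 0.54, "Imp": 0.59},
    "Owd_DI": {"Exp": -0.56, "Imp": -0.60},
}

# Step 1: predictors directly correlated with the dependent variable.
set_a = [name for name, r in corr_with_lpi.items() if abs(r) >= THRESHOLD]

# Step 2: predictors correlated with at least one member of set A.
added = [
    name
    for name, pairs in corr_with_predictors.items()
    if name not in set_a
    and any(abs(pairs.get(member, 0.0)) >= THRESHOLD for member in set_a)
]
set_b = set_a + added

print(set_a)       # ['GDP_C', 'Exp', 'Imp']
print(len(set_b))  # 8
```

Set B is therefore a superset of set A: it adds predictors that are related to LPI only indirectly, through their correlation with exports and imports.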

The result of PCA method
The PCA result is generated in RStudio, with the dependent variable and the predictor variables used as PCA model input to select the features for regression [56]. Figure 4 depicts the proportion of variance of each principal component (only PC1 to PC10 out of a total of 27 PCs). The first and second principal components (PC1 and PC2) together explain 34.9 percent of the variance, whereas PC1 through PC10 cover roughly 80 percent of the variance (81.41 percent). When PC1 to PC3 are evaluated, the variance is 46.33 percent, which is more than half of the variance covered by PC1 to PC10. When PC1 to PC5 are considered, i.e., half of the 10 PCs, the variance is 62.12 percent. To construct a collection of selected features, we examined the attributes that have a high loading on a factor (equal to or greater than 0.3). The attributes detected in PC1 to PC3 (46.33 percent of variance), PC1 to PC5 (62.12 percent), and PC1 to PC10 (81.41 percent) are allocated to feature sets C, D, and E, respectively. Set C comprises X 1 (R_GDP_Gr), X 2 (R_GDP_Gr), X 4 (GDP_C), X 5 (Pri_C_Gr), X 11 (BB/GDP), X 14 (BE/GDP), X 15 (GNS_Rt), X 18 (Exp), X 19 (Imp), X 22 (Iwd_DI), and X 23 (Owd_DI), 11 features in total. Set D, with 16 features, is set C plus 5 features: X 8 (CAB), X 9 (CP_G), X 12 (GDP_D), X 21 (FER), and X 24 (TB). Set E has 24 of the 26 features, excluding X 16 (LF) and X 17 (CAB/GDP). Furthermore, the feature selection based on the PCA-biplot is illustrated in Fig. 5, with the selected features represented by blue vectors. The selection is motivated by the interrelationship of each feature with LPI. The direction of a feature vector reflects a positive or negative correlation [65].
When a feature vector points in a direction similar to that of the LPI vector, that is, when the angle between them is smallest, it indicates the strongest positive correlation, while the opposite direction indicates a negative correlation. Vectors close to perpendicular to the LPI vector, on the other hand, are weakly correlated (orange vectors in Fig. 5). Based on the PCA-biplot, the selected features of set F are X2 (R_GDP_Gr), X4 (GDP_C), X16 (LF), X18 (Exp), X19 (Imp), X21 (FER), X22 (Iwd_DI), X23 (Owd_DI), and X25 (MMI_Rt), 9 features in total. The summary of the subset of features selected using the PCA method is shown in Table 4.

Table 3 displays the results of the LASSO and E-net penalized linear regression methods. Using RStudio, the LASSO regression model of Eq. (4) reduces the predictor parameters from 26 to 10, with the corresponding intercept values and parameter significance. The 9 features selected by LASSO (set G) are X2 (R_GDP_Gr), X3 (Pop_Gr), X4 (GDP_C), X9 (CP_G), X13 (PD/GDP), X18 (Exp), X19 (Imp), X25 (MMI_Rt), and X26 (DC_Gr). For Elastic-net, related to Eq. (5), we vary α over 0.1, 0.25, 0.5, 0.75, and 0.9. The feature selection results from RStudio, which reports the parameters that the model does not shrink to zero, are displayed in Table 3. When α = 0.9, the set of selected features is similar to the result of LASSO. When α is 0.25, 0.5, or 0.75, the model yields a common set of 10 selected features (set H). Finally, for α = 0.1, we found that 15 parameters were not shrunk (set I). Set H contains all attributes of set G plus X15 (GNS_Rt). Set I comprises all elements of set G together with X1 (R_GDP_Gr), X14 (BE/GDP), X17 (CAB/GDP), X20 (NDIF), and X21 (FER). The summary of the subset of features selected using the penalized linear regression methods is shown in Table 4.
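To illustrate why a smaller α tends to leave more parameters unshrunk, the following is a minimal coordinate-descent sketch of Elastic-net. It assumes the commonly used objective (1/2n)‖y − Xβ‖² + λ[α‖β‖₁ + (1 − α)/2 ‖β‖²], which may differ in scaling from Eq. (5); the toy data, the fixed λ = 0.5, and the function names are all assumptions made for illustration, and this is not the RStudio routine used in the study.

```python
def soft_threshold(value, limit):
    """Shrink value toward zero by limit; return 0.0 inside [-limit, limit]."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Coordinate descent for the assumed Elastic-net objective.

    Assumes y and the columns of X are centered; no intercept is fitted.
    alpha = 1 gives LASSO, alpha = 0 gives Ridge.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0     # covariance of feature j with the partial residual
            col_sq = 0.0  # mean squared value of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                col_sq += X[i][j] ** 2
            rho /= n
            col_sq /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (col_sq + lam * (1.0 - alpha))
    return beta


# Toy data: two centered, orthogonal predictors; y = 2.0 * x1 + 0.1 * x2.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]

for alpha in (0.1, 0.5, 0.9, 1.0):
    beta = elastic_net(X, y, lam=0.5, alpha=alpha)
    kept = [j for j, b in enumerate(beta) if b != 0.0]
    print(alpha, kept, [round(b, 4) for b in beta])
```

In this toy case the weak predictor survives only at α = 0.1, where the L1 threshold λα is small, while the larger values of α drop it as LASSO does. In practice λ would normally be tuned (for example by cross-validation) rather than fixed.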

Regression and validation result
For each subset of selected features (set A to set I), 70% of the dataset is used to train the identified ML methods, namely ANN, MLP-ANN, SVR, RFR, and Ridge. The LASSO and E-net models are trained only on the feature sets that they themselves selected. In addition, the complete collection of all features is included for comparison. The test sets are then used to validate the models. The validation findings are reported through the performance criteria MAE, MAPE, RMSE, NSE, and R².
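The five criteria can be computed as in the sketch below, which uses their standard textbook definitions. Two points are assumptions: R² is taken here as the squared Pearson correlation between observed and predicted values (one common convention, which differs from NSE), and the sample values are invented purely for illustration.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mape(obs, pred):
    """Mean absolute percentage error, in percent (observations must be nonzero)."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread


def r_squared(obs, pred):
    """Squared Pearson correlation between observations and predictions (assumed definition)."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)


# Invented example values, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

for name, fn in [("MAE", mae), ("MAPE", mape), ("RMSE", rmse), ("NSE", nse), ("R2", r_squared)]:
    print(name, round(fn(observed, predicted), 4))
```

Lower values are better for MAE, MAPE, and RMSE, whereas NSE and R² are better the closer they are to 1.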
The summary of performance evaluation findings is given in Table 5 and Fig. 6 for easier comparison. SVR with feature set I has the best MAE (*0.1349, the minimum value) and also the best MAPE (*4.6387, the lowest value). For RMSE, ANN with feature set I achieves the best performance (*0.1808, the minimum value). NSE values range between −∞ and 1, and values close to 1 indicate the best prediction performance [57]; SVR with feature set I achieves the best NSE (*0.8938). Moreover, SVR with feature set I has the highest R² (*0.8964, the maximum value). If R² > 0.8, there is a strong correlation between actual values and model estimations [16].
When the ML models are compared using the average value of each model over all feature sets, the best performing model is ANN, with average MAE, MAPE, RMSE, and NSE values of **0.1525, **5.1775, **0.1985, and **0.8691, respectively. Compared with ANN, the MLP-ANN, SVR, and RFR models show satisfactory performance on all criteria. Compared with these admissible models, all penalized linear regression methods (Ridge, LASSO, and E-net) perform worse. Turning to the sets of selected features, we determined the average performance of the admissible ML models, omitting the penalized linear regression approaches. Set H has the best performance for MAE, MAPE, and R² (***0.1497, ***5.1452, and ***0.8803, respectively), whereas set C has the best performance for RMSE and NSE (***0.1975 and ***0.8728, respectively).
Because different criteria produce different sets of optimal results, in this study we reprocessed the feature sets using feature union and intersection operations. A feature union and intersection procedure is presented to reorganize the acquired feature subsets [66] (Table 6). Table 7 shows a summary of the ML regression model performance evaluation for the new feature sets that have been merged from the parent sets. As noted in [67], when the connections between parameters become noninvertible (due to a large number of predictor variables), the input and output configurations used in ANN have a major influence on the accuracy. Furthermore, ANN outperforms linear models in terms of accuracy (where the number of important predictor variables is restricted) [68]. Moreover, concentrating on the sets of selected features, we determined the average performance of the four suitable ML models described above. The best performance is exhibited by set C ∪ H, which is best on all performance metrics, namely MAE, MAPE, RMSE, NSE, and R², with values of ***0.1463, ***5.0066, ***0.1908, ***0.8811, and ***0.888, respectively. The members of set C ∪ H, which affect the accuracy of LPI prediction, are X1 (R_GDP_Gr), X2 (R_GDP_Gr), X3 (Pop_Gr), X4 (GDP_C), X5 (Pri_C_Gr), X9 (CP_G), X11 (BB/GDP), X13 (PD/GDP), X14 (BE/GDP), X15 (GNS_Rt), X18 (Exp), X19 (Imp), X22 (Iwd_DI), X23 (Owd_DI), X25 (MMI_Rt), and X26 (DC_Gr), for a total of 16 features. As previously stated, the number of instances should be maximized, because accuracy is normally high with large datasets, and the number of missing values should be minimized, because this reduces bias and improves the efficiency of the analysis; limiting the number of features supports both aims. The parent sets C and H limit the number of features to 11 and 10, respectively.
Those features may be used as an alternative since they give adequate performance (closest to the best). A further explanation is that set C is supplied by the PCA method, which many studies report as performing better than other algorithms, while set H is provided by the penalized linear E-net regression, a technique that selects subsets accurately but lacks optimum prediction rates. Based on the errors shown by the boxplots (Fig. 7), the four acceptable ML models do not differ from each other for the best-performing set C ∪ H and the parent sets C and H; the largest error values, which belong to RFR, are circled. However, the extreme error levels of all models are nearly the same. Furthermore, Taylor diagrams were created to evaluate the acquired results, as they allow the correctness of the developed models to be determined in many respects [57,58]. Figure 8 clearly shows that the prediction results for the set C ∪ H and the parent sets C and H based on the four acceptable ML models are close to the observations. What is interesting about the findings shown in Fig. 8c is that, for the set C ∪ H, the ANN model outperformed the other models, giving the shortest distance to the observation. The statistical significance of the acquired results was examined using the Kruskal-Wallis test, which analyzes whether the distributions of the predicted and observed LPI values are consistent [57,69]. H0 denotes the null hypothesis that there is no statistically significant difference between the predicted and observed LPI values. Table 8 reveals that H0 could not be rejected (P value ≥ 0.05) in any of the C, H, and C ∪ H set predictions; in other words, there is no significant difference between predicted and observed averages.
This holds for all of the ANN, MLP-ANN, SVR, and RFR models, indicating that each of them produces accurate results. This suggests that the pre-processing steps of data preparation and feature selection had a beneficial influence on the ML predictions. Table 9 shows the analysis time of the ML algorithms as reported by the analysis tools (MATLAB reports the values in seconds and RStudio in milliseconds). The average analysis time was longest for the MLP-ANN training procedure (approximately eight seconds for all sets); for the other models it was less than a second.
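The union and intersection of the parent sets C and H discussed in this section can be reproduced with plain set operations. The sketch below uses the feature indices listed in the text; the variable names are illustrative.

```python
# Set C (PCA, PC1 to PC3) and set G (LASSO), using the indices given in the text.
set_c = {"X1", "X2", "X4", "X5", "X11", "X14", "X15", "X18", "X19", "X22", "X23"}
set_g = {"X2", "X3", "X4", "X9", "X13", "X18", "X19", "X25", "X26"}

# Set H (E-net) is set G plus X15 (GNS_Rt).
set_h = set_g | {"X15"}

union_ch = set_c | set_h         # C ∪ H: features selected by either method
intersection_ch = set_c & set_h  # C ∩ H: features selected by both methods


def by_index(names):
    """Sort feature labels such as 'X11' by their numeric index."""
    return sorted(names, key=lambda name: int(name[1:]))


print(len(set_c), len(set_h), len(union_ch))
print(by_index(union_ch))
print(by_index(intersection_ch))
```

The union has 16 members, in agreement with the count reported for set C ∪ H, and the five shared features are those selected by both PCA and E-net.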

Discussion
Finally, as shown in Table 10, we discuss the findings for both the filter and the embedded feature selection techniques, focusing on the statistical properties of each suggested method and ML algorithm. The discussion describes the advantages and disadvantages of the models that influence the findings of this study.
Effective wrapper strategies, such as sequential search or evolutionary algorithms such as Particle Swarm Optimization (PSO) or Genetic Algorithm (GA), can also give good results: they provide locally optimal solutions and are computationally viable. However, wrappers carry a risk of overfitting and are computationally costly [72], a disadvantage that becomes more obvious as the feature space grows. The wrapper technique is therefore excluded from this analysis, although it will be significant in future studies.

Concluding remark and future work
In conclusion, the current study illustrates an application of machine learning regression to feature selection. We examined logistics performance, as measured by the World Bank's LPI, against economic attributes drawn from S&P Global Market Intelligence's macroeconomic data source. The 500 case samples range from 2009 to 2018, with an initial set of 26 economic features available. In the initial feature selection, the number of instances (to be maximized) was traded off against the missing values in the national economic data (to be minimized). The filter methods of correlation and PCA are employed in the suggested feature selection procedure. The ML regression algorithms ANN, MLP-ANN, SVR, RFR, and Ridge are then used to train and validate the data set on the selected features. The embedded technique of penalized linear regression (LASSO and E-net) is also used to select features, followed by training and validation on the dataset; the proposed ML regression models likewise use the feature subsets from penalized linear regression for training and validation. According to the models' performance on the MAE, MAPE, RMSE, NSE, and R² criteria, the feature sets of PCA (set C) and E-net (sets H and I) offer performance closest to the best.
Then, using the parent sets (C, H, and I), feature union and intersection operations are performed. The set C ∪ H (a total of 16 features) performs best across all criteria. These findings address the research question of whether ML algorithms can select an appropriate set of economic features that reflect a country's logistics performance. In response to the question of which ML regression technique is best for predicting logistics performance from selected economic attributes, the findings indicate that ANN is the most effective prediction model in this study. Furthermore, we note that sets C and H limit the number of features to 11 and 10, respectively. Those features may be used as an alternative, since they give adequate performance (near the best), when it is necessary to maximize the number of instances and reduce the missing data in the dataset.
Furthermore, a future study may focus on utilizing more diverse feature dimensions integrated with economic attributes. The unique elements connected to the megatrend, such as the carbon emissions rate, the cost and consumption rate of fuel and renewable energy, and the e-commerce market size and growth, may reflect on logistics performance in the new era of a global supply chain. The enrichment work also extends to the wrapper technique.

Table 10 Advantages, disadvantages, and discussion of the feature selection techniques and ML models

Filter

PCA
Advantages: Lower computational cost and time [72]. Good generalization ability [72].
Disadvantages: Simply scaling some of the criteria might result in significant changes to the results [73].
Discussion: The PCA approach provides the potentially important feature set C. One of its advantages is that few valuable features are left out. However, in line with its drawback, a change in the criteria, i.e., the factor loading and the percentage of variation, substantially modifies the resulting feature set.

ANN
Advantages: Better at identifying very complex patterns and making accurate predictions [74].
Disadvantages: Model networks are used to approximate or estimate functions that generally need a considerable quantity of training data [74].
Discussion: The pattern of the LPI and the economic attributes of various countries is extremely complex because the research data covers different economic backgrounds, i.e., underdeveloped, developing, and developed countries. In line with its advantages, ANN is the most effective prediction model for data with these characteristics.

MLP-ANN
Advantages: When the number of input features and the data complexity are substantially larger, it utilizes non-linear correlations among variables to produce more accurate predictions [75].
Disadvantages: Slow convergence velocity, resulting in a preference for the local optimum [76]. Neural networks require a large number of parameters and are unable to see the learning process among them [76]. The result is difficult to comprehend, and the learning time is excessive [77].
Discussion: MLP-ANN may benefit from data complexity as well, but its performance is lower than that of ANN because the network design could not be modified in this investigation (layer sizes fixed at 1 × 10), which may have prevented the most effective network from being found in this study.

SVR
Advantages: High generalization ability with a limited training sample set [78].
Disadvantages: Nonlinear problems are difficult to solve, and it might be challenging to identify an appropriate kernel function [77]. When the number of observation samples is enormous, the effectiveness of SVR may be low [78].
Discussion: SVR achieves an acceptable result, demonstrating adequate performance across all criteria. However, this study may have had a disadvantage due to the difficulty in solving nonlinear problems.

RFR
Advantages: Can handle high-dimensional data, has a quick training speed, and can recognize the mutual impact of features [77]. It can balance mistakes in uneven data sets, and even if a major portion of the features are destroyed, accuracy can still be preserved [77].
Disadvantages: In some noisy classification or regression issues, there will be overfitting, and attributes with more values will have a higher influence on the results [77].
Discussion: RFR achieves an acceptable result, demonstrating adequate performance across all criteria. However, this work may have suffered from overfitting in some noisy regression issues with higher values of economic attributes.

Ridge
Advantages: Capable of dealing with highly correlated environmental variables (multicollinearity) [78]. Is useful when the amount of data is small in comparison to the number of variables [78].
Disadvantages: The estimations are biased [78].
Discussion: Ridge regression results are restricted in this study since the data set contains poorly correlated environmental variables.

Embedded

LASSO
Advantages: Provides an interpretable model and selects a subset of predictors having the greatest influence on the response variable [79]. When less data is available, it might be utilized for feature selection [79].
Disadvantages: Selects one covariate at random from a set of highly collinear variables to incorporate in the model and discards the others [79].
Discussion: As an advantage, LASSO may be utilized for feature selection. However, when compared to the other ML algorithms of the filter approach, it has limitations when used in the regression process.

E-net
Advantages: Performs well when the number of parameters is greater than the number of samples [79]. Provides a more stable and interpretable model than the LASSO [79]. Useful for feature selection [79].
Disadvantages: Cannot be utilized for feature selection when there is a limited amount of data available because it overwhelms the data with too many model variables [79].
Discussion: This strategy provides the potentially important feature sets H and I, which may be used as parent sets for the feature union and intersection operations. However, when compared to the other ML algorithms of the filter approach, it has limitations when used in the regression process.

Appendix
See Table 11 and Table 12.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.