1 Introduction

The World Bank's Logistics Performance Index (LPI) is a well-known, practical source of information for policymakers judging a country's logistics performance [1]. The LPI, which assesses countries' logistics performance on a biennial basis from survey data, is perhaps the most important tool to emerge from trade facilitation. It delivers a macroeconomic overview of how policymakers may favorably affect global supply chain capabilities and the performance of relevant businesses, covering the efficiency of the clearance process, trade quality, and transportation-related infrastructure [2]. It would therefore be advantageous to have continuous, up-to-date information reflecting a country's logistical efficiency. Such information could track changes in the underlying variables and continuously evaluate trends or predict logistics efficiency, giving policymakers rapid access to projected logistics performance so they can improve the country's logistics and supply chain capabilities.

Prior research has shown that variables such as institutional reforms and resource enhancements significantly improve logistics performance. According to [3], countries with a low level of corruption and a stable political environment are more likely to have a high level of logistics performance, and improvements in the supply of resources tied to a country's competitiveness, such as infrastructure, technology, labor, and education, have a significant positive effect on performance. Similarly, [1] emphasizes that governance weaknesses and societal instability may reduce performance. However, the variables in these studies are static data, and certain components were themselves collected through surveys.

Moreover, several components, particularly economic ones, have demonstrated a significant link between a nation's logistics performance (expressed by LPI scores) and economic indicators such as the country's level of economic development [1, 4]. For example, GDP per capita is strongly associated with the LPI [1], and the LPI components have a substantial beneficial influence on expanding international commerce in terms of both export and import volume [5]. Nevertheless, substantial economic aspects that relate to logistics efficiency remain unstudied.

This study demonstrates how to benefit from up-to-date, dynamic economic big data, which supports the selection of economic attributes that indicate logistics performance as reflected by the LPI. The analytical approach applies machine learning (ML) for prediction or regression using adequate subsets of economic features. This matters because the accuracy of ML predictions relies not only on the model structure and its training algorithm but also on the feature space constructed from the initial feature set and the feature selection algorithm [6]. In ML applications, feature selection is often employed as part of the pre-processing phase to obtain a subset of features by discarding elements with little predictive information [7].

This study has two aims: (1) to determine the ideal subset of economic features that best represents the target variable for predicting a country's logistics performance and (2) to improve prediction accuracy by employing a set of alternative ML regression algorithms. It addresses two major research questions. First, can ML algorithms assist in selecting the proper subset of economic features that reflect a country's logistics performance? Second, what is the appropriate ML regression approach for predicting logistics performance based on selected economic attributes? The paper is structured as follows. Section 2 reviews the literature on feature selection and ML regression. Section 3 presents the methodology, covering the ML feature selection procedure, data sources and data preparation, data analysis, and parameter settings. Sections 4 and 5 provide the results and discussion, followed by concluding remarks and future work.

2 Literature review

2.1 Feature selection

Feature selection is the process of choosing a subset of important features (i.e., variables) for model building. It is well recognized that a subset of relevant features can improve model performance. With minimal information loss, feature selection aims to eliminate duplicate or superfluous features and features that are closely correlated in the data. It is often used to make the model more comprehensible and to improve generalization by decreasing variance [8]. The three forms of feature selection strategies are filter, wrapper, and embedded methods.

Filter techniques rely on statistical properties to rank all features; the majority of filter techniques treat feature selection as a ranking problem. These techniques are independent of the ML learning algorithm that will be used with the selected subset [9]. This category comprises mutual information-based, correlation-based, Chi-square test-based, and principal component analysis-based techniques. Filter techniques are often used on high-dimensional datasets because of their processing efficiency [8].

Wrapper techniques select feature subsets based on how useful they are to a certain predictor or classifier. Selection is viewed as a search problem in these techniques, with various feature combinations created, evaluated, and compared to one another. The search is driven by heuristic intelligent optimization techniques. To achieve good results, simplified methods such as sequential search, or evolutionary algorithms such as particle swarm optimization (PSO) or the genetic algorithm (GA), which generate locally optimal results and are computationally viable, are used [6]. As a selection criterion, these approaches use the performance of the inductive algorithm: they wrap feature selection around the learning algorithm and estimate the benefit of adding or deleting a feature using cross-validation [9]. Wrapper techniques tend to outperform filter methods because, in each cycle, a new prediction model or learning algorithm evaluates a different feature subset [10].

By embedding feature selection within model learning, embedded methods provide a trade-off between filter and wrapper methods. They return both the learned model and the selected features at the same time [11], and their learning and feature selection components cannot be separated [12]. Embedded techniques incorporate feature selection into the model training process; applying regularization while the model is being trained is a frequent example [6]. Regularization models such as sparse linear discriminant analysis, regularized support vector machines (SVM), and LASSO are the most commonly used embedded approaches [11]. As an instance of an embedded method, the LASSO technique penalizes the parameters of a linear model using an L1-norm penalty, shrinking the coefficients of less relevant features to zero [13]. Many new sparse learning approaches for multi-class classification have also been suggested, including L2,1-norm regularized regression models [11].

Since there are several methods for selecting features, in our work we used filter techniques to rank the predictors under consideration. Then, using well-known ML regression, we evaluated several candidate subsets of selected features with a particular learning algorithm and ranked those that performed best. Furthermore, this work compares the embedded techniques of penalized linear regression, which are helpful since they allow for simultaneous feature selection and prediction [14].

2.2 ML of regression

ML regression approaches are being adopted rapidly. The following fundamental techniques are utilized in the literature: artificial neural networks (ANN), SVM, and random forests (RF) [15]. These are effective data-driven techniques, and they have produced regression results in a variety of fields, including energy, the environment, waste and pollution, medicine, information technology, finance, and business and economics.

ANN is one of the most significant artificial intelligence approaches [16], and extensions of MLP-ANN models are possibly the most popular and frequently utilized in ML prediction. These models serve as a reliable tool for making precise predictions [17]. [18] used nonlinear regression techniques to anticipate ozone levels based on main pollutants and meteorological variables; the ANN results were satisfactory, and the method proved robust as a tool for evaluating and forecasting air quality. To improve simulation for medical operations, a versatile and reliable predictor of body size, shape, and ligament thickness is required; using clinical data, an ANN can predict patient conditions, and it produced more accurate results than traditional regression analysis methods [19]. MLP-ANN has also been validated for predicting yearly generation rates of household, commercial, and building and demolition wastes; the models demonstrated high prediction accuracy, making them useful for forecasting waste generation rates from various sources and potentially a cost-effective strategy for developing integrated municipal solid waste management systems [20]. In business and economics, researchers have suggested ANN regression models to cope with the challenge of predicting GDP growth; the ANN model predicted GDP growth rates significantly more accurately than a corresponding linear model [21]. Moreover, ANN models are appropriate tools for analyzing economic data such as GDP and GDP per capita since they allow for a trade-off between the capacity to predict these features and model size [21, 22]. MLP-ANN was used by [23] to predict customer quality in e-commerce social networks: using word-of-mouth marketing tactics, MLP-ANN produced a strong model for predicting which referrers will attract high-quality referrals (in terms of transaction volume). The MLP-ANN technique has also performed better in estimating building costs [16].

The use of SVM for classification and regression problems has grown significantly. Support vector regression (SVR) is the variant of SVM adapted for regression [24]. SVR techniques are typically used for prediction, and the results are usually satisfactory. One example is energy consumption prediction for a large office building, in which summer hourly cooling load statistics are used as the energy consumption data; in terms of accuracy, robustness, and generalization ability, the findings show that the proposed SVR-based technique outperforms commonly used methods [25]. [26] utilized ε-SVR and ν-SVR with linear, polynomial, radial basis function, and sigmoid kernels to predict software enhancement effort in information technology; the prediction accuracies of both types of SVR were statistically better than those of statistical regression. In banking, SVR and enhanced versions of SVR are used to forecast corporate bond losses in the event of default [27]; overall, the empirical findings indicate that SVR techniques are a promising method for banks to anticipate loss given default.

Random forest (RF) regression (RFR) is a widely used method for analyzing high-dimensional data. Its advantages may be reduced in sparse environments containing poor predictors, necessitating a pre-estimation dimension reduction (targeting) phase. Nonetheless, the approach is usually useful for prediction and frequently produces adequate outcomes. For example, [28] suggested RFR-based techniques for battery capacity estimation, with experimental findings demonstrating that the proposed technique can evaluate the health status of various batteries and is promising for online battery capacity estimation. Furthermore, when RFR performance was compared with multiple linear regression (MLR) techniques, RFR showed significantly better predictive potential than a typical linear regression model; RFR is regarded as a particularly promising approach for large-scale modeling of groundwater nitrate contamination [29]. In economics, [30] used RFR to estimate GDP at the town scale versus MLR, with the RFR model achieving considerably greater accuracy.

3 Methodology

3.1 ML feature selection process

Figure 1 depicts the combination of two major approaches in the feature selection process. First, we use filter approaches, namely correlation and principal component analysis (PCA), to identify candidate feature sets. Second, the embedded technique employs ML penalized linear regression, namely LASSO and Elastic-net (E-net) regression; for LASSO and E-net, the selected features can be validated directly once the model has been trained. For the candidate feature sets from the filter methods, the associated data are trained using ANN, MLP-ANN, SVR, RFR, and Ridge regression. LASSO and E-net provide accurate subset selection but do not achieve optimal prediction rates, so the feature sets selected by both penalized regression algorithms are also passed to the suggested ML regression methods for supervised training. The models based on the specified feature sets are then validated using the test dataset, and finally the models' performance is evaluated.

Fig. 1 The ML feature selection process

3.2 Data sources and data preparation

3.2.1 LPI of World Bank

The World Bank's Logistics Performance Index (LPI) is a biennially announced indicator for judging a country's logistics performance. For the research period from 2010 to 2018 (5 reporting periods), 134 countries have LPI information available (see Table 11 in the 'Appendix').

3.2.2 The economic statistics data of S&P global market intelligence

The economic features are derived from the macroeconomic data of S&P Global Market Intelligence's economic statistics. The economic and demographic statistics data provide the macroeconomic attribute, which includes 52 features divided into five categories: (1) market size and growth (total of 10 features), (2) macroeconomic stability (total of 17 features), (3) personal income and labor (total of 9 features), (4) external sector (total of 13 features), and (5) tax rates (total of 3 features).

3.2.3 Data preparation

For data preparation, over the 5 study periods from 2009 to 2018 (two-year averages to match the LPI data), an initial set of 26 features was selected that is available for 100 countries when mapped to the LPI data (see Table 11 in the 'Appendix'). Table 1 displays the 26 candidate economic features. This first feature selection is motivated by the trade-off between the number of instances and the number of missing values in the country economic data: the number of instances should be maximized, since larger datasets typically yield higher accuracy [31], while the number of missing values should be minimized, because missing values may introduce bias and reduce the efficiency of the analysis [32].

Table 1 The available 26 economic features

For the dataset of 500 instances (5 periods of 100 countries), 70% of the data is used as the training set for the ML methods, while the remainder is used to test or validate model performance. Only one missing item occurs in the dataset, and it is replaced with the mean value of the corresponding economic attribute.
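As a minimal illustration of this preparation step (the original analysis was not implemented in Python), the split and mean imputation can be sketched as follows; the file name and column names are hypothetical placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 500 rows (5 periods x 100 countries), 26 economic
# features plus the LPI score; file and column names are illustrative only.
df = pd.read_csv("lpi_economic_panel.csv")
feature_cols = [c for c in df.columns if c not in ("Country", "Period", "LPI")]

# Replace the single missing entry with the mean of that economic attribute.
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())

# 70% of the instances for training, the remaining 30% for validation/testing.
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["LPI"], train_size=0.7, random_state=42
)
```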

3.3 Data analysis and parameter setting

3.3.1 Feature selection

3.3.1.1 Correlation

In statistics, the Pearson correlation coefficient measures the linear correlation between two variables. It takes a value between − 1 and + 1, where + 1 represents a perfect positive linear correlation, 0 represents no linear correlation, and − 1 represents a perfect negative linear correlation [33]. The Pearson correlation coefficient r is defined as follows:

$$ r = \frac{\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)\left( {y_{i} - \overline{y}} \right)}{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2} } }, $$
(1)

where \(n\) is the sample size; \(x_{i}\) and \(y_{i}\) are the individual sample points indexed by \(i\); and \(\overline{x}\) is the sample mean, given by \(\frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} x_{i}\), and analogously for \(\overline{y}\).

The correlation matrix is used to select the subset of relevant features. In this study, inputs that are highly correlated with the output attribute are considered. [34] states that the correlation is weakly positive if \(r\) = 0 to 0.25, fairly positive if \(r\) = 0.25 to 0.5, good if \(r\) = 0.5 to 0.75, and excellent if \(r\) is greater than 0.75; the same ranking applies to negative correlations over \(r\) = [0, \(-\) 1]. The correlation routine of Microsoft Excel's data analysis tool is used to obtain the result; this tool is a cost-effective alternative to expensive software and is simple to use for basic data analysis.
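Continuing the hypothetical data frame from the earlier sketch (the study itself performed this step in Microsoft Excel), the threshold-based construction of the feature subsets described in Sect. 4.1.1 could be written as:

```python
# Pearson correlation matrix over the predictors and the LPI output.
corr = df[feature_cols + ["LPI"]].corr(method="pearson")

# Set A: predictors with a good or excellent correlation to LPI (|r| >= 0.5).
lpi_corr = corr["LPI"].drop("LPI")
set_a = lpi_corr[lpi_corr.abs() >= 0.5].index.tolist()

# Set B: set A plus predictors strongly correlated (|r| >= 0.5) with a member of set A.
set_b = set(set_a)
for feat in set_a:
    col = corr[feat].drop(labels=[feat, "LPI"])
    set_b.update(col[col.abs() >= 0.5].index)
set_b = sorted(set_b)
```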

3.3.1.2 PCA

PCA is arguably the most common multivariate statistical technique for reducing data with multiple dimensions, and it is frequently used, for example, to reduce well-being indicators to a single index of well-being. PCA was performed in this study using RStudio (version 4.1.1), which was used to extract the principal components of the input and output variables after the input and output datasets were centered and scaled. On the one hand, the factor loading criterion is set to 0.3, meaning that only variables with an absolute factor loading equal to or greater than 0.3 are considered; the varimax factor rotation approach was used to minimize the number of variables with high loadings on each factor and thus ensure factor interpretability, because a loading of less than 0.3 is regarded as insignificant [35]. On the other hand, PCA-biplots were created for feature selection [36]. PC1 is designated as dimension one and PC2 as dimension two. For visualization and validation, a biplot-based PCA technique is employed; a biplot is a type of statistical graph that can depict the relationships among multiple parameters [37], in which the projected variables appear as vectors. The attributes were centered and scaled during the PCA preprocessing.
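A rough Python counterpart of this step is sketched below with scikit-learn; note that scikit-learn's PCA does not apply varimax rotation, so the unrotated component weights stand in for the factor loadings, and `df`/`feature_cols` are the hypothetical objects from the earlier sketch:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Center and scale the input and output attributes, then extract the PCs.
cols = feature_cols + ["LPI"]
Z = StandardScaler().fit_transform(df[cols])
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_[:10])   # proportion of variance, PC1-PC10

# Keep attributes whose absolute weight on the retained components is >= 0.3
# (PC1-PC3 shown; unrotated weights approximate the factor-loading criterion).
weights = pca.components_[:3].T
selected = [c for c, row in zip(cols, weights) if np.any(np.abs(row) >= 0.3)]
```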

3.3.1.3 Penalized linear regression

Penalized regression models include, for example, the least absolute shrinkage and selection operator (LASSO) and Elastic-net (E-net). The high performance of LASSO and E-net stems from the fact that these models avoid overfitting and limit model complexity by penalizing the size of the coefficients [38]. These models perform feature selection and prediction simultaneously. [39] introduced LASSO, a well-known penalized technique for choosing individual variables, which is based on the following model:

$$ E\left( {y{|}X = x} \right) = \beta_{0} + \beta^{\prime}x. $$
(2)

The feature selection problem is expressed as finding the elements of \(\beta\) that equal zero. Estimates are selected by

$$ \mathop {{\text{min}}}\limits_{\beta } \sum \left( {y_{i} - \beta_{0} - \beta^{\prime}x_{i} } \right)^{2} , $$
(3)

subject to \(\sum \left| {\beta_{j} } \right| < t\),

where \(y_{i}\) is the dependent variable, \(x_{i}\) denotes the predictor variables, \(\beta_{0}\) is the intercept, and \(\beta^{\prime}\) contains the unknown parameters of the regression equation. Furthermore, \(t > 0\) is a tuning parameter that governs the degree of shrinkage applied to the estimates. This is equivalent to minimizing

$$ \mathop {{\text{min}}}\limits_{\beta } \sum \left( {y_{i} - \beta_{0} - \beta^{\prime}x_{i} } \right)^{2} + \lambda \sum \left| {\beta_{j} } \right|. $$
(4)

This is ordinary least squares with a penalty term, governed by \(\lambda\), on large coefficient estimates. The term \(\sum \left| {\beta_{j} } \right|\) constrains the coefficient vector and yields a sparse solution vector \(\beta_{\lambda }\); as \(\lambda\) increases, more members of \(\beta_{\lambda }\) become zero. LASSO can choose at most \(N - 1\) features, where \(N\) is the sample size [40], which may be an issue when performing a regression with a limited number of samples but a large number of features [41].

The elastic-net approach attempts to overcome the constraints of the LASSO technique; it is especially effective when there are numerous correlated features [42]. This model is equivalent to minimizing,

$$ \mathop {{\text{min}}}\limits_{\beta } \sum \left( {y_{i} - \beta_{0} - \beta^{\prime}x_{i} } \right)^{2} + \lambda \sum \left( {\alpha \left| {\beta_{j} } \right| + \frac{1}{2}\left( {1 - \alpha } \right)\left\| {\beta_{j}^{2} } \right\|} \right). $$
(5)

The value of \(\alpha\) in Elastic-net lies in the interval \(\left[ {0, 1} \right]\); when \(\alpha = 1\), Eq. (5) reduces to the LASSO formulation.

In this paper, the penalized linear regression analysis was carried out in RStudio using the glmnet package. In addition, the datasets were centered and scaled, and tenfold cross-validation was performed to produce internally valid performance metrics.
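The study performed this step with the glmnet package in R; an approximate scikit-learn sketch is given below, where glmnet's mixing parameter α corresponds to `l1_ratio` and `X_train`, `y_train`, and `feature_cols` come from the earlier hypothetical preparation sketch:

```python
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ten-fold cross-validated LASSO on centered and scaled inputs.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
lasso.fit(X_train, y_train)
coef = lasso.named_steps["lassocv"].coef_
lasso_features = [f for f, c in zip(feature_cols, coef) if c != 0.0]

# Elastic-net for several mixing values (glmnet's alpha = scikit-learn's l1_ratio).
enet_features = {}
for l1_ratio in (0.1, 0.25, 0.5, 0.75, 0.9):
    enet = make_pipeline(
        StandardScaler(), ElasticNetCV(l1_ratio=l1_ratio, cv=10, random_state=0))
    enet.fit(X_train, y_train)
    c = enet.named_steps["elasticnetcv"].coef_
    enet_features[l1_ratio] = [f for f, cj in zip(feature_cols, c) if cj != 0.0]
```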

3.3.2 Regression and validation

3.3.2.1 ANN

Three critical aspects influence an ANN: the units' input and activation functions, the network architecture, and the weight of each input connection [43]. It is composed of three levels of nodes (neurons), namely the input, hidden, and output layers (Fig. 2a). The data sample is accepted by the input layer, and the target category is returned by the output layer [44]. The neuron, the fundamental unit of these networks, mimics its human counterpart, having dendrites for taking input variables and emitting an output value that may be used as input for other neurons [45]. The neural network's layers of fundamental processing units are interconnected, with weights assigned to each connection [46], which are adjusted during the network's learning process. This step tunes not only the interconnections between the layers of neurons but also the parameters of the transfer functions between one layer and another, reducing errors. Finally, the neural network's final layer is in charge of combining all of the signals from the preceding layer into a single output signal, the network's response to specific input data [15].

Fig. 2 Architecture of a neural network

A basic ANN structure is depicted in Fig. 2b, which includes neuron connections, biases assigned to neurons, and weights assigned to connections. Two equations describe a neuron \(k\) [47]:

$$ y_{k} = f\left( {u_{k} + b_{k} } \right) $$
(6)

and

$$ u_{k} = \sum\limits_{i = 1}^{N} {w_{ki} x_{i} ,} $$
(7)

where \(x_{1}\), \(x_{2}\), …, \(x_{n}\) are the inputs, \(w_{k1}\), \(w_{k2}\), …, \(w_{kn}\) are the neuron weights, \(u_{k}\) is the result of weighted input calculation, \(b_{k}\) is the bias term, \(f\left( \cdot \right)\) is the activation function, and \(y_{k}\) is the output. There are numerous algorithms that may be used to train a network [43].
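For concreteness, Eqs. (6) and (7) for a single neuron can be evaluated with a few lines of Python; the inputs, weights, and bias below are arbitrary illustrative numbers, and tanh is used as the activation (it coincides with the TANSIG transfer function introduced in Eq. (8)):

```python
import numpy as np

def neuron_output(x, w, b):
    """Eqs. (6)-(7): weighted sum of the inputs plus bias, passed through tanh."""
    u = np.dot(w, x)           # Eq. (7): u_k = sum_i w_ki * x_i
    return np.tanh(u + b)      # Eq. (6): y_k = f(u_k + b_k)

y_k = neuron_output(x=np.array([0.5, -1.0, 2.0]),
                    w=np.array([0.2, 0.4, -0.1]),
                    b=0.1)
print(y_k)
```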

The MATLAB 2020b Neural Network Toolbox was used in this investigation. The ANN was built with the default network and parameters, using a hidden layer size of 1 \(\times\) 10 (one hidden layer with ten nodes). By default, a feed-forward ANN with backpropagation learning is built for prediction tasks; backpropagation remains the most commonly used supervised algorithm [48]. TRAINLM is a network training function that uses the Levenberg–Marquardt optimization technique to adjust the weight and bias variables; it is fast, although it requires more memory than other algorithms. LEARNGDM (gradient descent with momentum weight and bias learning function) is used for error minimization. This function computes the weight change for a specific neuron from the neuron's input and error terms, its weight and bias, the learning rate, and the momentum term, and corresponds to gradient descent with momentum backpropagation. The tangent sigmoid function (TANSIG) is used as the transfer function, given by the following equation for an input variable x [49]:

$$ TANSIG\left( x \right) = \left( {2/\left( {1 + e^{ - 2x} } \right)} \right) - 1. $$
(8)

TANSIG is employed in both the hidden and output layers; it calculates each layer's output from its net input. The values returned by this activation function range from − 1 to + 1.

3.3.2.2 MLP-ANN

The network architecture refers to the structure of connectivity between distinct neurons in ANN. One of the most frequent and useful ANN architectures is the multi-layer perceptron (MLP) network. Each neuron in MLP-ANN is linked to many of its neighbors, with variable weights indicating the relative importance of the individual neuron inputs to the other neurons. MLP is a type of network that belongs to the feed-forward ANN family, and its learning method is backpropagation [50].

In this study, the ANN multilayer perceptron is designed with hidden layer sizes of 10 \(\times \) 10 (ten hidden layers with ten nodes each). The MATLAB 2020b Neural Network Toolbox, with parameter settings similar to the abovementioned ANN, was used in this investigation.
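The study used MATLAB's toolbox with the Levenberg–Marquardt algorithm, which has no direct scikit-learn counterpart; the sketch below therefore only approximates the two architectures (1 × 10 and 10 × 10) with tanh activations and scikit-learn's default solver, reusing the hypothetical `X_train`/`y_train` split from Sect. 3.2.3:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ANN: one hidden layer with ten nodes, tanh (tansig-like) activation.
ann = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                                 max_iter=2000, random_state=0))

# MLP-ANN: ten hidden layers with ten nodes each.
mlp_ann = make_pipeline(StandardScaler(),
                        MLPRegressor(hidden_layer_sizes=(10,) * 10,
                                     activation="tanh",
                                     max_iter=2000, random_state=0))

ann.fit(X_train, y_train)
mlp_ann.fit(X_train, y_train)
print(ann.predict(X_test)[:5])   # predicted LPI scores for the first test rows
```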

3.3.2.3 SVR

SVR is an analytical technique used to explore the relationship between one or more predictor variables and a real-valued (continuous) dependent variable [51]. When addressing nonlinear problems, SVR uses a kernel function to map the nonlinear regression problem to a higher-dimensional space, where the best hyperplane for the sample points can be determined [24]:

$$ \begin{aligned} max & \left[ { - \frac{1}{2}\sum\limits_{{i = 1}}^{k} {\sum\limits_{{j = 1}}^{k} {\left( {a_{i} - a_{i}^{*} } \right)\left( {a_{j} - a_{j}^{*} } \right)K\left( {X_{i} ,X_{j} } \right)} } } \right. \\ & \left. {\quad - \sum\limits_{{i = 1}}^{k} {\left( {a_{i} - a_{i}^{*} } \right)\varepsilon + } \sum\limits_{{i = 1}}^{k} {\left( {a_{i} - a_{i}^{*} } \right)Y_{i} } } \right], \\ \end{aligned} $$
(9)

subject to \(\mathop \sum \nolimits_{i = 1}^{k} \left( {a_{i} - a_{i}^{*} } \right) = 0\), \(0 \le \left( {a_{i} - a_{i}^{*} } \right) \le \frac{C}{l}\) and \(i = 1,2, \ldots ,l\),

where \(X_{i}\) is the sample data; \(l\) is the sample size; \(C\) is the penalty coefficient; \(\varepsilon\) is the error tolerance beyond which samples are penalized; and \(K\left( {X_{i}, X_{j}} \right)\) is the kernel function used to obtain the optimal solution for \(a\).

In RStudio, the SVR is set up with the caret library for classification and regression training and the e1071 library for the regression machine: the epsilon-regression type is applied, the radial basis kernel is used for prediction, and the cost of constraint violation is left at its default (= 1). In the case of a probabilistic regression model, the fitted sigma parameter is the scale parameter of the hypothesized (zero-mean) Laplace distribution, estimated by maximum likelihood.
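An approximate Python counterpart of these settings is shown below (scikit-learn's SVR is an epsilon-SVR; its kernel width is set by `gamma` rather than the maximum-likelihood sigma estimate of the R implementation), again reusing the hypothetical training split:

```python
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Epsilon-SVR with a radial basis kernel and the default cost C = 1.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)
```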

3.3.2.4 RFR

The RF technique is a tree-based ensemble approach that was created to overcome the limitations of the classic classification and regression tree (CART) method [52]. RFR is an ensemble learning technique that employs regression algorithms and decision trees [53]. The RF regression technique employs regression trees as base learners: \(N\) bootstrapped sample sets are drawn from the source dataset to train the RF [52]. After the number of trees in the forest (\(C\)) is chosen, each regression tree is built on a different bootstrap sample, and only a limited, fixed number of randomly picked predictors (\(K\)) are considered as split candidates. The procedure is repeated until \(C\) such trees are formed, and new data are predicted by aggregating the \(C\) trees' predictions. An RF regression predictor is denoted as [52]:

$$ \hat{f}_{RF}^{C} \left( x \right) = \frac{1}{C}\sum\limits_{i = 1}^{C} {T_{i} \left( x \right),} $$
(10)

where \(x\) is the vectored input variable, \(C\) is the number of trees, and \(T_{i} \left( x \right)\) is a single regression tree created from a subset of input parameters and the bootstrapped samples.

For RFR in RStudio, we use the caret library for classification and regression training and the randomForest library, which implements Breiman's random forest method for classification and regression. The number of trees to grow is set to ntree = 500; this should not be set too low, in order to guarantee that every input row is predicted at least a few times [54]. The parameter mtry, the number of variables randomly selected as candidates at each split, is left at its regression default, and importance = TRUE is set.
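A rough scikit-learn analogue of these randomForest settings is sketched below; `max_features=1/3` approximates randomForest's regression default for mtry, and the feature importances play the role of `importance = TRUE`:

```python
from sklearn.ensemble import RandomForestRegressor

# 500 trees, as with ntree = 500; a third of the features tried at each split.
rfr = RandomForestRegressor(n_estimators=500, max_features=1 / 3, random_state=0)
rfr.fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)
print(rfr.feature_importances_)   # analogous to the importance measures in R
```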

3.3.2.5 Penalized linear regression

One kind of penalized linear regression is ridge regression. This approach has the potential to reduce the magnitude of the regression coefficients, resulting in improved generalizability for predicting unseen data [53]. The ridge coefficients are calculated using the following equation:

$$ \mathop {{\text{min}}}\limits_{\beta } \sum \left( {y_{i} - \beta_{0} - \beta^{\prime}x_{i} } \right)^{2} + \lambda \sum \left\| {\beta_{j}^{2} } \right\|. $$
(11)

Ridge regression has one apparent drawback: it includes all predictors in the final model. It shrinks all of the coefficients toward zero, but not exactly to zero [55]. LASSO and E-net regression are newer alternatives to Ridge regression that help address this limitation. As previously stated, this study uses LASSO and E-net regression for feature selection and prediction, whereas Ridge regression is employed only as a regression technique.

As with LASSO and E-net regression, the glmnet package in RStudio is also employed for Ridge regression, with centered and scaled datasets in pre-processing and tenfold cross-validation.
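For completeness, an equivalent Ridge setup in Python might look as follows, with the penalty strength chosen by cross-validation over an assumed grid (glmnet builds its own λ path):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge regression on centered and scaled inputs; lambda chosen by 10-fold CV.
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 50), cv=10))
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
```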

3.3.3 Performance evaluation

The mean absolute errors (MAE):

$$ MAE = \frac{1}{N}\sum\limits_{k = 1}^{N} {\left| {y\left( k \right) - \hat{y}\left( k \right)} \right|} , $$
(12)

mean absolute percentage errors (MAPE):

$$ MAPE = \frac{1}{N}\sum\limits_{k = 1}^{N} {\left| {\frac{{y\left( k \right) - \hat{y}\left( k \right)}}{y\left( k \right)}} \right| \times 100\% } , $$
(13)

Root-mean-square error (RMSE):

$$ RMSE = \sqrt {\frac{1}{N}\sum\limits_{k = 1}^{N} {\left( {y\left( k \right) - \hat{y}\left( k \right)} \right)^{2} } } , $$
(14)

Nash − Sutcliffe efficiency coefficient (NSE):

$$ NSE = 1 - \frac{\sum\nolimits_{k = 1}^{N} {\left( {y\left( k \right) - \hat{y}\left( k \right)} \right)^{2} } }{\sum\nolimits_{k = 1}^{N} {\left( {y\left( k \right) - \overline{y}} \right)^{2} } }, $$
(15)

and determination coefficient (R2):

$$ R^{2} = \frac{\left[ {\sum\nolimits_{k = 1}^{N} {\left( {y\left( k \right) - \overline{y}} \right)\left( {\hat{y}\left( k \right) - \overline{{\hat{y}}}} \right)} } \right]^{2} }{\sum\nolimits_{k = 1}^{N} {\left( {y\left( k \right) - \overline{y}} \right)^{2} } \sum\nolimits_{k = 1}^{N} {\left( {\hat{y}\left( k \right) - \overline{{\hat{y}}}} \right)^{2} } } $$
(16)

will be used to evaluate the performance of the models in the prediction validation step [56,57,58,59], where \(N\) is the number of validation samples, \(y\) is the observed LPI score with average value \(\overline{y}\), and \(\hat{y}\) is the predicted LPI score with average value \(\overline{{\hat{y}}}\).
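These criteria translate directly into code; the following self-contained helper, written from Eqs. (12)–(16), computes all five measures for vectors of observed and predicted LPI scores:

```python
import numpy as np

def evaluate(y, y_hat):
    """MAE, MAPE, RMSE, NSE and R2 as defined in Eqs. (12)-(16)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    mae = np.mean(np.abs(y - y_hat))
    mape = np.mean(np.abs((y - y_hat) / y)) * 100.0
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    nse = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    r2 = (np.sum((y - y.mean()) * (y_hat - y_hat.mean())) ** 2
          / (np.sum((y - y.mean()) ** 2) * np.sum((y_hat - y_hat.mean()) ** 2)))
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "NSE": nse, "R2": r2}

# Example usage: evaluate(y_test, svr.predict(X_test)) for any fitted model above.
```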

3.4 Analysis time

Using big-O analysis, we determine the theoretical computational time complexity of the ML models. O-notation expresses asymptotic computational behavior by bounding the worst-case computation time. Table 2 displays the time complexity of the studied ML models.

Table 2 Time complexity of models

4 Results and discussion

4.1 Feature selection result

4.1.1 The result of correlation method

Figure 3 depicts the result of the correlation study performed with the Microsoft Excel data analysis tool. For a regression-type prediction, the correlation matrix covers both the dependent and predictor variables. We begin by constructing a feature set of predictor variables that have a direct good or excellent correlation with the dependent variable LPI (\(r \ge\) 0.5, as shown by the red border in Fig. 3), namely set A. Set A's predictor variables are X4 (GDP_C; \(r =\) 0.75), X18 (Exp; \(r =\) 0.52), and X19 (Imp; \(r =\) 0.5), three features in total. Furthermore, the predictor variables that have a good or excellent correlation with a member of set A (\(|r| \ge\) 0.5, as shown by the yellow border in Fig. 3) are taken into account and added to form set B. The predictor variables added to set A to form set B are X2 (N_GDP; \(r =\) 0.83 (with Exp) and \(r =\) 0.93 (with Imp)), X16 (LF; \(r =\) 0.62 (with Exp) and \(r =\) 0.57 (with Imp)), X21 (FER; \(r =\) 0.6 (with Exp) and \(r =\) 0.52 (with Imp)), X22 (Iwd_DI; \(r =\) 0.54 (with Exp) and \(r =\) 0.59 (with Imp)), and X23 (Owd_DI; \(r =\) -0.56 (with Exp) and \(r =\) -0.6 (with Imp)). The total number of features in set B is therefore 3 + 5 = 8. Table 4 summarizes the subsets of features selected using the correlation approach.

Fig. 3 The result of correlation method

4.1.2 The result of PCA method

The PCA result is generated in RStudio, with the dependent variable and predictor variables used as PCA model inputs to select features for the regression [56]. Figure 4 depicts the proportion of variance of each principal component (only PC1 to PC10 out of a total of 27 PCs are shown). The first and second principal components (PC1 and PC2) capture 34.9 percent of the variance, whereas PC1 through PC10 encompass roughly 80 percent of the variation (81.41 percent). When PC1 to PC3 are considered, the variation is 46.33 percent, more than half of that captured by PC1 to PC10; when PC1 to PC5, half of the 10 PCs, are considered, the variance is 62.12 percent. To construct the sets of selected features, we examined the attributes with a high loading on a factor (equal to or greater than 0.3). The attributes detected in PC1 to PC3 (46.33 percent variance), PC1 to PC5 (62.12 percent variance), and PC1 to PC10 (81.41 percent variance) are allocated to feature sets C, D, and E, respectively. Set C comprises X1 (R_GDP_Gr), X2 (R_GDP_Gr), X4 (GDP_C), X5 (Pri_C_Gr), X11 (BB/GDP), X14 (BE/GDP), X15 (GNS_Rt), X18 (Exp), X19 (Imp), X22 (Iwd_DI), and X23 (Owd_DI), 11 features in total. Set D, with 16 features, is set C plus 5 features: X8 (CAB), X9 (CP_G), X12 (GDP_D), X21 (FER), and X24 (TB). Set E contains 24 of the overall 26 features, excluding X16 (LF) and X17 (CAB/GDP).

Fig. 4 The percentage of variation based on principal components

Furthermore, feature selection based on a PCA-biplot is illustrated in Fig. 5, with the selected features represented by blue vectors. The selection is motivated by the relationship of each feature to the LPI: the direction of a feature vector reflects a positive or negative correlation [65]. A feature whose vector points in a similar direction, with the smallest angle relative to the LPI vector, indicates the strongest positive correlation, while the opposite direction indicates a negative correlation. Vectors close to perpendicular to the LPI vector, on the other hand, are weakly correlated (orange vectors in Fig. 5). Based on the PCA-biplot, the selected features form set F, i.e., X2 (R_GDP_Gr), X4 (GDP_C), X16 (LF), X18 (Exp), X19 (Imp), X21 (FER), X22 (Iwd_DI), X23 (Owd_DI), and X25 (MMI_Rt), nine features in total. A summary of the feature subsets selected using the PCA method is shown in Table 4.

Fig. 5 PCA-based biplot

4.1.3 The result of penalized linear regression method

Table 3 displays the results of the LASSO and E-net penalized linear regression methods. Using RStudio, the LASSO regression model of Eq. (4) reduces the predictor parameters from 26 to 10, with the corresponding intercept values and parameter significance.

Table 3 The results of penalized linear regression method

The 9 features selected by LASSO (set G) are X2 (R_GDP_Gr), X3 (Pop_Gr), X4 (GDP_C), X9 (CP_G), X13 (PD/GDP), X18 (Exp), X19 (Imp), X25 (MMI_Rt), and X26 (DC_Gr). For Elastic-net, related to Eq. (5), we vary \(\alpha\) over 0.1, 0.25, 0.5, 0.75, and 0.9. The feature selection results from RStudio, giving the parameters that the model does not shrink to zero, are displayed in Table 3. When \(\alpha\) = 0.9, the set of selected features is similar to the LASSO result. When \(\alpha\) is set to 0.25, 0.5, or 0.75, the runs yield a common set of 10 selected features (set H). Finally, for \(\alpha\) = 0.1, 15 parameters are not shrunk to zero (set I). Set H contains all attributes of set G with X15 (GNS_Rt) added, and set I comprises all elements of set G plus X1 (R_GDP_Gr), X14 (BE/GDP), X17 (CAB/GDP), X20 (NDIF), and X21 (FER). A summary of the feature subsets selected using the penalized linear regression method is shown in Table 4.

Table 4 Summary of feature selection

4.2 Regression and validation result

According to the subsets of selected features (set A to set I), 70% of the dataset is used to train the identified ML methods, namely ANN, MLP-ANN, SVR, RFR, and Ridge. The LASSO and E-net models, in turn, are trained using only the feature sets that they themselves selected. In addition, the full collection of all features is included for comparison. The test sets are then used to validate the models, and the validation results are reported with the performance criteria MAE, MAPE, RMSE, NSE, and R2.

A summary of the performance evaluation is given in Table 5 and Fig. 6 for easier comparison. SVR with feature set I has the best MAE performance (*0.1349, the minimum value), and the same model and set also have the lowest MAPE (*4.6387). For RMSE, ANN with the same feature set I achieves the best performance (*0.1808, the minimum value). NSE values range between \(- \infty\) and 1, with values close to 1 indicating the best prediction performance [57]; SVR with feature set I achieves the best NSE (*0.8938). Moreover, SVR with feature set I has the highest R2 (*0.8964, the maximum value); if R2 > 0.8, there is a strong correlation between actual values and model estimates [16].

Table 5 Summary of ML regression model performance evaluation
Fig. 6 Summary of ML regression model performance evaluation: (a) MAE, (b) MAPE, (c) RMSE, (d) NSE, (e) R2

When the ML models are examined using the average value over all feature sets, the best-performing model is ANN, with average MAE, MAPE, RMSE, and NSE values of **0.1525, **5.1775, **0.1985, and **0.8691, respectively. Compared with ANN, the MLP-ANN, SVR, and RFR models show satisfactory performance on all criteria, whereas all penalized linear regression methods (Ridge, LASSO, and E-net) perform worse than each of these admissible models. Focusing on the sets of selected features, we determined the average performance of the admissible ML models, omitting the penalized linear regression approaches. Set H gives the best performance for MAE, MAPE, and R2 (***0.1497, ***5.1452, and ***0.8803, respectively), whereas set C gives the best performance for RMSE and NSE (***0.1975 and ***0.8728, respectively).

Because a different perspective or set of criteria produces a different set of optimal results, we reprocessed the results using feature union and intersection operations; such a union and intersection procedure is a standard way to reorganize acquired feature subsets [66]. Table 6 displays the new feature sets based on sets C, H, and I. Certain feature sets are identical and can be merged during reprocessing, such as C ∪ I with C ∪ H ∪ I and C ∩ H with C ∩ H ∩ I; others coincide with an existing set and are not reprocessed, such as H with H ∩ I and I with H ∪ I. Table 7 summarizes the ML regression model performance evaluation for the reprocessed feature sets together with their parent sets.
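As a simple illustration, this reprocessing can be reproduced with ordinary Python set operations; the parent subsets below are transcribed from Sect. 4.1 (feature names only) and the sketch is purely illustrative:

```python
# Parent subsets from Sect. 4.1 (feature names only, per Table 4).
set_g = {"X2", "X3", "X4", "X9", "X13", "X18", "X19", "X25", "X26"}   # LASSO
set_h = set_g | {"X15"}                                               # E-net, alpha 0.25-0.75
set_i = set_g | {"X1", "X14", "X17", "X20", "X21"}                    # E-net, alpha 0.1
set_c = {"X1", "X2", "X4", "X5", "X11", "X14", "X15",
         "X18", "X19", "X22", "X23"}                                  # PCA, PC1-PC3

# Reprocessed subsets of Table 6 via union and intersection.
reprocessed = {
    "C ∪ H": set_c | set_h, "C ∪ I": set_c | set_i, "H ∪ I": set_h | set_i,
    "C ∩ H": set_c & set_h, "C ∩ I": set_c & set_i, "H ∩ I": set_h & set_i,
    "C ∪ H ∪ I": set_c | set_h | set_i, "C ∩ H ∩ I": set_c & set_h & set_i,
}
print(sorted(reprocessed["C ∪ H"]))   # the 16 features of the best-performing set
```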

Table 6 The feature set based on set C, H, and I union and intersection operation
Table 7 Summary of ML regression model performance evaluation of novel feature set

For MAE, the best performance is obtained by ANN with feature set C ∪ H (*0.1318), and the same model and set also perform best in MAPE (*4.4512), RMSE (*0.1723), NSE (*0.9017), and R2 (*0.9033). When the ML models are examined using the average value over all new and parent feature sets, the top-performing model is ANN, with average MAE, MAPE, RMSE, NSE, and R2 values of **0.1412, **4.7946, **0.1874, **0.8834, and **0.887, respectively, while SVR is second. [67] noted that when the relationships between parameters become noninvertible (due to a large number of predictor variables), the input and output configurations used in an ANN have a major influence on accuracy. Furthermore, ANN outperforms linear models in terms of accuracy when the number of important predictor variables is restricted [68].

Moreover, concentrating on the sets of selected features, we determined the average performance of the four suitable ML models described above. The best performance is exhibited by set C \(\cup \) H, which offers the best values for all performance metrics, namely MAE, MAPE, RMSE, NSE, and R2 of ***0.1463, ***5.0066, ***0.1908, ***0.8811, and ***0.888, respectively. The members of set C \(\cup \) H that affect the accuracy of the predicted LPI are X1 (R_GDP_Gr), X2 (R_GDP_Gr), X3 (Pop_Gr), X4 (GDP_C), X5 (Pri_C_Gr), X9 (CP_G), X11 (BB/GDP), X13 (PD/GDP), X14 (BE/GDP), X15 (GNS_Rt), X18 (Exp), X19 (Imp), X22 (Iwd_DI), X23 (Owd_DI), X25 (MMI_Rt), and X26 (DC_Gr), 16 features in total. As previously stated, the number of instances should be maximized because accuracy is normally high with large datasets, and the number of missing values should be minimized because this reduces bias and improves the efficiency of the analysis; limiting the number of features can support both goals. Considering the parent sets C and H limits the number of features to 11 and 10, respectively, and those features may be used as an alternative since they give adequate performance (closest to the best). A further observation is that set C is supplied by the PCA method, which performed better than the other selection algorithms in many of the runs, while set H is provided by the penalized linear E-net regression, a technique with accurate subset selection but suboptimal prediction rates.

Based on the errors shown in the boxplots (Fig. 7), the four acceptable ML models do not differ much from each other for the best-performing set C ∪ H and the parent sets C and H, with RFR circled as having the largest error values; the extreme error levels of all models are nevertheless nearly the same. Furthermore, Taylor diagrams were created to evaluate the acquired results, as they allow the correctness of the developed models to be assessed in several respects [57, 58]. Figure 8 clearly shows that the predictions for set C ∪ H and the parent sets C and H from the four acceptable ML models are close to the observations. What is interesting in Fig. 8c is that the ANN model outperformed the other models for set C ∪ H, giving the shortest distance to the observation point. The statistical significance of the results was examined using the Kruskal–Wallis test, together with an analysis of whether the predicted and observed logistics performance index distributions were consistent [57, 69]. H0 denotes the hypothesis of a statistically significant difference between mean predicted and observed LPI values. Table 8 reveals that the H0 hypothesis was rejected (P value \(\ge \) 0.05) for all C, H, and C ∪ H set predictions; in other words, there is no significant difference between predicted and observed averages. Since the H0 hypotheses were rejected in all cases, the ANN, MLP-ANN, SVR, and RFR models can all be considered to produce accurate results, suggesting that the pre-processing of data preparation and feature selection had a statistically significant beneficial influence on the ML predictions.
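A minimal sketch of such a Kruskal–Wallis comparison with SciPy is given below; the exact grouping and software used in the study are not detailed here, so this assumes a simple comparison of observed and predicted LPI scores on the test set:

```python
from scipy.stats import kruskal

# y_test: observed LPI scores; y_pred_svr: a fitted model's test-set predictions
# (from the earlier SVR sketch).
stat, p_value = kruskal(y_test, y_pred_svr)
print(f"Kruskal-Wallis H = {stat:.3f}, p = {p_value:.3f}")
```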

Fig. 7 Boxplot diagrams of the best performance set C \(\cup \) H and the parent sets C and H: (a) set C, (b) set H, (c) set C \(\cup \) H

Fig. 8 Taylor diagrams of the best performance set C \(\cup \) H and the parent sets C and H: (a) set C, (b) set H, (c) set C \(\cup \) H

Table 8 Result of Kruskal–Wallis test (95% significance level)

Table 9 shows the analysis time of the ML algorithms as reported by the analysis tools (MATLAB reports values in seconds and RStudio in milliseconds). The average analysis time of the MLP-ANN training procedure was the longest (approximately eight seconds for all sets); all other procedures took less than a second.

Table 9 Analysis time of ML algorithms (second)

4.3 Discussion

Finally, as shown in Table 10, we discuss the findings for both the filter and embedded feature selection techniques, focusing on the suggested statistical properties and ML algorithms. The discussion describes the advantages and disadvantages of the models that influence the findings of this study.

Table 10 Finding discussion based on the study model

To obtain good results, wrapper strategies employ methods such as sequential search or evolutionary algorithms such as particle swarm optimization (PSO) and the genetic algorithm (GA), which provide locally optimal solutions and are computationally viable. However, because of the risk of overfitting and the computational cost [72], wrappers have a significant disadvantage, particularly in terms of computational inefficiency, which becomes more pronounced as the feature space grows. The wrapper technique is therefore excluded from this analysis, although it will be considered in future studies.

5 Concluding remarks and future work

In conclusion, the current study illustrates an application of machine learning regression combined with feature selection. We examined logistics performance using the World Bank's LPI and economic attributes from S&P Global Market Intelligence's macroeconomic data source. The 500 case samples span 2009 to 2018, with an initial set of 26 accessible economic features. In the first feature selection, the number of instances (to be maximized) was traded off against the missing values in the country economic data (to be minimized). The filter methods of correlation and PCA are employed in the suggested feature selection procedure, and the ML regression algorithms ANN, MLP-ANN, SVR, RFR, and Ridge are then used to train and validate the dataset for each selected feature set. The embedded technique of penalized linear regression, LASSO and E-net, is also used to select features, followed by training and validation of the dataset; the proposed ML regression methods then train and validate the dataset on the feature subsets chosen by the penalized linear regressions. According to the models' performance under the MAE, MAPE, RMSE, NSE, and R2 criteria, the feature sets from PCA (set C) and E-net (sets H and I) offer the closest acceptable performance.

Then, using the parent sets (C, H, and I), feature union and intersection operations are performed. The set C ∪ H (16 features in total) performs best across all criteria. These findings address the first research question: ML algorithms can select an appropriate set of economic features that reflect a country's logistics performance. In response to the second question, regarding the best ML regression technique for predicting logistics performance from selected economic attributes, the findings indicate that ANN is the most effective model for prediction in this study. Furthermore, we note that sets C and H limit the number of features to 11 and 10, respectively; those features may be used as an alternative, since they give adequate performance (near the best) when it is necessary to maximize the number of instances and reduce the missing data in the dataset.

In future work, the focus may be on utilizing more diverse feature dimensions integrated with the economic attributes. Elements connected to global megatrends, such as the carbon emission rate, the cost and consumption of fuel and renewable energy, and e-commerce market size and growth, may also reflect logistics performance in the new era of the global supply chain. The enrichment work also extends to the wrapper technique.