Introduction

In machine learning, a regression problem refers to estimating a real-valued continuous response (output) based on the values of one or more input variables. By determining the relationships between output and input variables, a regression method numerically predicts a target value. In the literature, various regression techniques have been introduced for a wide range of machine learning problems. Among them, k-nearest neighbor regression (KNNreg) (Benedetti 1977; Stone 1977; Turner 1977) has become one of the most widely used regression techniques due to its simplicity and robustness (Buza et al. 2015). This method is an adapted version of the k-nearest neighbor (KNN) model that was initially introduced by Cover and Hart (1967) for classification problems. The main idea of KNNreg is to predict the output value for a given test sample by averaging the output values of its nearest neighbor samples (Hu et al. 2014).

Though the KNN method has many significant advantages, it suffers from some weaknesses, for example, giving equal importance to all nearest neighbors (even if some of them are quite far from the test sample) in the classification process. To improve the model and alleviate such issues, Keller et al. (1985) introduced the idea of using the degree of membership in the KNN method to propose its fuzzy version, called the fuzzy k-nearest neighbor (FKNN) classifier. Thanks to its capability of tackling uncertainty issues in the data, the FKNN model has proven promising for classification problems (Chen et al. 2013; Yu et al. 2002) compared to the classical KNN method. Although the FKNN classifier has received much attention in terms of classification, it has received less attention in the context of regression. This motivated us to establish the fuzzy k-nearest neighbor regression (FKNNreg) model in this research by modifying the original FKNN rule.

Typically, the distance metric is one of the main components of distance-based methods such as the KNN and FKNN methods (Rastin et al. 2021). Even though the Euclidean distance is the most common distance metric used in such methods to measure the similarity between two data samples, it is not optimal for every problem domain (Cai et al. 2020; Nguyen et al. 2016). Several research papers have reported better results with a more general choice of distance metric (Chang et al. 2006; Dettmann et al. 2011; Jenicka and Suruliandi 2011; Kaski et al. 2001; Koloseni et al. 2012, 2013). Besides, the Euclidean distance has several drawbacks. For example, two data samples that have no feature values in common might have a shorter distance than another pair of samples containing the same feature values (Shirkhorshidi et al. 2015). These facts encouraged us to examine the effectiveness of the Minkowski distance in the FKNN rule in the regression setting for low- and high-dimensional datasets.

The main goal of this study is to introduce the FKNNreg using the Minkowski distance metric and to examine its efficiency. The combination of the Minkowski distance metric and the FKNNreg has not been studied in the literature before. This led us to create the Minkowski distance-based fuzzy k-nearest neighbor regression (Md-FKNNreg) algorithm. A key advantage of this method is that the nearest neighbors are weighted by fuzzy weights according to their similarity to the test sample, so that the prediction is a weighted average in which closer neighbors count more. Also, the Minkowski distance allows greater flexibility in selecting relevant neighboring samples close to the target sample.

Most available regression models (e.g., multiple linear regression [MLR], least absolute shrinkage and selection operator [LASSO] regression) are based on assumptions regarding the distribution of the data. However, it is rarely confirmed that these assumptions apply to real-world problems. That being said, an interesting fact about the KNNreg methods is that they do not explicitly make any assumptions about the underlying data (Yao and Ruzzo 2006) or the model’s components and simply use training data to make predictions. Another advantage is that they are, in general, relatively easy to implement and interpret and can potentially be applied even for non-linear problems (Hu et al. 2014). Moreover, support vector regression (SVR) is recognized as one of the well-known methods applied for non-linear regression problems. However, its utilization is restricted in various disciplines due to the difficulty of selecting suitable parameters for the model (Liu et al. 2013). In this regard, FKNNreg methods could be better alternatives in the regression context, and the proposed new KNNreg method is found to be significant for non-linear regression problems.

To study the performance of the proposed Md-FKNNreg model, we conducted an experiment using real-world data from various applications. We compared the regression performance of the proposed variant with the KNNreg, LASSO, SVR, and MLR models. In addition, the Manhattan distance-based fuzzy k-nearest neighbor regression (Man-FKNNreg) and Euclidean distance-based fuzzy k-nearest neighbor regression (Euc-FKNNreg) methods were also implemented, and the results were compared. To evaluate the regression performance, we used the root mean square error (RMSE) and the coefficient of determination (\(R^2\)) as the evaluation metrics. We also tested whether there was a statistically significant difference between the regression results for the Md-FKNNreg and baseline methods.

The main contributions of this paper can be summarized as follows:

  1. We propose a new regression approach based on the FKNN algorithm.

  2. We introduce the Minkowski distance into the nearest neighbors search in the proposed algorithm and investigate its efficiency and robustness.

  3. We demonstrate the performance of the proposed regression model on low- and high-dimensional real-world data coming from different domains.

  4. We analyze, compare, and benchmark the regression results of the proposed method with selected well-known state-of-the-art regression methods.

The remainder of this paper is organized as follows. Section 2 discusses the background information related to the present study. Section 3 briefly provides the theoretical underpinning of the KNNreg and FKNNreg models and the Minkowski distance measure. Section 4 proposes the Md-FKNNreg method. Section 5 introduces the data used and the experiment setting for the proposed method and presents and discusses the empirical results obtained with the proposed method and benchmarks. Section 6 summarizes the main findings and provides concluding remarks.

Related work

The KNNreg model has the potential to tackle linear and non-linear problems in an effective way (Cai et al. 2020) and performs especially well in a high-dimensional space. Accordingly, the growing popularity of the KNNreg method can be seen in various fields, including renewable energy (Hu et al. 2014; Huang and Perry 2016; Zhou et al. 2020), physics research (Durbin et al. 2021), biological studies (Yao and Ruzzo 2006), transportation (Cai et al. 2020; Dell’Acqua et al. 2015), robotics (Chen and Lau 2016), and telecommunication (Adege et al. 2018). In addition, some studies have also employed the KNNreg model with other approaches to develop effective hybrid models for specific applications. For example, Chen and Hao (2017) proposed an integrated framework by employing support vector machine (SVM) and KNNreg for stock market prediction. Salari et al. (2015) also presented a novel hybrid approach with a combination of a genetic algorithm (GA), the KNNreg method, and artificial neural network (ANN) for classification problems. Cheng et al. (2019) utilized the same idea as the KNNreg to introduce a novel approach for missing value imputations. Furthermore, the simplicity and strength of the KNNreg algorithm have encouraged researchers to develop different enhanced variants (for examples, see Buza et al. 2015; Guillen et al. 2010; Nguyen et al. 2016; Song et al. 2017) and to construct mathematical estimations (Biau et al. 2012).

An ideal distance measure must have the ability to precisely detect the similarity between two samples while allowing the researchers to understand how to compare, classify, or cluster those samples. Therefore, such metrics have great potential to influence the outcomes of the models used (Bergamasco and Nunes 2019). Accordingly, some previous studies focused only on which similarity measure best fit the particular situation (for examples, see Rodrigues 2018; Moghtadaiee and Dempster 2015; Huo et al. 2021). The Minkowski distance is the most investigated measure among the frequently applied techniques for measuring the similarity between instances in machine learning-based applications (Bergamasco and Nunes 2019; Cordeiro and Makarenkov 2016; Gueorguieva et al. 2017). The Minkowski distance is the main focus of this research because it offers the opportunity to compute the distance between two instances in several different ways and holds several well-known distances as special cases, e.g., the Manhattan and Euclidean distances.

Fuzzy theory, originally introduced by Zadeh (1965), can operate under uncertainty and has advanced in many different ways in various applications (for examples, see Chen et al. 1990; Chen and Hsiao 2005; Chen and Chen 2007; Chen and Chang 2010; Chen et al. 2009; Horng et al. 2005; Zeng et al. 2019). The FKNN classifier (Keller et al. 1985) was derived from fuzzy theory and has been one of the most effective techniques in supervised machine learning tasks. Nikoo et al. (2018) applied the FKNN classifier to a regression application without modifying its original algorithm explicitly (i.e., it was operated as a classification task). However, to the best of our knowledge, no one has attempted to adapt the FKNN rule itself to the regression setting. Thus, the effectiveness of FKNNreg for machine learning applications requires further investigation.

Preliminaries

This section briefly discusses the KNNreg method, the FKNN method, and the Minkowski distance measure.

K-nearest neighbor regression

KNNreg (Benedetti 1977; Stone 1977; Turner 1977) is a simple, effective, and robust nonlinear regression method. The basic idea of KNNreg is to predict an output value for a given input sample based on a fixed number (k) of its nearest neighbors found among the input-output training samples. The k is a smoothing parameter, and its value controls the adaptability of the KNNreg method (Hu et al. 2014). KNNreg does not require an explicit training step beyond storing the inputs and outputs of the initial dataset, which is a distinctive property. The notion of the KNNreg model can be formally defined as follows.

Let \(T=\{(X_i, y_i)\}_{i=1}^N\) be a training dataset with N samples, where \(X_i=\{x_1^i, x_2^i, \ldots , x_m^i\}\in \mathbb {R}^m\) is an input sample i from m-dimensional feature space, and its output value (response variable) is \(y_i \in Y\), where \(Y=\{y_1, y_2, \ldots , y_N \}\) denotes the set of output values. For a given new data sample X, the goal is to learn the predictor function h(X) from the training dataset such that \(\hat{y}\approx h(X)\), where \(\hat{y}\) is the estimated value for the output y of X. The KNNreg starts with measuring the distance (d) between the test sample X and each sample \(X_i\) in T. In this case, the Euclidean distance is the most commonly adopted distance metric, and its formulation for the distance between \(X=\{x_1, x_2, \ldots , x_m\}\) and \(X_i\) is presented by Eq. (1).

$$\begin{aligned} d(X, X_i) = \sqrt{\sum _{j=1}^{m}(x_{j} - x_{j}^i)^2}. \end{aligned}$$
(1)

Next, the set of k nearest neighbors, \(N_X^{k}=\{(X_i, y_i)\}_{i=1}^k\) for X, is found from the reordered training samples in T according to the increasing Euclidean distances. Finally, the output value y for X is estimated by taking the arithmetic mean of the output values (\(y_1, y_2, \ldots , y_k\)) of the nearest neighbors (Song et al. 2017; Biau et al. 2012; Györfi et al. 2002) as follows:

$$\begin{aligned} \hat{y} = \frac{\sum _{j=1}^{k}y_j}{k}. \end{aligned}$$
(2)

This is based on the assumptions that training samples in the \(N_X^{k}\) have similar output values to h(X) (Kramer 2011) and also that all nearest neighbors in the \(N_X^{k}\) have equal importance in the prediction (Cover and Hart 1967).
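As an illustration of Eqs. (1) and (2), the following minimal Python sketch ranks the training samples by Euclidean distance and averages the outputs of the k closest ones. The function names and the toy data are hypothetical and chosen only for this example; they are not taken from the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (Eq. 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_regress(train_X, train_y, query, k):
    """Mean output of the k training samples closest to `query` (Eq. 2)."""
    # Rank the training samples by increasing distance to the query.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(query, pair[0]))
    neighbors = ranked[:k]
    return sum(y for _, y in neighbors) / k


# Toy data (made up): three samples near the origin and one far away.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_regress(train_X, train_y, [0.1, 0.1], k=3)
```

With k = 3 the distant sample is excluded, and the three remaining outputs contribute equally to the prediction regardless of how far each one is from the query.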

Fuzzy k-nearest neighbor classification method

Unlike the KNN algorithm, the FKNN method applies a weighting scheme in the decision rule that is based on the distances between the test sample and the nearest neighbor samples. Put differently, the FKNN model computes a membership of the test sample in each class and makes the class decision according to the highest membership degree (Keller et al. 1985). These fuzzy memberships have excellent potential for accurate predictions (Kumbure et al. 2020). The membership degree of a given new sample X in a class i that is represented by the k nearest neighbors (see Footnote 1) is measured as follows:

$$\begin{aligned} u_i(X)=\frac{\sum _{j=1}^k u_{ij}(1/\left\| X-X_j\right\| ^{2/(q-1)})}{\sum _{j=1}^{k}(1/\left\| X-X_j\right\| ^{2/(q-1)})}, \end{aligned}$$
(3)

where \(q\in (1, +\infty )\) is the fuzzy strength parameter that controls how heavily the Euclidean distance \(\Vert X-X_j\Vert\) between X and \(X_j\) is weighted when computing the contribution of each nearest neighbor to the membership value. Also, \(u_{ij}\) is the membership of the training sample \(X_j\) in class i among the k nearest neighbors. Two methods are used to measure the \(u_{ij}\): crisp memberships and fuzzy memberships. More details about these methods can be found in the work by Chen et al. (2011).
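The membership rule in Eq. (3) can be sketched as follows for the crisp case, in which \(u_{ij}\) is 1 when neighbor j carries label i and 0 otherwise. This is a hedged illustration: the function name, the default \(q=2\), the small constant `eps` that guards against a zero distance, and the toy data are assumptions made for this example only.

```python
import math


def fknn_memberships(neighbors, labels, query, classes, q=2.0, eps=1e-12):
    """Class memberships of `query` from its k nearest neighbors (Eq. 3).

    Crisp u_ij is assumed: a neighbor belongs fully to its own class.
    `eps` is an illustrative guard against division by zero.
    """
    exponent = 2.0 / (q - 1.0)
    weights = []
    for x in neighbors:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, x)))
        weights.append(1.0 / max(dist, eps) ** exponent)
    total = sum(weights)
    return {c: sum(w for w, lab in zip(weights, labels) if lab == c) / total
            for c in classes}


# Toy neighbors (made up) at distances 1, 2, and 2 from the query.
neighbors = [[0.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
labels = ["a", "b", "b"]
memberships = fknn_memberships(neighbors, labels, [0.0, 0.0], ["a", "b"])
predicted_class = max(memberships, key=memberships.get)
```

Although class "b" holds two of the three neighbors, the single closer neighbor of class "a" receives the larger inverse-distance weight, so class "a" obtains the higher membership in this example.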

Minkowski distance

The Minkowski distance measure (also called the \(L_p\) norm) is a family of distance functions parameterized by p. For two given samples \(X_i\) and \(X_j\) where \(X_i=\{x_1^i, x_2^i, \ldots , x_m^i\}\in \mathbb {R}^m\) and \(X_j=\{x_1^j, x_2^j, \ldots , x_m^j\}\in \mathbb {R}^m\), the Minkowski distance metric is defined as follows:

$$\begin{aligned} d_{Md}(X_i, X_j) = \left (\sum _{t=1}^{m}\vert x_t^i- x_{t}^j\vert ^p\right )^{1/p} \text { for}\quad p\ge 1. \end{aligned}$$
(4)

From this metric, we can specify different distance functions by changing the value of p. For example, we can obtain the Manhattan distance (also known as the city block distance or \(L_1\) norm) by setting \(p=1\) and the Euclidean distance, also referred to as \(L_2\) norm (see also Eq. (1)) by setting \(p=2\).
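A short Python sketch of Eq. (4) makes the special cases concrete. The function name and the sample vectors are illustrative assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p >= 1 between two feature vectors (Eq. 4)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, 1)   # p = 1: |3| + |4|
euclid = minkowski(a, b, 2)      # p = 2: sqrt(3^2 + 4^2)
higher_order = minkowski(a, b, 5)
```

For a fixed pair of samples the distance does not increase as p grows, which is why different values of p can change the ranking of neighbors and therefore which samples are selected.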

Proposed fuzzy k-nearest neighbor regression model using Minkowski distance

In this research, we focus on the fuzzy k-nearest neighbor regression. Given this, we define the FKNN method for regression together with the Minkowski distance. In this way, the novel regression method, Md-FKNNreg, is introduced. This method aims to achieve a reliable prediction by allowing the parameter of the Minkowski distance to be adapted to the particular dataset. The procedure of the Md-FKNNreg method consists of four main steps: measuring the distances, identifying the nearest neighbors, computing the fuzzy weights, and making the prediction. The detailed process of this method is presented below using the same notation as in Sect. 3.1.

Step 1: Determine the Minkowski distance \(d_{Md}(X, X_j)\) between X and each training sample \(X_j\) in T according to: \(d_{Md}(X, X_j) = \big (\sum _{t=1}^{m}\vert x_{t}- x_{t}^j\vert ^p\big )^{1/p}\).

Step 2: Find the set of k nearest neighbors \(N_X^{k}\) from the training data samples ranked in order of increasing Minkowski distance. Here, we used a grid-based search to find the values of the Minkowski distance parameter p and of k that best fit a particular dataset.

Step 3: Calculate the fuzzy weight (\(w_j\)) for each nearest neighbor j using \(d_{Md}(X, X_j)\) as follows:

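$$\begin{aligned} w_j =\frac{1}{\big (d_{Md}(X, \,X_j)\big )^{\frac{2}{q-1}}} \text {, for } j=1, 2, \ldots , k, \end{aligned}$$
(5)

where \(q\in (1, +\infty )\) is the fuzzy strength parameter, and \(\frac{2}{q-1}\) is the fuzziness exponent. As in Eq. (3), the weight is the reciprocal of the distance raised to this exponent, so closer neighbors receive larger weights. The closer q is to 1, the larger the exponent and the more sharply the weights favor the closest neighbors; the larger q is, the more uniform the weights become.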

The purpose of these weights is to define a linear predictor for the output value y such that \(h(X)=W^TY\), where \(W=\{w_1, w_2, \ldots , w_k\} \in \mathbb {R}^k\) and Y here collects the outputs of the k nearest neighbors. After normalization by the sum of the weights, as in Eq. (6), each weight lies in [0, 1] and reflects the relative importance of the nearest neighbor \(X_j\) in the prediction.

Step 4: Predict the output value \(\hat{y}\) for X by taking the weighted average (with the fuzzy weights) of the outputs \(y_{j}\) for \(j=1, 2, \ldots , k\) in the \(N_X^k\) according to the following equation:

$$\begin{aligned} \hat{y} = \frac{\sum _{j=1}^k w_j y_{j} }{\sum _{j=1}^{k}w_j}. \end{aligned}$$
(6)

In this method, the Minkowski distance is used not only to find the nearest neighbors but also to compute the weights. Accordingly, the Minkowski distance plays a critical role in the proposed framework. Besides, both the number of nearest neighbors k and the parameter p of the distance function are allowed to vary during the parameter search, so the Md-FKNNreg method can adapt to the particular dataset. This characteristic allows the method to expand the search to a broader domain. The steps of the Md-FKNNreg method discussed above are summarized as pseudo-code in Algorithm 1. In addition, the pseudo-code for the grid search method used is presented in Algorithm 2.

[Algorithm 1: pseudo-code of the Md-FKNNreg method]
[Algorithm 2: pseudo-code of the grid search method]
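The four steps can also be sketched in Python. This is a minimal illustration rather than the authors' MATLAB implementation: it assumes the inverse-distance weight \(w_j = 1/d_{Md}(X, X_j)^{2/(q-1)}\), which matches the form used in Eq. (3) and the inverse weighting described in the text, and it adds a hypothetical guard `eps` for neighbors at zero distance. All names and the toy data are assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def md_fknn_regress(train_X, train_y, query, k, p, q=1.5, eps=1e-12):
    """Fuzzy-weighted k-NN regression with the Minkowski distance.

    Assumes inverse-distance weights with exponent 2 / (q - 1);
    `eps` (an illustrative choice) avoids division by zero.
    """
    # Step 1: distance from the query to every training sample.
    dists = [minkowski(query, x, p) for x in train_X]
    # Step 2: indices of the k nearest neighbors.
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Step 3: fuzzy weights; closer neighbors get larger weights.
    exponent = 2.0 / (q - 1.0)
    weights = [1.0 / max(dists[i], eps) ** exponent for i in nearest]
    # Step 4: weighted average of the neighbors' outputs (Eq. 6).
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / sum(weights)


# Toy one-dimensional data (made up) with output equal to the input.
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
midpoint = md_fknn_regress(train_X, train_y, [0.5], k=2, p=1, q=2.0)
skewed = md_fknn_regress(train_X, train_y, [0.25], k=2, p=2, q=2.0)
```

At the midpoint both neighbors are equally distant and the result equals the plain average. For the query at 0.25 the neighbor at 0 is three times closer than the neighbor at 1, so with exponent 2 it receives nine times the weight and pulls the prediction toward its output.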

In the KNNreg method, the prediction of the output for a new sample is made through a uniform weighting scheme (Cheng 1984). That is, the prediction is the simple average of the outputs of the nearest neighbor samples, and the distances between the new sample and its k nearest neighbors are not considered (i.e., all nearest neighbors have equal influence on the prediction) (Kramer 2011). In contrast, the FKNNreg uses an inverse weighting scheme that assigns higher weights to the closer training samples, allowing them more influence over the prediction. Moreover, the fuzzy strength parameter q controls how heavily the distance is weighted when determining the contribution of each nearest neighbor to the target sample (Keller et al. 1985). For example, when \(q=2\), the exponent \(2/(q-1)\) equals 2, so the contributions of the nearest neighbors are weighted by the reciprocal of their squared distances to the target sample. Regarding the distance, the adopted distance metric plays a crucial role in finding the best possible nearest neighbors and weighting them. Accordingly, the Minkowski distance is utilized in the proposed Md-FKNNreg method since it is more general (see Footnote 2) than the Euclidean and Manhattan distances. It has shown superior performance with supervised and unsupervised machine learning models compared to other distance measures (for examples, see Aggarwal et al. 2001; Ranmya and Sasikala 2019). Considering the above facts, the proposed Md-FKNNreg is expected to produce significantly better results than the KNNreg and FKNNreg methods.

Experiment and results

This section presents the descriptions of the data sets used and the empirical procedures of the experiments conducted to investigate the regression performance of the proposed Md-FKNNreg model.

Data description

For our experiment, we selected eight real-world datasets that are freely available at the UCI Machine Learning repository (Dheeru and Taniskidou 2017) and at the Knowledge Extraction based on Evolutionary Learning (KEEL) repository (Alcala-Fdez et al. 2011). As summarized in Table 1, these datasets differ in the number of instances and features. The application area of each dataset is given in the “Domain” column of Table 1.

Table 1 Summary of the datasets used in the experiment

Experimental setting

In each collected dataset, the data samples were divided into \(40\%\) for training, \(40\%\) for validation, and \(20\%\) for testing based on the works of Kumbure et al. (2019, 2020). Before data splitting, all the datasets were normalized into the unit interval [0, 1] so that features with large ranges would not dominate those with small ranges. For cross-validation, we adopted the holdout method (Arlot and Celisse 2010), in which the training and validation datasets were randomly sampled 30 times, and mean performance measures were computed from the results. The examination of the proposed method can be divided into two phases: training and validation, and testing. In the training and validation phase, we trained the model by optimizing the value of the parameter p in the Minkowski distance and the number of nearest neighbors k. Here, we used the mean regression error to determine the optimal parameter values. To find the best possible values for the parameters, we deployed a grid search technique during the training and validation. Then, we evaluated the performance of the model with the optimal parameters in the testing phase. The steps of this process are summarized by the flowchart in Fig. 1.
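The training and validation phase can be sketched as a repeated-holdout grid search. The Python code below is an illustration under stated assumptions, not the authors' procedure in detail: the test portion is assumed to have been set aside already, the remaining samples are split into equal training and validation halves on each repetition, the predictor uses inverse-distance fuzzy weights with exponent \(2/(q-1)\), and all names, the seed, and the toy data are hypothetical.

```python
import math
import random


def predict(train, query, k, p, q=1.5, eps=1e-12):
    """Fuzzy-weighted k-NN prediction (inverse-distance weights assumed)."""
    ranked = sorted(
        (sum(abs(a - b) ** p for a, b in zip(x, query)) ** (1.0 / p), y)
        for x, y in train
    )[:k]
    weights = [1.0 / max(d, eps) ** (2.0 / (q - 1.0)) for d, _ in ranked]
    return sum(w * y for w, (_, y) in zip(weights, ranked)) / sum(weights)


def rmse(train, samples, k, p):
    """Root mean square error of the predictor on `samples`."""
    squared = [(predict(train, x, k, p) - y) ** 2 for x, y in samples]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(data, p_grid, k_grid, repeats=30, seed=0):
    """Return (p, k, mean validation RMSE) with the lowest mean error."""
    rng = random.Random(seed)
    totals = {(p, k): 0.0 for p in p_grid for k in k_grid}
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, valid = shuffled[:half], shuffled[half:]
        for p, k in totals:
            totals[(p, k)] += rmse(train, valid, k, p)
    (best_p, best_k), total = min(totals.items(), key=lambda item: item[1])
    return best_p, best_k, total / repeats


# Toy two-feature data (made up), already scaled to [0, 1].
data = [([i / 4, j / 4], i / 4 + (j / 4) ** 2)
        for i in range(5) for j in range(5)]
p_grid = [1.0, 1.5, 2.0]
k_grid = [1, 2, 3, 4, 5]
best_p, best_k, best_rmse = grid_search(data, p_grid, k_grid, repeats=5)
```

Averaging the validation error over the repetitions before choosing (p, k) reduces the dependence of the selected parameters on any single random split.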

Fig. 1 A workflow of the development and evaluation of the Md-FKNNreg model

The proposed Md-FKNNreg, Man-FKNNreg, and Euc-FKNNreg models were implemented in MATLAB 2019b. The KNNreg was implemented from scratch. The SVR, LASSO, and MLR models were developed using the Statistics and Machine Learning Toolbox in MATLAB. The experiments were run on a computer with an Intel® Core™ i5 1.8 GHz processor and 16 GB of RAM, running the Microsoft Windows 10 operating system.

Baseline models

We compared the performance of the developed Md-FKNNreg method with the classical KNNreg, Man-FKNNreg, and Euc-FKNNreg methods. In addition, we selected three other commonly used regression techniques, namely SVR (Drucker et al. 1997), LASSO regression (Tibshirani 1996; Wang et al. 2018), and MLR (Montgomery et al. 2012). The basic concepts of these methods are briefly presented below.

SVR, a variant of SVM (Cortes and Vapnik 1995), is a non-linear kernel-based regression approach that performs the regression by constructing a hyperplane in a high-dimensional space. For a given test sample X, SVR develops a predictor function \(h(X)=\sum _{i=1}^{N}(\alpha _i - \alpha _i^*) K(X, X_i)+b\) by mapping training samples onto a high-dimensional feature space. Here, \(\alpha _i\) and \(\alpha _i^*\) are non-negative Lagrange multipliers, b is a bias constant, and K is the kernel function that represents the inner product of X and the training sample \(X_i\) in that feature space.

LASSO is a regularization-based (see Footnote 3) linear regression model. The model is selected to minimize the objective function \(\sum _{i=1}^{N}(w_0+\sum _{j=1}^{m}w_jx_{j}^i-y_i)^2+\lambda \sum _{j=1}^{m}\vert w_j\vert\), where \(\lambda\) is the regularization parameter that controls the trade-off between the empirical error and the penalty on the coefficients. As the \(\lambda\) value increases, LASSO shrinks more coefficients to zero (Wang et al. 2018).
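To make the role of \(\lambda\) concrete, the LASSO objective with its standard \(L_1\) penalty \(\lambda \sum _j \vert w_j\vert\) can be evaluated directly. The Python sketch below only computes the objective for given coefficients and does not perform the minimization; the function name and the numbers are made up for illustration.

```python
def lasso_objective(X, y, w0, w, lam):
    """Squared-error loss plus the L1 penalty lam * sum(|w_j|)."""
    residual = sum(
        (w0 + sum(wj * xj for wj, xj in zip(w, row)) - yi) ** 2
        for row, yi in zip(X, y)
    )
    penalty = lam * sum(abs(wj) for wj in w)
    return residual + penalty


# Toy data and coefficients (made up).
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
no_penalty = lasso_objective(X, y, w0=0.0, w=[1.0, -0.5], lam=0.0)
with_penalty = lasso_objective(X, y, w0=0.0, w=[1.0, -0.5], lam=2.0)
```

The intercept \(w_0\) is not penalized, and the penalty grows linearly with \(\lambda\) for fixed coefficients, which is what pushes the minimizer toward sparse solutions as \(\lambda\) increases.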

MLR (also referred to as ordinary least squares regression) is one of the oldest and most frequently employed techniques for analyzing the relationship between the response variable and multiple input variables. The general form of the MLR model can be given by \(h(X_i) = w_0 + \sum _{j=1}^{m}w_jx_j^i + \epsilon\), where \(h(X_i)\) is the predictor function for the sample \(X_i=\{x_1^i, x_2^i,\ldots , x_m^i\}\), \(w_0\) is the intercept, \(w_j\) is the coefficient for the variable j, and \(\epsilon\) is the error term (\(\sim N(0, \sigma ^2)\)) of the model.

Parameter settings

The detailed parameter settings for the proposed method and benchmarks are presented in this sub-section. In the Md-FKNNreg algorithm, the value for p of the Minkowski distance was selected from \(\{1, 1.5, \ldots , 5\}\). The number of nearest neighbors k was chosen from the range \(\{1, 2, \ldots , 25\}\) for all nearest neighbor regression approaches. The value of the fuzzy strength parameter q was kept constant at \(q=1.5\) according to Arif et al. (2010) for both the Md-FKNNreg and FKNNreg models.
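The size of the resulting search space can be checked with a few lines of Python. The sketch assumes a step of 0.5 for p, as implied by the first two listed values; the variable names are illustrative.

```python
from itertools import product

# Candidate values for the Minkowski parameter p: 1.0, 1.5, ..., 5.0
# (a step of 0.5 is assumed from the listed values).
p_grid = [1 + 0.5 * i for i in range(9)]
# Candidate numbers of nearest neighbors: 1, 2, ..., 25.
k_grid = list(range(1, 26))
# Every (p, k) pair evaluated by the grid search.
grid = list(product(p_grid, k_grid))
```

Under this assumption the grid search evaluates 9 values of p against 25 values of k on every holdout repetition.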

The kernel function is the most critical ingredient in the SVR model (Ali and Smith-Miles 2006) because it helps the model achieve a robust mapping from training samples to the prediction. Accordingly, we tested the performance of the SVR model using three different kernels: linear, Radial Basis Function (RBF), and polynomial, based on Ali and Smith-Miles (2006). For the LASSO model, the regularization parameter \(\lambda\) was tuned from the range \(\{0.001, 0.01, \ldots , 100\}\) by following the experiments of Saccoccio et al. (2014). Here, we attempted to balance the \(\lambda\) values because low values of \(\lambda\) favor predictor functions that achieve a small training error, while larger values tend to yield simpler prediction functions (Wang et al. 2018). With the multiple regression model, we tested four different model types from the toolbox: linear, interaction (see Footnote 4), purequadratic (see Footnote 5), and quadratic (see Footnote 6). Using these different types of models, we were able to generate more sophisticated MLR versions for the comparison, even though high-order terms of the features were not used with the nearest neighbor methods. The rest of the parameter values in the SVR, LASSO, and MLR models were set to the default values according to the toolbox specifications.

Evaluation metrics

To evaluate the prediction performance of the proposed regression method and benchmarks, we adopted two frequently applied measures: RMSE and \(R^2\). RMSE is the square root of the average squared difference between the model’s predictions and the true values. \(R^2\) is the proportion of the variation in the response variable that is “explained” by the regression model compared to the mean (Kurz-Kim and Loretan 2014). It is a statistical measure that indicates how closely the data points in the response variable fit the values of the regression model. In general, higher \(R^2\) values and smaller RMSE values reflect better performance of the regression model (Pham 2019). The formulas used for both evaluation metrics are defined as follows:

$$\begin{aligned}&RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n}(\hat{Y_i}-Y_i)^2} \end{aligned}$$
(7)
$$\begin{aligned}&R^2 = \left( 1- \frac{\sum _{i=1}^{n}(Y_i-\hat{Y_i})^2}{\sum _{i=1}^{n}(Y_i-\bar{Y})^2}\right) \times 100\% \end{aligned}$$
(8)

where n is the number of samples in the test data, \(\hat{Y_i}\) indicates the predicted value, \(Y_i\) indicates the true value of the ith test sample, and \(\bar{Y}\) is the average of the true values. As shown in Eq. (8), the percentage values of \(R^2\) are considered.
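Both metrics in Eqs. (7) and (8) are straightforward to compute. The following Python sketch uses hypothetical function names and made-up values for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error (Eq. 7)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared_percent(y_true, y_pred):
    """Coefficient of determination expressed as a percentage (Eq. 8)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return (1.0 - ss_res / ss_tot) * 100.0


# Toy values (made up): only the last prediction is off, by 2 units.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
error = rmse(y_true, y_pred)
r2 = r_squared_percent(y_true, y_pred)
```

Note that \(R^2\) is undefined when all true values are equal (the denominator is zero) and becomes negative when the model predicts worse than the mean of the true values.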

When it is necessary to apply several models to a particular problem and pick the best one, the usual method is to use several evaluation metrics to measure the models’ performance and select the best model with the highest performance. However, when trying to prove that one model outperforms another for a particular problem, we must use a statistical test of significance and validate the claim of improved performance (Borovicka et al. 2012).

Following Chen et al. (2011), we adopted the paired t test, one of the most commonly used statistical tests in machine learning. This analysis tested the null hypothesis that there is no significant difference between two mean RMSE values at the 0.05 significance level. Here we considered the error samples from the holdout method (size of \(30\times 1\)) for each regression model when the optimal parameter values were employed. The standard deviations were also computed.
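A paired t test on two sets of 30 holdout errors can be sketched in Python as follows. The error values are synthetic and serve only to show the computation; the comparison uses the standard two-sided critical value of about 2.045 for 29 degrees of freedom at the 0.05 level instead of an exact p-value. Function and variable names are assumptions.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Synthetic RMSE samples from 30 holdout repetitions (made up).
errors_a = [0.10 + 0.001 * (i % 5) for i in range(30)]
errors_b = [a + 0.02 + 0.005 * ((i % 3) - 1) for i, a in enumerate(errors_a)]

t_stat, dof = paired_t_statistic(errors_a, errors_b)
T_CRITICAL_29 = 2.045  # two-sided, alpha = 0.05, 29 degrees of freedom
reject_null = abs(t_stat) > T_CRITICAL_29
```

The test is paired because both models are evaluated on the same 30 random splits, so the per-split differences are analyzed rather than the two error samples independently. The statistic is undefined if all differences are identical, since their standard deviation is then zero.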

Results and discussion

In this subsection, we first present the results of the proposed method compared with the baseline methods from the training and validation step. After that, optimal parameter values observed for each model are discussed. Then, the performances of the fitted models for the training and validation data are evaluated with the test datasets.

Results with the validation data samples

Table 2 summarizes the results of all the methods for each of the datasets from the training and validation step. In the table, we report the minimum mean RMSE and the corresponding standard deviation (STD) as the performance measures. Note that these mean RMSEs and standard deviations are the result of the holdout method with 30 repetitions.

Table 2 The results obtained for all methods in the training & validation step

From Table 2, it is apparent that the proposed Md-FKNNreg method outperformed all benchmarks for six datasets and had the second-best performance for the rest of the datasets. Also, the standard deviations of the proposed method were reasonable for all cases. In particular, the Md-FKNNreg method achieved significantly improved performance compared with the Euc-FKNNreg method even though the results of the two methods were comparable in some cases, for example, in the cases of Servo and Stock. Additionally, the Md-FKNNreg and Man-FKNNreg models produced the same results for six datasets. Moreover, the KNNreg and SVR models achieved the lowest mean RMSE results in the cases of Servo and Laser, respectively. Finally, neither the LASSO nor the MLR model offered competitive results compared to the other models.

Figure 2 illustrates the \(R^2\) values of the optimized regression models from the training and validation step for each dataset. Note that the \(R^2\) values shown by the bar heights in the graphs refer to the maximum of the mean \(R^2\) values obtained from the holdout cross-validation. In each bar graph, the blue bar represents the highest \(R^2\), while the other bars indicate the rest of the values. From Fig. 2, one can observe similar indications about the regression performance of the proposed Md-FKNNreg method and benchmarks as from Table 2.

Fig. 2 Observed \(R^2\) values of each model for each dataset

Evaluation of the optimal parameter values

During the training and validation, we evaluated the regression results by tuning the parameters of the Md-FKNNreg and benchmark models. Figure 3 displays the impact of different combinations of the parameters k and p in the Md-FKNNreg method on the RMSE (and \(R^2\)) with the Stock dataset, showing how the performance of the Md-FKNNreg method depends on these parameters for this dataset.

Fig. 3 The averages of the RMSE and \(R^2\) values for the Md-FKNNreg method for different parameter combinations (p, k) in the training and validation step with the Stock dataset

Corresponding to the results in Table 2, the optimal parameter values observed for the proposed Md-FKNNreg and benchmark models are presented in Table 3. The table shows that \(p=1\) (i.e., the Manhattan distance) obtained the best performance with the Md-FKNNreg method for the majority of the datasets. We also applied the Manhattan distance-based FKNNreg and received the same results as for the Md-FKNNreg method with \(p=1\). This finding is consistent with the implications of Aggarwal et al. (2001), who showed that the Manhattan distance (\(L_1\) norm) is the best option for high-dimensional applications. Additionally, it seems that relatively low k values (varying from 1 to 13) are better suited to the original KNNreg method, while high k values (ranging from 2 to 25) work better for its fuzzy versions. Moreover, the RBF kernel appears to be the most suitable for the SVR model because it achieved high performance for all datasets except for the Baseball and Servo data. For the LASSO model, the lowest value, \(\lambda =0.001\), was optimal in most cases, which we believe means that the fitted models were close to the least-squares estimates. The optimal type of MLR model varied depending on the particular dataset. Taking the parameters and other results together, it is evident that the fuzzy variants have more potential for linear and non-linear problems. For instance, the case with the Baseball data could be considered a linear problem since both SVR and MLR selected linear settings, but high performance was achieved with the Md-FKNNreg, Man-FKNNreg, and Euc-FKNNreg methods.

Table 3 Optimal parameter values of each model for each dataset

Results with the test data samples

This subsection presents the regression results of the Md-FKNNreg and baseline models with the testing data samples that were initially split from the original datasets. In this testing phase, we used the optimized parameter values and training data samples stored from the cross-validation step to evaluate the models’ performances with the previously unseen test data samples.

Table 4 summarizes the mean RMSEs and standard deviations (STD) of the proposed Md-FKNNreg and benchmark models for the selected datasets. In addition, the average computational time (Com. time, in seconds) of each method in the testing phase is reported.

Table 4 The results obtained for all methods in the testing step

The results in Table 4 show that the Md-FKNNreg method outperformed all benchmarks for all datasets, supporting the effectiveness of the proposed Md-FKNNreg method for regression problems. Compared with the Man-FKNNreg results, the Md-FKNNreg model achieved somewhat higher accuracy for the Airfoil and Qsar Fish datasets. Additionally, the Md-FKNNreg showed significantly better performance than the Euc-FKNNreg method across almost all datasets. This indicates that tuning the Minkowski distance in the learning part can lead to more reasonable nearest neighbors, and hence to better performance than fixing the metric to the Manhattan or Euclidean distance. Considering the KNNreg and SVR models, even though these achieved the lowest errors in some cases during training and validation, the Md-FKNNreg method outperformed them in all testing cases. This suggests that the proposed Md-FKNNreg model is less prone to over-fitting than both the KNNreg and SVR models. Moreover, as with the validation data, the lowest regression performance on the testing data occurred for the LASSO and MLR models.

Based on the testing times, the computational cost of the FKNNreg methods was relatively high compared with the other methods used. This might be because of the inclusion of the weight generation process based on the inverse of the distances and the fuzzy strength parameter. Additionally, the proposed Md-FKNNreg method required more time (in seconds) than the Euc-FKNNreg and Man-FKNNreg methods to deliver the results since it includes an additional computation with the parameter p.
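
The weight generation step mentioned above can be illustrated with a short sketch. It assumes the Keller-style inverse-distance weighting \(d^{-2/(m-1)}\) carried over to real-valued targets, with m as the fuzzy strength parameter; the function names, the default values, and the `eps` guard against zero distances are assumptions for illustration, not the authors' exact formulation.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def fuzzy_knn_predict(train_X, train_y, query, k=3, p=1.0, m=2.0, eps=1e-12):
    """Weighted mean of the k nearest targets with weights d^(-2/(m-1)).

    eps (an assumption) bounds distances away from zero so that an exact
    match dominates the prediction instead of causing a division by zero.
    """
    if m <= 1:
        raise ValueError("fuzzy strength m must be greater than 1")
    # One distance evaluation per training sample: this is the part whose
    # cost grows when a general order p replaces a fixed metric.
    pairs = [(minkowski(x, query, p), t) for x, t in zip(train_X, train_y)]
    pairs.sort(key=lambda pair: pair[0])
    nearest = pairs[:k]
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / (max(d, eps) ** exponent) for d, _ in nearest]
    weighted_sum = sum(w * t for w, (_, t) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Toy one-dimensional data (made up for illustration).
train_X = [[0.0], [1.0], [3.0]]
train_y = [0.0, 10.0, 30.0]
print(fuzzy_knn_predict(train_X, train_y, [1.5], k=2, p=1.0, m=2.0))
```

In this sketch, values of m close to 1 concentrate the weight on the closest neighbor, while large m makes the weights nearly uniform, so the prediction approaches a plain neighbor average.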

To validate the test results statistically, a paired t test was applied, and the resulting P values and test statistics are presented in Table 5. The t test indicates whether there is a statistically significant difference between the mean RMSEs of the compared regression methods. From the evidence in the table, the Md-FKNNreg method yielded a statistically significantly lower error than the benchmarks in most cases. The exceptions were the Servo dataset (compared with the Euc-FKNNreg and KNNreg methods) and the Laser dataset (compared with the Euc-FKNNreg and SVR methods), for which the differences were not statistically significant. It should also be noted that we did not find a statistically significant difference between the Md-FKNNreg and Man-FKNNreg results for any dataset: the test results of the two methods showed no difference for six datasets, and for the Airfoil and Qsar Fish cases the t test produced no evidence of a significant difference.

Table 5 Paired t test results on the performance of the Md-FKNNreg method vs. five benchmarks for the test datasets
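
As an illustration of the paired t test used for this comparison, the sketch below computes the test statistic from per-run RMSE values of two methods. The function name and the RMSE values are hypothetical and are not taken from the tables; the statistic is the standard \(t = \bar{d} / (s_d / \sqrt{n})\) on the paired differences.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for two paired samples of errors."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-run RMSEs for two methods on the same test splits.
rmse_method_a = [0.071, 0.080, 0.075, 0.078, 0.069]
rmse_method_b = [0.079, 0.085, 0.074, 0.088, 0.076]
t_stat, dof = paired_t_statistic(rmse_method_a, rmse_method_b)
print(t_stat, dof)
```

A negative statistic here means the first method has the lower mean error. The P value follows from the t distribution with the returned degrees of freedom; a library routine such as `scipy.stats.ttest_rel` reports both the statistic and the P value directly.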

Conclusions

This paper proposed a new generalized regression model based on the FKNN rule and investigated its effectiveness on different regression problems from various domains. The Minkowski distance metric was introduced into the nearest neighbor search in the proposed algorithm to examine how it improves accuracy. Accordingly, the proposed method was named the Md-FKNNreg model. The effectiveness of the Md-FKNNreg method was evaluated in comparative experiments with the standard nearest neighbor method and three well-known regression methods, namely SVR, LASSO, and MLR. In addition, the Man-FKNNreg and Euc-FKNNreg methods were implemented, and their results were compared. For the experiments, we used eight real-world datasets that are freely available from machine learning repositories. The regression performance of each model for those datasets was discussed in terms of the RMSE, \(R^2\), and standard deviation. The results showed that the Md-FKNNreg method outperformed the benchmarks and is a suitable choice for regression problems. In our experiments, Md-FKNNreg gave the lowest overall average RMSE of 0.0769.

The results of the experiments showed that the proposed Md-FKNNreg method achieved statistically significantly higher performance than the other methods in almost all cases in terms of the RMSE means. Additionally, we found that the Minkowski distance with \(p=1\) yielded the optimal Md-FKNNreg model, which then achieved the best performance for the majority of the datasets. In other words, the Man-FKNNreg showed promising results for regression in general, supporting the indications in the study by Aggarwal et al. (2001).

However, it should be noted that the computational cost of the proposed method is relatively high compared with the Euc-FKNNreg and KNNreg methods because an additional calculation with the Minkowski distance is included in the learning algorithm. Despite that, this research suggests several directions for further investigation. For example, it would be interesting to see how the Md-FKNNreg method adapts and performs in regression applications in which KNNreg was previously utilized (e.g., Hu et al. 2014; Cai et al. 2020; Yao and Ruzzo 2006; Huang and Perry 2016; Zhou et al. 2020; Durbin et al. 2021; Dell’Acqua et al. 2015). Furthermore, future research may also test the effect of combining Md-FKNNreg with other efficient methods, such as SVM (Chen and Hao 2017) and ANN (Salari et al. 2015), in a hybrid framework.