1 Introduction

Short-term electrical load forecasting forms a crucial foundation for the operation and planning of power systems, playing a significant role in ensuring their safety, stability, and economical performance [1,2,3]. With the deepening reform of electricity marketization, improving the accuracy and reliability of day-ahead electrical load forecasting is becoming increasingly significant for the decision-making and trading of electricity market participants. Achieving high precision in day-ahead electrical load forecasting is essential to address the complexities of the non-linear relationships influenced by multiple factors, such as meteorological, economic, societal, and policy variables [4,5,6].

To tackle this challenge, researchers have developed various forecasting technologies, which can be broadly categorized into three main types: classical methods, traditional methods, and intelligent methods. Each has its own advantages and applications. Classical forecasting methods, such as regression analysis and time series methods, rely on mathematical models to fit and infer from historical data. They typically assume that the data exhibit statistical characteristics such as stationarity and linearity, and therefore may encounter difficulties when dealing with non-linear or non-stationary load data [7, 8]. Traditional forecasting methods, including load derivation, similar day, and exponential smoothing, mainly process load data based on empirical rules or statistical principles. These methods depend on the periodicity and regularity of the data and may not be well suited to variable trends and abrupt changes in load [9]. Intelligent forecasting methods, such as the back-propagation (BP) neural network, Elman neural network, radial basis function (RBF) neural network, support vector regression (SVR), multilayer perceptron (MLP), extreme learning machine (ELM), and long short-term memory (LSTM) network, use artificial intelligence techniques to learn and reason from load data, showing powerful non-linear fitting capabilities and adaptability. However, they require a large amount of training data and parameter tuning, and the model structure and internal mechanisms may lack transparency [10,11,12,13,14,15].

As an intelligent forecasting method grounded in machine learning, the ELM operates as an effective single hidden layer feedforward neural network (SLFN). The ELM offers notable benefits, such as rapid learning, robust generalization capabilities, and a minimal number of parameters [16, 17]. However, the ELM also has certain limitations, including model instability derived from the randomness of initial input weights and biases, and a lack of flexibility caused by employing a single kernel function. To mitigate these challenges, researchers have extended the ELM into the multi-kernel extreme learning machine (MKELM) and proposed various enhancements to it, including the optimization of kernel function parameters and weights, with significant applications in various fields. For instance, Xie and Wu introduced an innovative approach using a particle swarm optimization (PSO) algorithm [18]. Their research focused on optimizing each kernel function’s parameters and weights, and proved significant for the maximum power point tracking algorithm of a photovoltaic (PV) system based on irradiance estimation. In another scholarly contribution, Kongsorot et al. leveraged the principles of fuzzy set theory to devise a kernel extreme learning machine that can tackle multi-label classification problems effectively [19]. This innovation in MKELM has offered new possibilities for handling complex data categorization tasks. Furthermore, Ahuja and Vishwakarma adopted a deterministic approach to MKELM, incorporating fuzzy feature extraction for pattern classification [20]. Notably, their efforts aimed at resolving face recognition problems, demonstrating the advanced applications of MKELM methods. In addition to these advancements, recent studies have explored the integration of group acceptance sampling techniques to enhance the accuracy of prediction models, particularly in the context of quality control and reliability [21, 22]. These approaches underscore the potential of group acceptance sampling plans to refine prediction accuracy and reliability, providing valuable insights into their possible application in short-term electrical load forecasting.

Despite these advancements, however, the application of MKELM to load forecasting remains limited. Existing methods continue to exhibit notable shortcomings, including an overreliance on feature selection methods, arbitrary choices of kernel function types, and low-efficiency parameter optimization. This paper therefore further investigates load feature extraction methods and an improved differential evolution algorithm (DEA) for parameter optimization. By pursuing these improvements, we aim to enhance the accuracy of load forecasting with MKELM.

As discussed above, in addition to choosing the right prediction model, selecting appropriate features is just as critical to improving the accuracy of day-ahead electricity load forecasting. The aim of feature selection is to identify, from a multitude of original features, those that have a significant impact on the prediction target. This process not only reduces the dimensionality and complexity of the model, but also enhances the model’s generalization and interpretability. Feature selection methods mainly fall into three categories: filter methods, wrapper methods, and embedded methods. Filter methods focus on the statistical characteristics of features, evaluating the correlation between features and the prediction target for selection. Examples include correlation analysis (CA) [23], fuzzy rough sets (FRS) [24], and the mutual information (MI) method [25, 26]. The CA method identifies features that are highly correlated with the target variable by measuring the linear correlation between each feature and the target. The FRS method generates fuzzy information granules based on different kernel functions and rates a feature’s contribution to the prediction target using fuzzy dependence. As a non-linear method, MI can capture various relationships between variables, including non-linear ones. Wrapper methods, on the other hand, select subsets of features based on the performance of the prediction models, typically involving iterative searches to optimize the model’s forecasting results. For example, the Relief algorithm selects features by evaluating their ability to differentiate between nearby samples [27]. Embedded methods meld the advantages of filter and wrapper methods, automatically performing feature selection during model training. For instance, the multivariate feature selection criterion (MFSC) [28] evaluates the importance of features during training and selects the most contributive features.

Despite the strengths of these methods, they are not without flaws. Filter methods, while simple to compute, may overlook interactions between features and are prone to retaining redundant features, which affects prediction outcomes. Wrapper methods take the dependence among features into account but are computationally complex and susceptible to overfitting. Embedded methods can combine feature selection with model optimization but rely on a specific model, making the interpretation of feature importance a challenge. To address these issues, this paper introduces a two-stage feature selection method that combines FRS theory and a CA method, referred to as the FRSCA method. Initially, FRS is used to handle the uncertainty and fuzziness of the data by generating fuzzy information granules and evaluating the contribution of each feature to the prediction target. CA is then employed to assess the correlation among the initially screened features, thereby eliminating redundant features. Compared with traditional filter methods, the proposed FRSCA method better captures non-linear relationships and interactions among features; compared with wrapper methods, it reduces computational complexity and the risk of overfitting; and compared with embedded methods, it improves the interpretability and universality of feature selection [26, 29].

Building upon the analysis presented, this paper introduces a robust short-term electrical load forecasting framework that harnesses the strengths of FRS theory and MKELM. The main work and innovations of this paper are as follows: (1) A two-stage feature selection method based on fuzzy rough sets is proposed. FRS can produce different fuzzy information granules according to different kernel functions, and evaluate the contribution of each feature to the prediction target through the fuzzy dependency degree. In addition, redundant features are further eliminated by the correlation coefficient method. (2) A MKELM prediction model optimized by a differential evolution algorithm is introduced. This method can train different kernel functions based on different feature subsets, and the parameters and weights of each kernel function are optimized by the improved differential evolution algorithm. (3) Experimental verification has been conducted on actual electricity load data. The results demonstrate that the method proposed in this paper has high accuracy and reliability in day-ahead electricity load forecasting.

The structure of the rest of the paper is organized as follows: Sect. 2 introduces the theory of FRS and its application in load feature selection. Section 3 elaborates on the principles of MKELM, as well as the methods for optimizing kernel function parameters and weights. Section 4 presents the experiment design and result analysis. Finally, Sect. 5 summarizes the main conclusions of the paper and offers future research directions.

2 Principle of Load Feature Selection Based on Fuzzy Rough Set Theory

Feature selection is a critical step in data mining and machine learning. Its purpose is to eliminate redundant and irrelevant features, reduce data dimensionality, and enhance learning efficiency and generalization ability while maintaining or improving learning accuracy. This paper employs a filter method based on FRS theory for load feature selection. FRS theory provides a feature selection approach that relies on fuzzy similarity relations and variable precision approximation; it can effectively handle the uncertainty and fuzziness in the data while offering excellent interpretability.

2.1 Basic Concepts of Fuzzy Rough Set Theory

The basic definition of FRS theory is as follows [30, 31]:

Definition 1

(Fuzzy equivalence relation): Given a non-empty finite set U, for any two elements x, y ∈ U, if there exists a real number μR(x, y) ∈ [0,1], indicating the degree of similarity between x and y under relation R, and if this satisfies the following conditions:

$$ \left\{ {\begin{array}{*{20}c} {\mu_{R} \left( {x,x} \right) = 1} \\ {\mu_{R} \left( {x,y} \right) = \mu_{R} \left( {y,x} \right)} \\ {\mu_{R} \left( {x,z} \right) \ge min\left\{ {\mu_{R} \left( {x,y} \right),\mu_{R} \left( {y,z} \right)} \right\}} \\ \end{array} } \right.. $$
(1)

Then, R is termed as a fuzzy equivalence relation on U, and μR(x, y) is referred to as the similarity between x and y under relation R.

Definition 2

(Fuzzy equivalence class): Given a non-empty finite data set U and a fuzzy equivalence relation R, for any element x ∈ U, define

$$ \left[ x \right]_{R} = \left\{ {y \in U{|}\mu_{R} \left( {x,y} \right) > 0} \right\} $$
(2)

as the fuzzy equivalence class represented by x, and μR(x, y) is referred to as the membership degree of the element y to the fuzzy equivalence class [x]R.

Definition 3

(Fuzzy upper and lower approximations): Given a non-empty finite data set U, a fuzzy equivalence relation R, and a subset X ⊆ U, define

$$ \underline {R} \left( X \right) = \left\{ {x \in U{|}\mathop {\sup }\limits_{y \in X} \mu_{R} \left( {x,y} \right) = 1} \right\}, $$
(3)
$$ \overline{R}\left( X \right) = \left\{ {x \in U{|}\mathop {\inf }\limits_{y \in X} \mu_{R} \left( {x,y} \right) > 0} \right\} $$
(4)

as the fuzzy lower and upper approximations of X under the relation R, respectively.

Definition 4

(Fuzzy dependency): Given a non-empty finite data set U, a fuzzy equivalence relation R, and a decision attribute D, define

$$ \gamma_{R} \left( D \right) = \frac{{\mathop \sum \nolimits_{x \in U} \mathop {\sup }\limits_{y \in D\left( x \right)} \mu_{R} \left( {x,y} \right)}}{\left| U \right|} $$
(5)

as the fuzzy dependency of the decision attribute D on the relation R, where D(x) represents the set of elements with the same decision value as x, that is

$$ D\left( x \right) = \{ y \in U|D\left( y \right) = D\left( x \right)\} . $$
(6)

The fuzzy dependency reflects the contribution of an attribute or a subset of attributes to the decision attribute. The larger its value, the more important the attribute or subset of attributes is.

2.2 Feature Selection for Load Using FRS

In load forecasting problems, it is common practice to extract useful features such as periodic features, short-term trend features, temperature, daylight, and holidays as input for the prediction model. However, not all features contribute to the forecasting objective; some may be redundant or irrelevant, increasing the complexity of the model and reducing its generalization and interpretability. This paper utilizes a fuzzy rough set to implement the following steps for load feature extraction (a brief code sketch is given after the list):

(1) Suppose U = {u1, u2, …, um} is a sample space comprising m samples; C = {c1, c2, …, cn} is a candidate feature set encompassing n candidate features; and D is a decision attribute representing the prediction objective. Based on the candidate feature set C and decision attribute D, an m × (n + 1) decision table, \(T = \left( {U,C \cup \left\{ D \right\}} \right)\), can be constructed, wherein each row represents a sample and each column corresponds to an attribute.

(2) For each candidate feature \(c_{i} \in C\), select a suitable kernel function \(k_{i} \left( { \cdot , \cdot } \right)\), and compute the fuzzy relation \(\mu_{R,i} :U \times U \to \left[ {0,1} \right]\), over the sample space U, defined as follows:

$$ \mu_{R,i} \left( {x,y} \right) = k_{i} \left( {c_{i} \left( x \right),c_{i} \left( y \right)} \right), $$
(7)

where ci(x) denotes the value of the sample x in feature \(c_{i}\). Different kernel functions can reflect different levels of similarity. This paper uses the most common Gaussian kernel [32].

(3) Based on the fuzzy relation μR,i, the fuzzy dependence degree (FDD) of each candidate feature ci on the decision attribute D can be calculated, as follows:

$$ fdep\left( {c_{i} ,D} \right) = \frac{1}{m}\mathop \sum \limits_{x \in U} \mathop {\sup }\limits_{y \in D\left( x \right)} \left\{ {1 - \mu_{R,i} \left( {x,y} \right)} \right\}, $$
(8)

where D(x) denotes the collection of elements with the same decision value as x. The FDD reflects the degree of contribution of a candidate feature to the prediction objective; the higher it is, the more crucial the candidate feature.

(4) According to the FDD fdep(ci, D), sort all candidate features and set a threshold \(TH_{1} \in \left[ {0, 1} \right]\). Select the candidate features with an fdep value greater than or equal to TH1 as the feature subset for the next stage. The threshold TH1 can be adjusted according to the needs of practical problems; generally, the greater the threshold, the smaller the feature subset. Alternatively, one can simply select the top N features after sorting by fdep value. To facilitate identical experimental conditions across different datasets, the second approach is adopted in this paper.
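The first stage can be implemented in a few lines. The following is a minimal Python sketch (not the authors' code) of Eqs. (7)–(8), assuming a Gaussian kernel with width sigma and assuming that the continuous, normalized load target is discretized into bins so that the decision classes D(x) are well defined; the function names and parameters are illustrative only.

```python
import numpy as np

def gaussian_fuzzy_relation(feature, sigma=0.1):
    """Eq. (7): mu_{R,i}(x, y) = k_i(c_i(x), c_i(y)) with a Gaussian kernel."""
    diff = feature[:, None] - feature[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))

def fuzzy_dependency(feature, target_class, sigma=0.1):
    """Eq. (8): fdep(c_i, D) = (1/m) * sum_x sup_{y in D(x)} {1 - mu_{R,i}(x, y)}."""
    mu = gaussian_fuzzy_relation(feature, sigma)
    m = len(feature)
    total = 0.0
    for x in range(m):
        same_class = np.where(target_class == target_class[x])[0]   # D(x)
        total += np.max(1.0 - mu[x, same_class])
    return total / m

def rank_features(X, y, n_bins=10, top_n=20, sigma=0.1):
    """Step (4): rank the candidate features by fdep and keep the top_n."""
    # Discretize the normalized target so that D(x) can be formed
    # (an assumption of this sketch; the paper does not detail this step).
    target_class = np.digitize(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    scores = np.array([fuzzy_dependency(X[:, j], target_class, sigma)
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]
```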

2.3 Redundant Feature Elimination

It is observed in the experiments that the feature subset selected by the fuzzy rough set, either through the threshold TH1 or based on the top N features sorted by fdep value, might contain some redundant features. Redundant features not only fail to improve the accuracy of the prediction model but also increase the model’s training time. Therefore, this paper introduces the correlation coefficient method to further eliminate redundant features. In the training sample set, the correlation coefficient between the ith feature vector ci and the jth feature vector cj can be obtained from the following equation:

$$ \rho \left( {c_{i} ,c_{j} } \right) = \frac{{{\text{cov}} \left( {c_{i} ,c_{j} } \right)}}{{\sigma_{i} \cdot \sigma_{j} }} \left( {i,j = 1,2, \ldots ,n} \right), $$
(9)

where \(cov\left( {c_{i} ,c_{j} } \right)\) is the covariance between ci and cj, and σi and σj are the standard deviations of ci and cj, respectively. A value of \(\rho \left( {c_{i} ,c_{j} } \right)\) equal to 1 or −1 indicates a complete positive or negative linear correlation between ci and cj, respectively; a value of 0 indicates no correlation; and values in (−1, 0) or (0, 1) represent partial correlation. In other words, the higher |\({\uprho }\left( {c_{i} ,c_{j} } \right)\)|, the stronger the correlation between ci and cj and the more similar the information they contain. Therefore, a threshold TH2 can be set; if the correlation coefficient between two feature vectors exceeds TH2, the two features are considered similar, and the one with the smaller fdep value is eliminated as redundant.

Based on the above analysis, the steps of the two-stage feature selection method adopted in this paper are as follows (a code sketch of the second stage is given after the list):

(1) First, calculate the FDD of all candidate features using formula (8), sort the FDDs from highest to lowest, and select the top N features with the highest fdep value as the candidate feature subset.

(2) Sort the candidate feature subset by fdep value from highest to lowest. The feature with the highest value is selected as the first input feature. Then, calculate the correlation coefficients between this feature and the features ranked after it. If the correlation coefficient between any of those features and this feature exceeds the threshold TH2, that feature is considered redundant and is eliminated without further comparison with other features.

(3) Next, calculate the correlation coefficients between the feature with the second-highest fdep value (if that feature has been eliminated as redundant, move to the next one) and the features ranked after it, and eliminate redundant features according to TH2. Continue this cycle until the feature with the smallest fdep value in the candidate subset has been processed. Finally, the remaining features in the candidate feature subset are used as the final input features for the prediction model.
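The following is a minimal sketch of this second stage, assuming `order` is the fdep-sorted index list produced by the first-stage ranking above and `X` holds the normalized candidate features column-wise; the names are illustrative only, and TH2 = 0.94 matches the value used in Sect. 4.3.

```python
import numpy as np

def eliminate_redundant(X, order, th2=0.94):
    """Drop any later-ranked feature whose absolute correlation coefficient
    (Eq. (9)) with an already kept feature exceeds the threshold TH2."""
    kept = []
    for idx in order:                      # order is sorted by fdep, highest first
        redundant = False
        for k in kept:
            rho = np.corrcoef(X[:, idx], X[:, k])[0, 1]
            if abs(rho) > th2:             # similar information content
                redundant = True
                break
        if not redundant:
            kept.append(idx)
    return kept                            # final input features for the model
```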

3 Multi-kernel Extreme Learning Machine for Load Forecasting

3.1 Basic Concepts of ELM

The ELM is a single hidden layer feedforward neural network characterized by its rapid learning and good generalization capabilities. The main idea of ELM is to randomly generate weights and biases from the input layer to the hidden layer, and then analytically determine the weights of the output layer, thus avoiding the complexity and limitations of the gradient-based back-propagation algorithm commonly used in traditional neural networks.

Given a training set \(\{ (x_{i} ,t_{i} )\}_{i = 1}^{N}\), where xi ∈ Rn is the input vector, ti ∈ Rm is the target vector, N is the number of samples, n is the input dimension, and m is the output dimension, the ELM can be expressed in the following form:

$$ y = f\left( x \right) = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g\left( {w_{i} \cdot x + b_{i} } \right), $$
(10)

where y ∈ Rm is the output vector, f(x) is the prediction function, L is the number of hidden layer nodes, βi ∈ Rm is the weight from the ith hidden layer node to the output layer, g(·) is the activation function of the hidden layer nodes, wi ∈ Rn is the input weight vector of the ith hidden layer node, and bi ∈ R is the bias of the ith hidden layer node. Written in matrix form, the equation becomes:

$$ H\beta = T, $$
(11)

where T = [t1,t2,…,tN]T ∈ RN×m is the target matrix; β = [β1,β2,…,βL]T ∈ RL×m is the weight matrix from the hidden layer to the output layer; H ∈ RN×L is the hidden layer output matrix, which can be represented as:

$$ H = \left[ {\begin{array}{*{20}c} {g\left( {w_{1} \cdot x_{1} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{1} + b_{L} } \right)} \\ \vdots & \ddots & \vdots \\ {g\left( {w_{1} \cdot x_{N} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{N} + b_{L} } \right)} \\ \end{array} } \right]. $$
(12)

The goal of the ELM is to solve for β to minimize the output error, which is expressed as:

$$ \mathop {\min }\limits_{\beta } \left| {\left| {T - H\beta } \right|} \right|^{2} , $$
(13)

where \(\left| {\left| {\; \cdot \;} \right|} \right|\) denotes the norm. When H is a full-rank matrix, the equation has a unique solution:

$$ \beta = H^{\dag } T, $$
(14)

where \(H^{\dag }\) is the Moore–Penrose pseudo-inverse of H. When H is not a full-rank matrix, regularization methods such as ridge regression can be used to solve for β [33]:

$$ \beta = \left( {H^{T} H + \lambda I} \right)^{ - 1} H^{T} T, $$
(15)

where λ is the regularization parameter, and I is the identity matrix.
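As a concrete illustration of Eqs. (10)–(15), the following is a minimal Python sketch of a basic ELM (the paper's experiments were run in MATLAB; this sketch, with a sigmoid activation and the ridge-regularized solution of Eq. (15), is given for illustration only and is not the authors' implementation).

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, lam=1e-3, rng=None):
        self.L, self.lam = n_hidden, lam
        self.rng = rng or np.random.default_rng(0)

    def _hidden(self, X):
        # g(w_i . x + b_i) with a sigmoid activation
        return 1.0 / (1.0 + np.exp(-(X @ self.W.T + self.b)))

    def fit(self, X, T):
        n = X.shape[1]
        self.W = self.rng.uniform(-1.0, 1.0, size=(self.L, n))   # random input weights
        self.b = self.rng.uniform(-1.0, 1.0, size=self.L)        # random biases
        H = self._hidden(X)                                       # Eq. (12)
        # Eq. (15): beta = (H^T H + lam I)^{-1} H^T T
        self.beta = np.linalg.solve(H.T @ H + self.lam * np.eye(self.L), H.T @ T)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta                        # Eq. (10)
```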

3.2 The Principle of MKELM

The MKELM is an extension of ELM based on multi-kernel learning, which utilizes multiple different kernel functions to enhance the performance and flexibility of ELM. The main idea of MKELM is to map input data into a high-dimensional feature space and then perform linear regression or classification in that space. The advantage of MKELM is that it avoids explicitly constructing the feature mapping function and instead implicitly calculates the inner product in the feature space through kernel functions.

Given a training set \(\{ (x_{i} ,t_{i} )\}_{i = 1}^{N}\) and assuming that the MKELM uses K different kernel functions {g1(·,·), g2(·,·), …, gK(·,·)}, the MKELM can be expressed as:

$$ y = f\left( x \right) = \mathop \sum \limits_{i = 1}^{K} d_{i} \mathop \sum \limits_{j = 1}^{N} \beta_{ij} g_{i} \left( {x,x_{j} } \right), $$
(16)

subject to constraints:

$$ \mathop \sum \limits_{i = 1}^{K} d_{i} = 1, d_{i} > 0, i = 1,2, \ldots ,K, $$
(17)

where di ∈ [0, 1] represents the weight coefficient of the ith kernel function; βij ∈ Rm is the weight vector from the jth sample to the output layer under the ith kernel function; and gi(·,·) is the ith kernel function. The MKELM leverages a variety of kernel functions, including Gaussian, linear, polynomial, and Sigmoid, to enhance the model’s capability [18, 20]. These kernel functions are also utilized in this study to construct the MKELM model. Written in matrix form, the equation becomes:

$$ \mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} = T, $$
(18)

where Hi ∈ RN×N is the hidden layer output matrix under the ith kernel function, βi = [βi1, βi2,…,βiN]T ∈ RN×m is the weight matrix under the ith kernel function, and T is the target matrix. The objective of MKELM is to solve the following optimization problem:

$$ \mathop {\min }\limits_{{\left\{ {\beta_{i} ,d_{i} } \right\}}} ||T - \mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} ||^{2} + \lambda \mathop \sum \limits_{i = 1}^{K} d_{i} ||\beta_{i} ||^{2} , $$
(19)

where λ is the regularization parameter. The optimization problem of MKELM can be divided into two sub-problems: one is to solve for βi with fixed di; the other is to solve for di with fixed βi. These two sub-problems can be iteratively alternated until convergence. For the first sub-problem, with fixed di, the optimization problem of MKELM can be simplified to:

$$ \mathop {\min }\limits_{{\beta_{i} }} ||\mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} - T||^{2} + \lambda \mathop \sum \limits_{i = 1}^{K} d_{i}^{2} ||\beta_{i} ||^{2} . $$
(20)

Using block matrix methods and the properties of matrix multiplication, this reduces to a standard ELM problem whose analytical solution is:

$$ \beta_{i} = \left( {H_{i}^{T} H_{i} + \lambda d_{i} I} \right)^{ - 1} H_{i}^{T} T. $$
(21)

To optimize the weights and parameters di and βi of each kernel function, one can fix βi and transform the optimization problem of MKELM into a non-linear programming problem as follows:

$$ \mathop {\min }\limits_{{d_{i} ,\theta_{i} }} \left( ||f\left( {x,d_{i} ,\theta_{i} } \right) - T||^{2} + \lambda ||d||^{2} \right), $$
(22)

where \(f\left( {x,d_{i} ,\theta_{i} } \right)\) is the prediction function of MKELM, which is defined as follows:

$$ f\left( {x,d_{i} ,\theta_{i} } \right) = \mathop \sum \limits_{i = 1}^{K} d_{i} H\left( {x,\theta_{i} } \right)\beta_{i} , $$
(23)

where θi represents the parameters of the ith kernel function, and \(H\left( {x,\theta_{i} } \right)\) denotes the hidden layer output matrix corresponding to the ith kernel function. By applying the Lagrange multiplier method and the Karush–Kuhn–Tucker (KKT) conditions, this optimization problem can be cast as a convex quadratic program and solved with effective algorithms such as gradient descent or Newton’s method. This paper adopts an improved differential evolution algorithm to solve it, as introduced in the next subsection. Before that, the following sketch illustrates the MKELM forward model for fixed kernel weights.
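This is a minimal sketch (not the authors' code) of the per-kernel solution of Eq. (21) and the combined prediction of Eq. (16), assuming the four kernel types named above with illustrative default parameters; the joint optimization of the weights d and the kernel parameters θ is deferred to Sect. 3.3.

```python
import numpy as np

def gaussian_k(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def linear_k(A, B):
    return A @ B.T

def poly_k(A, B, degree=2, c=1.0):
    return (A @ B.T + c) ** degree

def sigmoid_k(A, B, a=0.5, c=0.0):
    return np.tanh(a * (A @ B.T) + c)

def mkelm_fit(X, T, kernels, d, lam=1e-3):
    """Solve beta_i for each kernel with fixed weights d (Eq. (21))."""
    betas, N = [], X.shape[0]
    for k_fun, d_i in zip(kernels, d):
        H = k_fun(X, X)                                   # hidden layer output matrix H_i
        betas.append(np.linalg.solve(H.T @ H + lam * d_i * np.eye(N), H.T @ T))
    return betas

def mkelm_predict(Xq, Xtrain, kernels, d, betas):
    """Eq. (16): weighted sum of the per-kernel predictions."""
    y = 0.0
    for k_fun, d_i, beta in zip(kernels, d, betas):
        y = y + d_i * (k_fun(Xq, Xtrain) @ beta)
    return y
```

With kernels = [gaussian_k, linear_k, poly_k, sigmoid_k] and d = np.full(4, 0.25), this sketch corresponds to the fixed, averaged kernel weights used for the traditional MKELM baseline described in Sect. 4.4.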

3.3 Parameter Optimization for MKELM

The performance and stability of the MKELM depend on the selection of multiple parameters, including the kernel function weights d and the parameters θ associated with each kernel function. These parameters cannot be obtained analytically and must be optimized with heuristic search algorithms. In this study, we employ an improved differential evolution algorithm, whose steps are given below and summarized in a brief code sketch after step (7).

(1) Initialization: Randomly generate an initial population X = {x1, x2,…,xN} that contains N individuals, where each individual xi = {d1, d2,…,dK, θ1, θ2,…,θM} is a vector containing all parameters, θm represents the parameter of the mth kernel function, M represents the total number of kernel function parameters. At the same time, set the algorithm’s control parameters, such as mutation factor F, crossover probability CR, maximum iterations Gmax, etc.

(2) Evaluation: Calculate the fitness value f(xi) of each individual as an evaluation indicator of its optimization goal. This paper uses root mean square error (RMSE) as a fitness function, that is:

$$ f\left( {x_{i} } \right) = RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} \left( {y_{j} - t_{j} } \right)^{2} } , $$
(24)

where yj is the prediction value of MKELM for the jth sample, and tj is the actual value of the jth sample.

(3) Mutation: For each individual xi, randomly select three different individuals xa, xb, xc, and generate a mutation vector vi, as follows:

$$ v_{i} = x_{a} + F\left( {x_{b} - x_{c} } \right), $$
(25)

where F is the mutation factor, which controls the degree of mutation and has a significant effect on the convergence speed and precision of the algorithm. Generally, when the mutation factor is larger, the algorithm has stronger global search capability but tends to skip over the optimal solution; when the mutation factor is smaller, the algorithm has stronger local search capability but tends to fall into local optima. To balance global and local search, an adaptive mutation factor strategy is adopted in this paper:

$$ F_{i} = F_{{{\text{min}}}} + \frac{{F_{{{\text{max}}}} - F_{{{\text{min}}}} }}{{1 + {\text{exp}}\left( { - \tau \left( {D_{i} - \mu_{D} } \right)} \right)}}, $$
(26)

where Fi is the mutation factor of the ith individual, Fmin and Fmax are the minimum and maximum values of the mutation factor, which in this paper are set as Fmin = 0.1 and Fmax = 0.9; τ is a regulating parameter, in this paper, it is set as τ = 0.5; Di is the Euclidean distance between the ith individual and the average individual of the population; μD is the average Euclidean distance between all individuals in the population and the average individual. Consequently, a higher population diversity results in a larger mutation factor, thereby bolstering the algorithm’s global search capabilities. Conversely, a lower population diversity leads to a smaller mutation factor, thus enhancing the algorithm’s local search capabilities.

(4) Crossover: For each mutation vector vi and its corresponding target vector xi, perform a crossover operation according to a certain crossover probability CR, generating a trial vector ui, as follows:

$$ u_{ij} = \left\{ {\begin{array}{*{20}c} {v_{ij} ,} & {{\text{rand}}_{j} \le CR\;{\text{or}}\;j = j_{r} } \\ {x_{ij} ,} & {{\text{otherwise}}} \\ \end{array} } \right. $$
(27)

where randj is a random number between [0, 1], jr is a random integer between [1, K+M], ensuring that at least one parameter is crossed. The crossover probability is a parameter in the DE algorithm that controls the frequency of the crossover operation, which has an important impact on the exploration and exploitation ability of the algorithm. In order to balance the exploration and exploitation ability, this paper adopts a chaos mapping crossover probability strategy, which dynamically adjusts the size of the crossover probability according to the chaos mapping sequence. Specifically:

$$ CR_{i} = \frac{{CR_{{{\text{max}}}} - CR_{{{\text{min}}}} }}{2}\left( {1 + {\text{sin}}\left( {2\pi z_{i} } \right)} \right), $$
(28)

where CRmin and CRmax are the minimum and maximum values of the crossover rate, and zi is a chaotic sequence generated by the logistic map, whose iterative formula is:

$$ z_{i + 1} = 4z_{i} \left( {1 - z_{i} } \right). $$
(29)

(5) Selection: For each trial vector ui and its corresponding target vector xi, compare their fitness values f(ui) and f(xi). If f(ui) ≤ f(xi), then replace xi with ui; otherwise, keep xi.

$$ x_{i,G + 1} = \left\{ {\begin{array}{*{20}c} {u_{i} ,} & {f\left( {u_{i} } \right) \le f\left( {x_{i} } \right)} \\ {x_{i} ,} & {{\text{otherwise}}} \\ \end{array} } \right., $$
(30)

where G is the current iteration, and G+1 is the next iteration.

(6) Termination condition: If the maximum iteration number Gmax is reached or the fitness values of all individuals in the population are less than a certain threshold ε, then stop the iteration; otherwise, return to step (3).

(7) Output results: Choose the individual with the smallest fitness value from the final population as the optimal solution.

$$ x^{*} = {\text{arg}}\mathop {{\text{min}}}\limits_{x} f\left( x \right), $$
(31)

where x* = {d1*, d2*,…,dK*, θ1*, θ2*,…,θM*} is the optimal parameter vector. Use these parameters to construct the MKELM model, which will then be employed for short-term electric load forecasting.
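The following is a minimal sketch of this improved DE loop, assuming a user-supplied `fitness(x)` that decodes the vector x into the kernel weights d (normalized to satisfy Eq. (17)) and the kernel parameters θ, trains the MKELM, and returns the RMSE of Eq. (24); bounds handling, the fitness threshold ε, and other details are simplified, and the names are illustrative only.

```python
import numpy as np

def improved_de(fitness, dim, pop_size=50, g_max=200,
                f_min=0.1, f_max=0.9, tau=0.5, cr_min=0.1, cr_max=0.9, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, dim))                  # initial population in [0, 1]
    fit = np.array([fitness(x) for x in pop])
    z = rng.uniform(0.25, 0.75, pop_size)              # chaotic seeds for Eq. (29)

    for _ in range(g_max):
        mean_ind = pop.mean(axis=0)
        dist = np.linalg.norm(pop - mean_ind, axis=1)  # D_i, distance to the mean individual
        mu_d = dist.mean()                             # mu_D
        for i in range(pop_size):
            # Adaptive mutation factor, Eq. (26)
            f_i = f_min + (f_max - f_min) / (1.0 + np.exp(-tau * (dist[i] - mu_d)))
            # Chaotic crossover probability, Eqs. (28)-(29)
            z[i] = 4.0 * z[i] * (1.0 - z[i])
            cr_i = 0.5 * (cr_max - cr_min) * (1.0 + np.sin(2.0 * np.pi * z[i]))
            # Mutation, Eq. (25)
            a, b, c = rng.choice([k for k in range(pop_size) if k != i], 3, replace=False)
            v = np.clip(pop[a] + f_i * (pop[b] - pop[c]), 0.0, 1.0)
            # Crossover, Eq. (27): at least one component (index j_r) comes from v
            mask = rng.random(dim) <= cr_i
            mask[rng.integers(dim)] = True
            u = np.where(mask, v, pop[i])
            # Selection, Eq. (30)
            fu = fitness(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    best = int(np.argmin(fit))                         # Eq. (31)
    return pop[best], fit[best]
```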

3.4 Overall Load Forecasting Workflow

Combining the FRSCA method for feature extraction with the MKELM optimized by the improved DEA, Fig. 1 depicts the overall workflow of short-term electrical load forecasting in this study.

Fig. 1
figure 1

Load forecasting workflow in this study

4 Experimental Design and Results Analysis

4.1 Experimental Data and Platform

To validate the proposed feature selection and forecasting methods, this study obtained load, electricity price, and meteorological data from the open power system data (OPSD) platform [34]. This platform provides power system data from various European countries, which can be used for experiments related to load forecasting and power dispatching. The dataset includes power and meteorological data from European countries collected at intervals of 15, 30, and 60 min, such as actual load, forecasted load, electricity price data, average temperature, direct radiation, and diffuse radiation. For feature selection and load forecasting experiments, this paper utilizes data from the year 2017 for Great Britain (GB) on a 60-min scale, totaling 8760 data points, primarily comprising historical load data and meteorological data. Before feature selection and load forecasting, all data undergo anomaly detection and processing. The approach adopted for all anomalous and missing data in this paper is linear averaging. To reduce algorithm runtime and accelerate convergence, all types of data are normalized to the [0, 1] range.

Based on the above data, this paper constructs two types of load forecasting models. The first is a 1-h-ahead prediction, where each forecast uses actual values as input to predict the average load at the next moment. The second is day-ahead prediction, which forecasts the average load demand over the next 24 h; during day-ahead forecasting, the forecasted values replace the known actual loads in the test data as input for the model. The simulation experiments in this paper are conducted on a laptop with Windows 11 and MATLAB 2016b as the software platform, and a Core i7 2.5 GHz processor, 16 GB RAM, and a 500 GB SSD as the hardware platform.

4.2 Evaluation Metrics

To assess the accuracy and stability of the forecasting methods, this paper employs the following three evaluation metrics:

Root mean square error (RMSE):

$$ {\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} (y_{i} - t_{i} )^{2} } . $$
(32)

Mean absolute percentage error (MAPE):

$$ {\text{MAPE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} |\frac{{y_{i} - t_{i} }}{{t_{i} }}| \times 100\% . $$
(33)

R-squared value (R2_Score):

$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i} (y_{i} - t_{i} )^{2} }}{{\mathop \sum \nolimits_{i} (t_{i} - \overline{t})^{2} }}, $$
(34)

where N is the number of samples, ti is the actual value, yi is the predicted value, and \(\overline{t}\) is the average of the actual values over the test samples. For the first two metrics, smaller values indicate a more accurate forecast; an R2_Score closer to 1 indicates a more accurate forecast. A brief code sketch of the three metrics follows.
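This is a minimal sketch of Eqs. (32)–(34), assuming `y_pred` and `t_true` are NumPy arrays of equal length.

```python
import numpy as np

def rmse(y_pred, t_true):
    return np.sqrt(np.mean((y_pred - t_true) ** 2))            # Eq. (32)

def mape(y_pred, t_true):
    return np.mean(np.abs((y_pred - t_true) / t_true)) * 100   # Eq. (33), in percent

def r2_score(y_pred, t_true):
    ss_res = np.sum((y_pred - t_true) ** 2)
    ss_tot = np.sum((t_true - t_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot                               # Eq. (34)
```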

4.3 Load Feature Selection

In constructing short-term load forecasting models, the input features play a decisive role in the predictive outcome. After preprocessing the data, which includes anomaly detection and normalization, the FRSCA method described in this paper is employed to extract suitable input features. For the experimental data with a 1-h time scale, the load is treated as a time series. If the average load at time t is denoted by Lt, then Lt−1, Lt+1 and Lt−24 can represent the load data at the previous instant, the following instant, and the same instant on the previous day, respectively. To predict Lt, the candidate features selected are [Lt−1, Lt−2,…, Lt−168, Tempt, RadDirectt, RadDifft, isWeekend], where Tempt, RadDirectt, RadDifft and isWeekend represent the temperature, direct radiation, diffuse radiation, and whether it is the weekend at time t, respectively. Here, isWeekend is a binary variable where 1 indicates a weekend and 0 indicates a weekday. All candidate features are normalized to the [0, 1] interval. The objective of feature extraction is to select an appropriate number of features from the many candidates that contribute significantly to the forecasting model.

In this study, the experimental data are divided into four datasets, with the first three parts containing 2184 data points each, and the fourth part containing 2208 data points. Accordingly, the dates corresponding to each dataset are as follows: Dataset 1 corresponds to data from January 1 to March 1, 2017; Dataset 2 corresponds to dates from March 2 to July 1, 2017; Dataset 3 corresponds to dates from July 2 to September 30, 2017; Dataset 4 corresponds to dates from October 1 to December 31, 2017. Phase space reconstruction is performed on each part of the data, followed by the calculation of the fdep for each candidate feature according to the feature selection steps in Sect. 2.2. Table 1 presents the top 20 candidate features and their corresponding fdep values for each dataset.

Table 1 Features selected in the first phase by FRS

The results in Table 1 indicate that the fdep values of the selected features range between 0.6676 and 0.8237, suggesting a strong dependency relationship between these features and the target variable. Across the four datasets, temperature (Tempt) and direct radiation (RadDirectt) are almost consistently the most important features. Specifically, the temperature feature dominates in Datasets 1, 2, and 3, while the direct radiation feature stands out in Datasets 1, 3, and 4. These two factors significantly influence the load, likely because of the use of electrical devices such as air conditioning, especially during periods of high summer temperatures. Moreover, the consistent selection of the periodic features (such as Lt−24, Lt−48, and Lt−168) across the datasets indicates a clear periodicity in the load: the load at the current moment is strongly correlated with the load at the same time on previous days, reflecting the regularity of people’s daily life and work patterns. The inclusion of features with high fuzzy dependency degrees, such as the load one moment before the prediction (Lt−1) and the loads just before and after the periodic lags (such as Lt−23 and Lt−167), further confirms the autocorrelation of the load. This autocorrelation is significant for predicting future load changes from historical data.

When constructing a forecasting model, the number of input features significantly affects its predictive performance. Generally, too few input features may not express the non-linear relationship between inputs and outputs well, leading to suboptimal predictions. Conversely, too many input features can result in time-consuming training and may also decrease prediction accuracy. Therefore, to determine an appropriate number of features for the forecasting model’s input, this study selected a range from 5 to 20 features as model inputs and conducted day-ahead load forecasting experiments on Dataset 1 and Dataset 2. The results, as shown in Fig. 2, indicate that the prediction accuracy is relatively high when the number of input features is between 10 and 12. Both fewer and greater numbers of input features did not yield the best prediction outcomes.

Fig. 2
figure 2

MAPE values under different number of features in dataset1 and dataset 2

Therefore, after selecting preliminary features using the fuzzy rough set method, the correlation coefficient method described in Sect. 2.3 is employed to eliminate redundant features, ensuring the forecasting model has an appropriate number of input features. During the redundancy elimination process, the top 20 features ranked by fuzzy dependency degree were chosen as the initially screened features, with TH2 set to 0.94. Table 2 presents the final input features for each dataset obtained through the two-stage feature selection of the FRSCA method, as well as the redundant features that were eliminated. It can be observed from Table 2 that almost all periodic features were retained, while some features adjacent to the periodic lags were considered redundant and eliminated. For instance, in Dataset 1, Lt−73 and Lt−71 were considered redundant to Lt−72, and Lt−143 and Lt−145 redundant to Lt−144, and were thus removed.

Table 2 The final selected features in each dataset by FRSCA method

Additionally, for comparison purposes, Table 3 presents the features selected by four other feature selection methods (CA, MI, MFSC, and Relief) in Dataset 1 (the selection results for the other datasets are not listed, as the top features are essentially the same as in Dataset 1). To obtain a number of input features similar to that of the FRSCA method, each method selected the top 13 features based on its respective index. Comparing the results in Table 3 with those in Table 2 shows that, except for the MFSC method, which extracted the temperature feature, the other methods failed to effectively extract significant features such as temperature and radiation levels that have a substantial impact on load forecasting.

Table 3 Features selected by other methods in Dataset 1
Table 4 Comparison of the day-ahead load forecasting metric of various algorithms

4.4 Load Forecasting Results and Comparative Analysis

In this paper, all forecasting models are built based on feature selection results. The forecasting experiments are divided into two scenarios: Scenario 1 compares the outcomes of the proposed DE-MKELM method with other forecasting techniques such as traditional MKELM (referred to as MKELM hereafter), RBF network, SVR, MLP, and LSTM under the same feature inputs. It is worth noting that in the traditional MKELM, the kernel function parameters are determined via cross-validation, while the kernel function weights are fixed and averaged. For instance, in this study, four kernel functions are used, each with a weight of 0.25. In Scenario 2, the predictive performance of models is compared and analyzed by using different feature selection methods under the same forecasting technique.

In the experiments, day-ahead load forecasting and one-hour-ahead load forecasting tests are conducted for Datasets 1 to 4. Each forecasting model uses 744 training samples (equivalent to 31 days of data), and the model’s effectiveness is verified by predicting the load for the following day based on the training results. For example, Dataset 1 contains 2184 load data points, but since there are 168 total candidate load features for feature extraction, only 2016 sample pairs can be constructed from this dataset. At this point, the output samples of the first 744 training samples in the dataset correspond to the load between January 8 and February 7, 2017, while the output of the prediction samples corresponds to the load on February 8, 2017. In the DE-MKELM method, the population size N of the DE algorithm is 50, and Gmax is 200.

In Test Scenario 1, the FRSCA method was initially employed for feature extraction from the different datasets. Based on these features, the DE-MKELM method proposed in this study, along with other common forecasting models including MKELM, SVR, RBF, MLP, and LSTM, was applied to construct models for day-ahead and one-hour-ahead forecasting. Each method was independently executed 10 times to ensure the stability and reliability of the results. Figures 3 and 4 display the best single prediction results (i.e., the smallest MAPE values) obtained by each model for the different datasets across these 10 runs. Tables 4 and 5 detail the average prediction errors (MAPE and RMSE) and accuracy (R2_Score) for each method across the 10 runs in each dataset.

Fig. 3
figure 3

Day-ahead load forecasting results by different algorithms across datasets

Fig. 4
figure 4

One hour-ahead load forecasting results by different algorithms across datasets

Table 5 Comparison of the hour-ahead load forecasting metric of various algorithms

As observed from Tables 4 and 5, in day-ahead load forecasting, the proposed DE-MKELM method demonstrates the best performance across all four datasets under the same input features in terms of MAPE, RMSE, and R2_Score. It has the lowest values for MAPE and RMSE and the highest R2_Score, indicating its superior prediction accuracy, minimal error, and enhanced model interpretability. The traditional MKELM method ranks second in forecasting accuracy, demonstrating that the MKELM method has certain advantages in load forecasting; this also indicates the effectiveness of using the DE algorithm to optimize MKELM’s kernel parameters and weights. Furthermore, the performances of the SVR, RBF, and MLP methods are at a similar level, but none of them outperform the DE-MKELM. LSTM demonstrates the weakest performance in day-ahead load forecasting, especially on Dataset 1, with MAPE and RMSE values of 5.5629% and 2463.45, respectively, and an R2_Score of only 0.8089. It is further observed that while there is a significant difference in MAPE and RMSE among the various forecasting methods, the R2_Scores are not vastly different.

In the one-hour-ahead load forecasting, the proposed DE-MKELM similarly shows the best results across all four datasets, and every method’s forecasting indicators surpass those of the day-ahead forecast, which is visually evident in Figs. 3 and 4. This is because day-ahead forecasting must replace the actual values in the test input with each predicted load. The traditional MKELM method still ranks second. The one-hour-ahead forecasting indicators of the other four methods do not differ greatly, but overall the SVR, RBF, and MLP methods slightly outperform LSTM, which may be due to LSTM’s weaker suitability for this type of forecasting task or a need for further hyperparameter tuning.

In the test of Scenario 2, the same forecasting method (DE-MKELM) is applied to different feature selection methods. To ensure the fairness of the experiments, the forecast days for the four datasets are kept consistent with Scenario 1. Figures 5 and 6 present the day-ahead and hour-ahead forecasting results with the smallest MAPE obtained using different feature selection methods for the four datasets. The results indicate that the accuracy of hour-ahead forecasting is generally superior to that of day-ahead forecasting. This is primarily because hour-ahead forecasting involves only single-step predictions, where each prediction directly uses the true values as input, whereas day-ahead forecasting resembles multi-step prediction, requiring the load values obtained from each prediction to be used as input for subsequent forecasts; this can lead to error accumulation and thereby reduce prediction accuracy. Although Figs. 5 and 6 provide an intuitive comparison of forecasting results, to more comprehensively evaluate the performance of the various feature selection methods, Tables 6 and 7 detail the average values of the indicators from ten independent runs.

Fig. 5
figure 5

Day-ahead load forecasting results by different feature selection methods across datasets

Fig. 6
figure 6

Hour-ahead load forecasting results by different feature selection methods across datasets

Table 6 Comparison of day-ahead load forecasting metric of various feature selection methods

The results from day-ahead forecasting presented in Table 6 reveal that the FRSCA method exhibits unparalleled superiority in the MAPE metric across all datasets, demonstrating its high precision in load forecasting. Specifically, its MAPE values range from 1.0545% to 2.6551%, significantly lower than the other methods, particularly the Relief method, which shows the highest MAPE values. The RMSE metric further supports this finding, with FRSCA displaying the lowest errors, thereby confirming its predictive accuracy. FRSCA’s R2_Score consistently remains above 0.97 across all datasets, reflecting its robustness in explaining the variance in load data. In Table 7, for the hour-ahead load forecasting, FRSCA once again exhibits the best performance, especially in Dataset 1, where its R2_Score reaches a high of 0.9972. Despite fluctuations in MAPE and RMSE across different datasets, FRSCA maintains a high R2_Score, indicating its consistency. The Relief method shows improvement in hour-ahead forecasting compared to day-ahead forecasting but still ranks lowest among the evaluated methods.

Table 7 Comparison of hour-ahead load forecasting metric of various feature selection methods

To further validate the superiority of the methods employed in this paper, an analysis of variance (ANOVA) was conducted on the ten runs of each forecasting method, followed by the Tukey–Kramer method [35] for subsequent multiple comparisons. In Scenario 1, six forecasting models are compared, hence the numerator degrees of freedom are df1 = 6 − 1 = 5 and the denominator degrees of freedom are df2 = 6 × 10 − 6 = 54, giving a critical F-value of Fcritical = finv(1 − α, df1, df2) = 2.3861. In Scenario 2, five feature selection methods are compared, hence df1 = 5 − 1 = 4 and df2 = 5 × 10 − 5 = 45, giving Fcritical = finv(1 − α, df1, df2) = 2.5787. In both scenarios, the significance level α is set at 0.05. These critical values can be checked with the inverse F distribution, as sketched below.
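A quick check of the quoted critical values, assuming SciPy is available (the paper uses MATLAB’s finv; scipy.stats.f.ppf is the equivalent inverse F cumulative distribution function).

```python
from scipy.stats import f

alpha = 0.05
print(f.ppf(1 - alpha, 5, 54))   # Scenario 1: df1 = 5, df2 = 54 -> approximately 2.386
print(f.ppf(1 - alpha, 4, 45))   # Scenario 2: df1 = 4, df2 = 45 -> approximately 2.579
```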

In test Scenario 1, the ANOVA results of each dataset are shown in Table 8. It provides the F-Value and p-Value for each dataset in the day-ahead load forecasting and hour-ahead load forecasting under this scenario. Similarly, in Test Scenario 2, the ANOVA test results for each dataset are also given in Table 9.

Table 8 The ANOVA results in test Scenario 1 (α = 0.05, Fcritical = 2.3861)
Table 9 The ANOVA results in test Scenario 2 (α = 0.05, Fcritical = 2.5787)

From Tables 8 and 9, it can be observed that in both test scenarios (Scenario 1 and Scenario 2), whether it is day-ahead forecasting or one-hour-ahead forecasting, the ANOVA results across different datasets show that the F-values are significantly larger than the Fcritical values, and the p-values are much smaller than the significance level α (0.05). This indicates that there are significant differences between different forecasting models and feature selection methods. However, despite the ANOVA results indicating these significant differences, they do not directly imply that the DE-MKELM and FRSCA methods proposed in this paper perform significantly better than the other methods. Therefore, further multiple comparison analyses are needed to clarify this. To begin, we will analyze the multiple comparison results of day-ahead forecasting in Test Scenario 1.

In Figs. 7 and 8, the results of the multiple comparisons for the day-ahead forecasting models using the MAPE and RMSE metrics are illustrated. Using the proposed DE-MKELM as the reference baseline (red dashed line), the other five methods—MKELM, SVR, RBF, MLP, and LSTM—were compared against it, and the mean differences and Tukey–Kramer confidence intervals (CIs) for these five methods were plotted in each sub-figure. If the CI line of a comparison method does not intersect the baseline, the results of the DE-MKELM method differ significantly from those of that method. It can therefore be observed that the DE-MKELM method demonstrates a significant advantage in both MAPE and RMSE compared to the other algorithms, validating the significance and superiority of the proposed method in terms of forecasting accuracy.

Fig. 7
figure 7

ANOVA and multiple comparison results of MAPE metric in day-ahead prediction for various algorithms

Fig. 8
figure 8

ANOVA and multiple comparison results of RMSE metric in day-ahead prediction for various algorithms

In Fig. 9, the multiple comparison of the R2_Score metric obtained from day-ahead predictions using the different forecasting models is also presented. From Table 8, it can be seen that although there are significant differences among the six algorithms across all four datasets (with p < 0.05 and F-Value > 2.3861), the multiple comparisons of the R2_Score metric in Fig. 9 indicate that confidence interval lines intersect the reference baseline in all four datasets. For instance, the confidence interval lines for MKELM, SVR, RBF, and MLP in Dataset 1, as well as MKELM and MLP in Dataset 2, intersect the reference baseline. This suggests that their R2_Score results do not differ significantly from those of the proposed DE-MKELM. This observation is also reflected in Table 4, where the average R2_Scores for MKELM, SVR, RBF, and MLP in Dataset 1 are 0.9947, 0.9639, 0.9717, and 0.9660, respectively, showing no marked difference. In fact, the R2_Score measures the predicted values’ ability to explain the variability of the actual values, rather than simply measuring the gap between the predicted and actual values. A higher R2_Score indicates that the model captures the data trend well, but this does not necessarily mean that the model’s predictions are accurate (i.e., that the error is small). For example, a model may fit the data well with a high R2_Score but still have a large mean absolute percentage error (MAPE) and root mean square error (RMSE). Conversely, another model may have a small prediction error but a low R2_Score because it does not capture all the variability in the data as effectively.

Fig. 9
figure 9

ANOVA and multiple comparison results of R2_Score metric in day-ahead prediction for various algorithms

To enhance the understanding of the efficiency of the methods proposed in this study, this paper discusses in detail the multiple comparison outcomes for the hour-ahead predictions in Scenario 1 and the two types of predictions in Scenario 2, with the corresponding illustrations given in Figs. 10, 11, 12, 13, 14 and 15. It is noteworthy that, since the R2_Score primarily evaluates the fit of variability between predicted and actual values, its ANOVA results are not discussed in this analysis; the exploration of algorithm performance is conducted solely on the MAPE and RMSE metrics.

Fig. 10
figure 10

ANOVA and multiple comparison results of MAPE in hour-ahead prediction for various algorithms

Fig. 11
figure 11

ANOVA and multiple comparison results of RMSE in hour-ahead prediction for various algorithms

Fig. 12
figure 12

ANOVA and multiple comparison results of MAPE in day-ahead prediction for various feature selection methods

Fig. 13
figure 13

ANOVA and multiple comparison results of RMSE in day-ahead prediction for various feature selection methods

Fig. 14
figure 14

ANOVA and multiple comparison results of MAPE in hour-ahead prediction for various feature selection methods

Fig. 15
figure 15

ANOVA and multiple comparison results of RMSE in hour-ahead prediction for various feature selection methods

Figures 10 and 11 present the ANOVA and multiple comparison results for the forecasting algorithms in hour-ahead prediction in Scenario 1, based on the MAPE and RMSE metrics. These results indicate that hour-ahead predictions exhibit even more significant differences than day-ahead load forecasting, as evidenced by lower p-values and higher F-values. The multiple comparisons across the various datasets underscore the significant advantages of the proposed DE-MKELM algorithm over the five reference algorithms. Additionally, Figs. 12, 13, 14 and 15 detail the ANOVA results for the MAPE and RMSE of day-ahead and hour-ahead predictions under Scenario 2 with different feature selection methods, offering insights into the role of feature selection strategies in improving the accuracy of forecasting models.

The results in Figs. 12, 13, 14 and 15 indicate that, in the comparison of feature selection algorithms, the tests across all datasets show small p-values and large F-values, signifying statistically significant differences between the outcomes of the feature selection methods. The multiple comparison results, for both day-ahead and hour-ahead predictions, consistently demonstrate that the FRSCA method proposed in this paper significantly outperforms the other four feature selection methods across all datasets. Moreover, in the predictions of Scenario 2, the confidence intervals for the compared feature selection methods are shorter, particularly in Datasets 1 and 2. The length of a confidence interval is commonly used to measure the uncertainty of prediction results: shorter confidence intervals imply more concentrated prediction outcomes, indicating more consistent and stable model predictions. This phenomenon therefore suggests that, under the same input feature conditions, the variance among the ten prediction results of the proposed DE-MKELM method is smaller, further proving the strong stability and reliability of the proposed DE-MKELM method.

5 Conclusion

This study proposes a short-term electric load forecasting method that integrates FRS theory with an MKELM optimized by a DEA (DE-MKELM). The introduction of FRS theory offers an effective feature selection tool that enhances feature relevance and reduces computational complexity through a two-stage filtering process. The proposed DE-MKELM constructs a robust forecasting model, with a multi-kernel structure and a parameter optimization strategy that adapts well to feature variations.

Simulation experiments utilized actual electric load data from the OPSD platform for validation. Results indicate that the proposed DE-MKELM method excels in short-term electric load forecasting, showing significant advantages over common prediction methods in MAPE, RMSE, and R2_Score metrics. The FRSCA method captures non-linear relationships and interactions between features, extracting periodic, short-term trends, temperature, and radiation features while eliminating redundant features. Results confirm that features extracted using the proposed method achieve superior prediction accuracy. ANOVA and multiple comparisons further validate the statistical significance and superiority of the proposed methods.

Future research may optimize feature selection strategies and prediction models, including the introduction of Deep Learning (DL) methods and predictions for complex electric power system data, encompassing distributed generation and renewable energy integration. These efforts aim to provide robust technical support for the stable operation and intelligent management of power systems.