1 Introduction

Short-term electrical load forecasting forms a crucial foundation for the operation and planning of power systems, playing a significant role in ensuring their safety, stability, and economical performance [1,2,3]. With the deepening reform of electricity marketization, improving the accuracy and reliability of day-ahead electrical load forecasting is becoming increasingly significant for the decision-making and trading of electricity market participants. Achieving high precision in day-ahead electrical load forecasting is essential to address the complexities of the non-linear relationships influenced by multiple factors, such as meteorological, economic, societal, and policy variables [4,5,6].

To tackle this challenge, researchers have developed various forecasting technologies, which can be broadly categorized into three main types: classical methods, traditional methods, and intelligent methods. Each has its own advantages and applications. Classical forecasting methods, such as regression analysis and time series methods, rely on mathematical models to fit and infer from historical data. They typically assume that the data exhibit statistical characteristics such as stationarity and linearity, and therefore may encounter difficulties when dealing with non-linear or non-stationary load data [7, 8]. Traditional forecasting methods, including load derivation, similar day, and exponential smoothing, mainly process load data based on empirical rules or statistical principles. These methods depend on the periodicity and regularity of the data and may not be well suited to variable trends and abrupt changes in load [9]. Intelligent forecasting methods, such as the back-propagation (BP) neural network, Elman neural network, radial basis function (RBF) neural network, support vector regression (SVR), multilayer perceptron (MLP), extreme learning machine (ELM), and long short-term memory (LSTM) network, use artificial intelligence techniques to learn and reason from load data, showing powerful non-linear fitting capabilities and adaptability. However, they require a large amount of training data and parameter tuning, and the model structure and internal mechanisms may lack transparency [10,11,12,13,14,15].

As an intelligent forecasting method grounded in machine learning, the ELM operates as an effective single hidden layer feedforward neural network (SLFN). The ELM offers notable benefits, such as rapid learning, robust generalization capabilities, and a minimal number of parameters [16, 17]. However, the ELM also has certain limitations, including model instability derived from the randomness of initial input weights and biases, and a lack of flexibility caused by employing a single kernel function. To mitigate these challenges, researchers have extended the ELM into the multi-kernel extreme learning machine (MKELM) and proposed various enhancements to it, including the optimization of kernel function parameters and weights, with significant applications in various fields. For instance, Xie and Wu introduced an innovative approach using a particle swarm optimization (PSO) algorithm [18]. Their research focused on optimizing each kernel function’s parameters and weights, and proved significant for the maximum power point tracking algorithm of a photovoltaic (PV) system based on irradiance estimation. In another scholarly contribution, Kongsorot et al. leveraged the principles of fuzzy set theory to devise a kernel extreme learning machine that can tackle multi-label classification problems effectively [19]. This innovation in MKELM has offered new possibilities for handling complex data categorization tasks. Furthermore, Ahuja and Vishwakarma adopted a deterministic approach to MKELM, incorporating fuzzy feature extraction for pattern classification [20]. Notably, their efforts aimed at resolving face recognition problems, demonstrating the advanced applications of MKELM methods. In addition to these advancements, recent studies have explored the integration of group acceptance sampling techniques to enhance the accuracy of prediction models, particularly in the context of quality control and reliability [21, 22]. These approaches underscore the potential of group acceptance sampling plans to refine prediction accuracy and reliability, providing valuable insights into their possible application in short-term electrical load forecasting.

Despite these advancements, however, the application of MKELM to load forecasting remains limited. Existing methods continue to exhibit notable shortcomings, including an overreliance on feature selection methods, arbitrary choices of kernel function types, and low-efficiency parameter optimization. This paper therefore further investigates load feature extraction methods and an improved differential evolution algorithm (DEA) for parameter optimization. By pursuing these improvements, we aim to enhance the accuracy of load forecasting with MKELM.

As discussed above, in addition to choosing the right prediction model, selecting appropriate features is just as critical to improving the accuracy of day-ahead electricity load forecasting. The aim of feature selection is to identify, from a multitude of original features, those that have a significant impact on the prediction target. This process not only reduces the dimensionality and complexity of the model, but also enhances the model’s generalization and interpretability. Feature selection methods mainly fall into three categories: filter methods, wrapper methods, and embedded methods. Filter methods focus on the statistical characteristics of features, evaluating the correlation between features and the prediction target for selection. Examples include correlation analysis (CA) [23], fuzzy rough sets (FRS) [24], and the mutual information (MI) method [25, 26]. The CA method identifies features that are highly correlated with the target variable by measuring the linear correlation between each feature and the target. The FRS method generates fuzzy information granules based on different kernel functions and rates a feature’s contribution to the prediction target using fuzzy dependence. As a non-linear method, MI can capture various relationships between variables, including non-linear ones. Wrapper methods, on the other hand, select subsets of features based on the performance of the prediction models, typically involving iterative searches to optimize the model’s forecasting results. For example, the Relief algorithm selects features by evaluating their ability to differentiate between nearby samples [27]. Embedded methods meld the advantages of filter and wrapper methods, automatically performing feature selection during model training. For instance, the multivariate feature selection criterion (MFSC) [28] evaluates the importance of features during training and selects the most contributive features.

Despite the strengths of these methods, they are not without flaws. Filter methods, while simple to compute, may overlook interactions between features and are prone to retaining redundant features, which affects prediction outcomes. Wrapper methods take the dependence among features into account but are computationally complex and susceptible to overfitting. Embedded methods can combine feature selection with model optimization but rely on a specific model, making the interpretation of feature importance a challenge. To address these issues, this paper introduces a two-stage feature selection method that combines FRS theory and a CA method, referred to as the FRSCA method. Initially, FRS is used to handle the uncertainty and fuzziness of the data by generating fuzzy information granules and evaluating the contribution of each feature to the prediction target. CA is then employed to assess the correlation among the initially screened features, thereby eliminating redundant features. Compared with traditional filter methods, the proposed FRSCA method better captures non-linear relationships and interactions among features; compared with wrapper methods, it reduces computational complexity and the risk of overfitting; and compared with embedded methods, it improves the interpretability and universality of feature selection [26, 29].

Building upon the analysis presented, this paper introduces a robust short-term electrical load forecasting framework that harnesses the strengths of FRS theory and MKELM. The main work and innovations of this paper are as follows: (1) A two-stage feature selection method based on fuzzy rough sets is proposed. FRS can produce different fuzzy information granules according to different kernel functions, and evaluate the contribution of each feature to the prediction target through the fuzzy dependency degree. In addition, redundant features are further eliminated by the correlation coefficient method. (2) A MKELM prediction model optimized by a differential evolution algorithm is introduced. This method can train different kernel functions based on different feature subsets, and the parameters and weights of each kernel function are optimized by the improved differential evolution algorithm. (3) Experimental verification has been conducted on actual electricity load data. The results demonstrate that the method proposed in this paper has high accuracy and reliability in day-ahead electricity load forecasting.

The structure of the rest of the paper is organized as follows: Sect. 2 introduces the theory of FRS and its application in load feature selection. Section 3 elaborates on the principles of MKELM, as well as the methods for optimizing kernel function parameters and weights. Section 4 presents the experiment design and result analysis. Finally, Sect. 5 summarizes the main conclusions of the paper and offers future research directions.

2 Principle of Load Feature Selection Based on Fuzzy Rough Set Theory

Feature selection is a critical step in data mining and machine learning. Its purpose is to eliminate redundant and irrelevant features, reduce data dimensionality, and enhance learning efficiency and generalization ability while maintaining or improving learning accuracy. This paper employs a filter method based on FRS theory for load feature selection. FRS theory provides a feature selection approach that relies on fuzzy similarity relations and variable precision approximation; it can effectively handle the uncertainty and fuzziness in the data while offering excellent interpretability.

2.1 Basic Concepts of Fuzzy Rough Set Theory

The basic definition of FRS theory is as follows [30, 31]:

Definition 1

(Fuzzy equivalence relation): Given a non-empty finite set U, for any two elements x, y ∈ U, if there exists a real number μR(x, y) ∈ [0,1], indicating the degree of similarity between x and y under relation R, and if this satisfies the following conditions:

$$ \left\{ {\begin{array}{*{20}c} {\mu_{R} \left( {x,x} \right) = 1} \\ {\mu_{R} \left( {x,y} \right) = \mu_{R} \left( {y,x} \right)} \\ {\mu_{R} \left( {x,z} \right) \ge min\left\{ {\mu_{R} \left( {x,y} \right),\mu_{R} \left( {y,z} \right)} \right\}} \\ \end{array} } \right.. $$
(1)

Then, R is termed as a fuzzy equivalence relation on U, and μR(x, y) is referred to as the similarity between x and y under relation R.

Definition 2

(Fuzzy equivalence class): Given a non-empty finite data set U and a fuzzy equivalence relation R, for any element x ∈ U, define

$$ \left[ x \right]_{R} = \left\{ {y \in U{|}\mu_{R} \left( {x,y} \right) > 0} \right\} $$
(2)

as the fuzzy equivalence class represented by x, and μR(x, y) is referred to as the membership degree of the element y to the fuzzy equivalence class [x]R.

Definition 3

(Fuzzy upper and lower approximations): Given a non-empty finite data set U, a fuzzy equivalence relation R, and a subset X ⊆ U, define

$$ \underline {R} \left( X \right) = \left\{ {x \in U{|}\mathop {\sup }\limits_{y \in X} \mu_{R} \left( {x,y} \right) = 1} \right\}, $$
(3)
$$ \overline{R}\left( X \right) = \left\{ {x \in U{|}\mathop {\inf }\limits_{y \in X} \mu_{R} \left( {x,y} \right) > 0} \right\} $$
(4)

as the fuzzy lower and upper approximations of X under the relation R, respectively.

Definition 4

(Fuzzy dependency): Given a non-empty finite data set U, a fuzzy equivalence relation R, and a decision attribute D, define

$$ \gamma_{R} \left( D \right) = \frac{{\mathop \sum \nolimits_{x \in U} \mathop {\sup }\limits_{y \in D\left( x \right)} \mu_{R} \left( {x,y} \right)}}{\left| U \right|} $$
(5)

as the fuzzy dependency of the decision attribute D on the relation R, where D(x) represents the set of elements with the same decision value as x, that is

$$ D\left( x \right) = \{ y \in U|D\left( y \right) = D\left( x \right)\} . $$
(6)

The fuzzy dependency reflects the contribution of an attribute or a subset of attributes to the decision attribute. The larger its value, the more important the attribute or subset of attributes is.

2.2 Feature Selection for Load Using FRS

In load forecasting problems, it is common practice to extract useful features such as periodic features, short-term trend features, temperature, daylight, and holidays as input for the prediction model. However, not all features contribute to the forecasting objective; some may be redundant or irrelevant, increasing the complexity of the model and reducing its generalization and interpretability. This paper utilizes a fuzzy rough set to implement the following steps for load feature extraction (a brief code sketch is given after the list):

(1) Suppose U = {u1, u2, …, um} is a sample space comprising m samples; C = {c1, c2, …, cn} is a candidate feature set encompassing n candidate features; and D is a decision attribute representing the prediction objective. Based on the candidate feature set C and decision attribute D, an m × (n + 1) decision table, \(T = \left( {U,C \cup \left\{ D \right\}} \right)\), can be constructed, wherein each row represents a sample and each column corresponds to an attribute.

(2) For each candidate feature \(c_{i} \in C\), select a suitable kernel function \(k_{i} \left( { \cdot , \cdot } \right)\), and compute the fuzzy relation \(\mu_{R,i} :U \times U \to \left[ {0,1} \right]\), over the sample space U, defined as follows:

$$ \mu_{R,i} \left( {x,y} \right) = k_{i} \left( {c_{i} \left( x \right),c_{i} \left( y \right)} \right), $$
(7)

where ci(x) denotes the value of the sample x in feature \(c_{i}\). Different kernel functions can reflect different levels of similarity. This paper uses the most common Gaussian kernel [32].

(3) Based on the fuzzy relation μR,i, the fuzzy dependence degree (FDD) of each candidate feature ci on the decision attribute D can be calculated, as follows:

$$ fdep\left( {c_{i} ,D} \right) = \frac{1}{m}\mathop \sum \limits_{x \in U} \mathop {\sup }\limits_{y \in D\left( x \right)} \left\{ {1 - \mu_{R,i} \left( {x,y} \right)} \right\}, $$
(8)

where D(x) denotes the collection of elements with the same decision value as x. The FDD reflects the degree of contribution of a candidate feature to the prediction objective; the higher it is, the more crucial the candidate feature.

(4) According to the FDD fdep(ci, D), sort all candidate features and set a threshold \(TH_{1} \in \left[ {0, 1} \right]\). Select the candidate features with an fdep value greater than or equal to TH1 as the feature subset for the next stage. The threshold TH1 can be adjusted according to the needs of practical problems; generally, the greater the threshold, the smaller the feature subset. Alternatively, one can simply select the top N features after sorting by fdep value. To facilitate identical experimental conditions across different datasets, the second approach is adopted in this paper.
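The first stage can be implemented in a few lines. The following is a minimal Python sketch (not the authors' code) of Eqs. (7)–(8), assuming a Gaussian kernel with width sigma and assuming that the continuous, normalized load target is discretized into bins so that the decision classes D(x) are well defined; the function names and parameters are illustrative only.

```python
import numpy as np

def gaussian_fuzzy_relation(feature, sigma=0.1):
    """Eq. (7): mu_{R,i}(x, y) = k_i(c_i(x), c_i(y)) with a Gaussian kernel."""
    diff = feature[:, None] - feature[None, :]
    return np.exp(-(diff ** 2) / (2.0 * sigma ** 2))

def fuzzy_dependency(feature, target_class, sigma=0.1):
    """Eq. (8): fdep(c_i, D) = (1/m) * sum_x sup_{y in D(x)} {1 - mu_{R,i}(x, y)}."""
    mu = gaussian_fuzzy_relation(feature, sigma)
    m = len(feature)
    total = 0.0
    for x in range(m):
        same_class = np.where(target_class == target_class[x])[0]   # D(x)
        total += np.max(1.0 - mu[x, same_class])
    return total / m

def rank_features(X, y, n_bins=10, top_n=20, sigma=0.1):
    """Step (4): rank the candidate features by fdep and keep the top_n."""
    # Discretize the normalized target so that D(x) can be formed
    # (an assumption of this sketch; the paper does not detail this step).
    target_class = np.digitize(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    scores = np.array([fuzzy_dependency(X[:, j], target_class, sigma)
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]
```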

2.3 Redundant Feature Elimination

It is observed in the experiments that the feature subset selected by the fuzzy rough set, either through the threshold TH1 or based on the top N features sorted by fdep value, might contain some redundant features. Redundant features not only fail to improve the accuracy of the prediction model but also increase the model’s training time. Therefore, this paper introduces the correlation coefficient method to further eliminate redundant features. In the training sample set, the correlation coefficient between the ith feature vector ci and the jth feature vector cj can be obtained from the following equation:

$$ \rho \left( {c_{i} ,c_{j} } \right) = \frac{{{\text{cov}} \left( {c_{i} ,c_{j} } \right)}}{{\sigma_{i} \cdot \sigma_{j} }} \left( {i,j = 1,2, \ldots ,n} \right), $$
(9)

where \(cov\left( {c_{i} ,c_{j} } \right)\) is the covariance between ci and cj, and σi and σj are the standard deviations of ci and cj, respectively. A value of \(\rho \left( {c_{i} ,c_{j} } \right)\) equal to 1 or −1 indicates a complete positive or negative linear correlation between ci and cj, respectively; a value of 0 indicates no correlation; and values in (−1, 0) or (0, 1) represent partial correlation. In other words, the higher |\({\uprho }\left( {c_{i} ,c_{j} } \right)\)|, the stronger the correlation between ci and cj and the more similar the information they contain. Therefore, a threshold TH2 can be set; if the correlation coefficient between two feature vectors exceeds TH2, the two features are considered similar, and the one with the smaller fdep value is eliminated as redundant.

Based on the above analysis, the steps of the two-stage feature selection method adopted in this paper are as follows (a code sketch of the second stage is given after the list):

(1) First, calculate the FDD of all candidate features using formula (8), sort the FDDs from highest to lowest, and select the top N features with the highest fdep value as the candidate feature subset.

(2) Sort the candidate feature subset by fdep value from highest to lowest. The feature with the highest value is selected as the first input feature. Then, calculate the correlation coefficients between this feature and the features ranked after it. If the correlation coefficient between any of those features and this feature exceeds the threshold TH2, that feature is considered redundant and is eliminated without further comparison with other features.

(3) Next, calculate the correlation coefficients between the feature with the second-highest fdep value (if that feature has been eliminated as redundant, move to the next one) and the features ranked after it, and eliminate redundant features according to TH2. Continue this cycle until the feature with the smallest fdep value in the candidate subset has been processed. Finally, the remaining features in the candidate feature subset are used as the final input features for the prediction model.
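The following is a minimal sketch of this second stage, assuming `order` is the fdep-sorted index list produced by the first-stage ranking above and `X` holds the normalized candidate features column-wise; the names are illustrative only, and TH2 = 0.94 matches the value used in Sect. 4.3.

```python
import numpy as np

def eliminate_redundant(X, order, th2=0.94):
    """Drop any later-ranked feature whose absolute correlation coefficient
    (Eq. (9)) with an already kept feature exceeds the threshold TH2."""
    kept = []
    for idx in order:                      # order is sorted by fdep, highest first
        redundant = False
        for k in kept:
            rho = np.corrcoef(X[:, idx], X[:, k])[0, 1]
            if abs(rho) > th2:             # similar information content
                redundant = True
                break
        if not redundant:
            kept.append(idx)
    return kept                            # final input features for the model
```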

3 Multi-kernel Extreme Learning Machine for Load Forecasting

3.1 Basic Concepts of ELM

The ELM is a single hidden layer feedforward neural network characterized by its rapid learning and good generalization capabilities. The main idea of ELM is to randomly generate weights and biases from the input layer to the hidden layer, and then analytically determine the weights of the output layer, thus avoiding the complexity and limitations of the gradient-based back-propagation algorithm commonly used in traditional neural networks.

Given a training set \(\{ (x_{i} ,t_{i} )\}_{i = 1}^{N}\), where xi ∈ Rn is the input vector, ti ∈ Rm is the target vector, N is the number of samples, n is the input dimension, and m is the output dimension, the ELM can be expressed in the following form:

$$ y = f\left( x \right) = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g\left( {w_{i} \cdot x + b_{i} } \right), $$
(10)

where y ∈ Rm is the output vector, f(x) is the prediction function, L is the number of hidden layer nodes, βi ∈ Rm is the weight from the ith hidden layer node to the output layer, g(·) is the activation function of the hidden layer nodes, wi ∈ Rn is the input weight vector of the ith hidden layer node, and bi ∈ R is the bias of the ith hidden layer node. Written in matrix form, the equation becomes:

$$ H\beta = T, $$
(11)

where T = [t1,t2,…,tN]T ∈ RN×m is the target matrix; β = [β1,β2,…,βL]T ∈ RL×m is the weight matrix from the hidden layer to the output layer; H ∈ RN×L is the hidden layer output matrix, which can be represented as:

$$ H = \left[ {\begin{array}{*{20}c} {g\left( {w_{1} \cdot x_{1} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{1} + b_{L} } \right)} \\ \vdots & \ddots & \vdots \\ {g\left( {w_{1} \cdot x_{N} + b_{1} } \right)} & \cdots & {g\left( {w_{L} \cdot x_{N} + b_{L} } \right)} \\ \end{array} } \right]. $$
(12)

The goal of the ELM is to solve for β to minimize the output error, which is expressed as:

$$ \mathop {\min }\limits_{\beta } \left| {\left| {T - H\beta } \right|} \right|^{2} , $$
(13)

where \(\left| {\left| {\; \cdot \;} \right|} \right|\) denotes the norm. When H is a full-rank matrix, the equation has a unique solution:

$$ \beta = H^{\dag } T, $$
(14)

where \(H^{\dag }\) is the Moore–Penrose pseudo-inverse of H. When H is not a full-rank matrix, regularization methods such as ridge regression can be used to solve for β [33]:

$$ \beta = \left( {H^{T} H + \lambda I} \right)^{ - 1} H^{T} T, $$
(15)

where λ is the regularization parameter, and I is the identity matrix.
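As a concrete illustration of Eqs. (10)–(15), the following is a minimal Python sketch of a basic ELM (the paper's experiments were run in MATLAB; this sketch, with a sigmoid activation and the ridge-regularized solution of Eq. (15), is given for illustration only and is not the authors' implementation).

```python
import numpy as np

class ELM:
    def __init__(self, n_hidden=50, lam=1e-3, rng=None):
        self.L, self.lam = n_hidden, lam
        self.rng = rng or np.random.default_rng(0)

    def _hidden(self, X):
        # g(w_i . x + b_i) with a sigmoid activation
        return 1.0 / (1.0 + np.exp(-(X @ self.W.T + self.b)))

    def fit(self, X, T):
        n = X.shape[1]
        self.W = self.rng.uniform(-1.0, 1.0, size=(self.L, n))   # random input weights
        self.b = self.rng.uniform(-1.0, 1.0, size=self.L)        # random biases
        H = self._hidden(X)                                       # Eq. (12)
        # Eq. (15): beta = (H^T H + lam I)^{-1} H^T T
        self.beta = np.linalg.solve(H.T @ H + self.lam * np.eye(self.L), H.T @ T)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta                        # Eq. (10)
```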

3.2 The Principle of MKELM

The MKELM is an extension of ELM based on multi-kernel learning, which utilizes multiple different kernel functions to enhance the performance and flexibility of ELM. The main idea of MKELM is to map input data into a high-dimensional feature space and then perform linear regression or classification in that space. The advantage of MKELM is that it avoids explicitly constructing the feature mapping function and instead implicitly calculates the inner product in the feature space through kernel functions.

Given a training set \(\{ (x_{i} ,t_{i} )\}_{i = 1}^{N}\) and assuming that the MKELM uses K different kernel functions {g1(·,·), g2(·,·), …, gK(·,·)}, the MKELM can be expressed as:

$$ y = f\left( x \right) = \mathop \sum \limits_{i = 1}^{K} d_{i} \mathop \sum \limits_{j = 1}^{N} \beta_{ij} g_{i} \left( {x,x_{j} } \right), $$
(16)

subject to constraints:

$$ \mathop \sum \limits_{i = 1}^{K} d_{i} = 1, d_{i} > 0, i = 1,2, \ldots ,K, $$
(17)

where di ∈ [0, 1] represents the weight coefficient of the ith kernel function; βij ∈ Rm is the weight vector from the jth sample to the output layer under the ith kernel function; and gi(·,·) is the ith kernel function. The MKELM leverages a variety of kernel functions, including Gaussian, linear, polynomial, and Sigmoid, to enhance the model’s capability [18, 20]. These kernel functions are also utilized in this study to construct the MKELM model. Written in matrix form, the equation becomes:

$$ \mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} = T, $$
(18)

where Hi ∈ RN×N is the hidden layer output matrix under the ith kernel function, βi = [βi1, βi2,…,βiN]T ∈ RN×m is the weight matrix under the ith kernel function, and T is the target matrix. The objective of MKELM is to solve the following optimization problem:

$$ \mathop {\min }\limits_{{\left\{ {\beta_{i} ,d_{i} } \right\}}} ||T - \mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} ||^{2} + \lambda \mathop \sum \limits_{i = 1}^{K} d_{i} ||\beta_{i} ||^{2} , $$
(19)

where λ is the regularization parameter. The optimization problem of MKELM can be divided into two sub-problems: one is to solve for βi with fixed di; the other is to solve for di with fixed βi. These two sub-problems can be iteratively alternated until convergence. For the first sub-problem, with fixed di, the optimization problem of MKELM can be simplified to:

$$ \mathop {\min }\limits_{{\beta_{i} }} ||\mathop \sum \limits_{i = 1}^{K} d_{i} H_{i} \beta_{i} - T||^{2} + \lambda \mathop \sum \limits_{i = 1}^{K} d_{i}^{2} ||\beta_{i} ||^{2} . $$
(20)

Using block matrix methods and the properties of matrix multiplication, this reduces to a standard ELM problem whose analytical solution is:

$$ \beta_{i} = \left( {H_{i}^{T} H_{i} + \lambda d_{i} I} \right)^{ - 1} H_{i}^{T} T. $$
(21)

To optimize the weights and parameters di and βi of each kernel function, one can fix βi and transform the optimization problem of MKELM into a non-linear programming problem as follows:

$$ \mathop {\min }\limits_{{d_{i} ,\theta_{i} }} \left( ||f\left( {x,d_{i} ,\theta_{i} } \right) - T||^{2} + \lambda ||d||^{2} \right), $$
(22)

where \(f\left( {x,d_{i} ,\theta_{i} } \right)\) is the prediction function of MKELM, which is defined as follows:

$$ f\left( {x,d_{i} ,\theta_{i} } \right) = \mathop \sum \limits_{i = 1}^{K} d_{i} H\left( {x,\theta_{i} } \right)\beta_{i} , $$
(23)

where θi represents the parameters of the ith kernel function, and \(H\left( {x,\theta_{i} } \right)\) denotes the hidden layer output matrix corresponding to the ith kernel function. By applying the Lagrange multiplier method and the Karush–Kuhn–Tucker (KKT) conditions, this optimization problem can be cast as a convex quadratic program and solved with effective algorithms such as gradient descent or Newton’s method. This paper adopts an improved differential evolution algorithm to solve it, as introduced in the next subsection. Before that, the following sketch illustrates the MKELM forward model for fixed kernel weights.
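This is a minimal sketch (not the authors' code) of the per-kernel solution of Eq. (21) and the combined prediction of Eq. (16), assuming the four kernel types named above with illustrative default parameters; the joint optimization of the weights d and the kernel parameters θ is deferred to Sect. 3.3.

```python
import numpy as np

def gaussian_k(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def linear_k(A, B):
    return A @ B.T

def poly_k(A, B, degree=2, c=1.0):
    return (A @ B.T + c) ** degree

def sigmoid_k(A, B, a=0.5, c=0.0):
    return np.tanh(a * (A @ B.T) + c)

def mkelm_fit(X, T, kernels, d, lam=1e-3):
    """Solve beta_i for each kernel with fixed weights d (Eq. (21))."""
    betas, N = [], X.shape[0]
    for k_fun, d_i in zip(kernels, d):
        H = k_fun(X, X)                                   # hidden layer output matrix H_i
        betas.append(np.linalg.solve(H.T @ H + lam * d_i * np.eye(N), H.T @ T))
    return betas

def mkelm_predict(Xq, Xtrain, kernels, d, betas):
    """Eq. (16): weighted sum of the per-kernel predictions."""
    y = 0.0
    for k_fun, d_i, beta in zip(kernels, d, betas):
        y = y + d_i * (k_fun(Xq, Xtrain) @ beta)
    return y
```

With kernels = [gaussian_k, linear_k, poly_k, sigmoid_k] and d = np.full(4, 0.25), this sketch corresponds to the fixed, averaged kernel weights used for the traditional MKELM baseline described in Sect. 4.4.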

3.3 Parameter Optimization for MKELM

The performance and stability of the MKELM depend on the selection of multiple parameters, including the kernel function weights d and the parameters θ associated with each kernel function. These parameters cannot be obtained analytically and must be optimized with heuristic search algorithms. In this study, we employ an improved differential evolution algorithm, whose steps are given below and summarized in a brief code sketch after step (7).

(1) Initialization: Randomly generate an initial population X = {x1, x2,…,xN} that contains N individuals, where each individual xi = {d1, d2,…,dK, θ1, θ2,…,θM} is a vector containing all parameters, θm represents the parameter of the mth kernel function, M represents the total number of kernel function parameters. At the same time, set the algorithm’s control parameters, such as mutation factor F, crossover probability CR, maximum iterations Gmax, etc.

(2) Evaluation: Calculate the fitness value f(xi) of each individual as an evaluation indicator of its optimization goal. This paper uses root mean square error (RMSE) as a fitness function, that is:

$$ f\left( {x_{i} } \right) = RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} \left( {y_{j} - t_{j} } \right)^{2} } , $$
(24)

where yj is the prediction value of MKELM for the jth sample, and tj is the actual value of the jth sample.

(3) Mutation: For each individual xi, randomly select three different individuals xa, xb, xc, and generate a mutation vector vi, as follows:

$$ v_{i} = x_{a} + F\left( {x_{b} - x_{c} } \right), $$
(25)

where F is the mutation factor, which controls the degree of mutation and has a significant effect on the convergence speed and precision of the algorithm. Generally, when the mutation factor is larger, the algorithm has stronger global search capability but tends to skip over the optimal solution; when the mutation factor is smaller, the algorithm has stronger local search capability but tends to fall into local optima. To balance global and local search, an adaptive mutation factor strategy is adopted in this paper:

$$ F_{i} = F_{{{\text{min}}}} + \frac{{F_{{{\text{max}}}} - F_{{{\text{min}}}} }}{{1 + {\text{exp}}\left( { - \tau \left( {D_{i} - \mu_{D} } \right)} \right)}}, $$
(26)

where Fi is the mutation factor of the ith individual, Fmin and Fmax are the minimum and maximum values of the mutation factor, which in this paper are set as Fmin = 0.1 and Fmax = 0.9; τ is a regulating parameter, in this paper, it is set as τ = 0.5; Di is the Euclidean distance between the ith individual and the average individual of the population; μD is the average Euclidean distance between all individuals in the population and the average individual. Consequently, a higher population diversity results in a larger mutation factor, thereby bolstering the algorithm’s global search capabilities. Conversely, a lower population diversity leads to a smaller mutation factor, thus enhancing the algorithm’s local search capabilities.

(4) Crossover: For each mutation vector vi and its corresponding target vector xi, perform a crossover operation according to a certain crossover probability CR, generating a trial vector ui, as follows:

$$ u_{ij} = \left\{ {\begin{array}{*{20}c} {v_{ij} ,} & {{\text{rand}}_{j} \le CR\;{\text{or}}\;j = j_{r} } \\ {x_{ij} ,} & {{\text{otherwise}}} \\ \end{array} } \right. $$
(27)

where randj is a random number between [0, 1], jr is a random integer between [1, K+M], ensuring that at least one parameter is crossed. The crossover probability is a parameter in the DE algorithm that controls the frequency of the crossover operation, which has an important impact on the exploration and exploitation ability of the algorithm. In order to balance the exploration and exploitation ability, this paper adopts a chaos mapping crossover probability strategy, which dynamically adjusts the size of the crossover probability according to the chaos mapping sequence. Specifically:

$$ CR_{i} = \frac{{CR_{{{\text{max}}}} - CR_{{{\text{min}}}} }}{2}\left( {1 + {\text{sin}}\left( {2\pi z_{i} } \right)} \right), $$
(28)

where CRmin and CRmax are the minimum and maximum values of the crossover rate, and zi is a chaotic sequence generated by the logistic map, whose iterative formula is:

$$ z_{i + 1} = 4z_{i} \left( {1 - z_{i} } \right). $$
(29)

(5) Selection: For each trial vector ui and its corresponding target vector xi, compare their fitness values f(ui) and f(xi). If f(ui) ≤ f(xi), then replace xi with ui; otherwise, keep xi.

$$ x_{i,G + 1} = \left\{ {\begin{array}{*{20}c} {u_{i} ,} & {f\left( {u_{i} } \right) \le f\left( {x_{i} } \right)} \\ {x_{i} ,} & {{\text{otherwise}}} \\ \end{array} } \right., $$
(30)

where G is the current iteration, and G+1 is the next iteration.

(6) Termination condition: If the maximum iteration number Gmax is reached or the fitness values of all individuals in the population are less than a certain threshold ε, then stop the iteration; otherwise, return to step (3).

(7) Output results: Choose the individual with the smallest fitness value from the final population as the optimal solution.

$$ x^{*} = {\text{arg}}\mathop {{\text{min}}}\limits_{x} f\left( x \right), $$
(31)

where x* = {d1*, d2*,…,dK*, θ1*, θ2*,…,θM*} is the optimal parameter vector. Use these parameters to construct the MKELM model, which will then be employed for short-term electric load forecasting.
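The following is a minimal sketch of this improved DE loop, assuming a user-supplied `fitness(x)` that decodes the vector x into the kernel weights d (normalized to satisfy Eq. (17)) and the kernel parameters θ, trains the MKELM, and returns the RMSE of Eq. (24); bounds handling, the fitness threshold ε, and other details are simplified, and the names are illustrative only.

```python
import numpy as np

def improved_de(fitness, dim, pop_size=50, g_max=200,
                f_min=0.1, f_max=0.9, tau=0.5, cr_min=0.1, cr_max=0.9, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, dim))                  # initial population in [0, 1]
    fit = np.array([fitness(x) for x in pop])
    z = rng.uniform(0.25, 0.75, pop_size)              # chaotic seeds for Eq. (29)

    for _ in range(g_max):
        mean_ind = pop.mean(axis=0)
        dist = np.linalg.norm(pop - mean_ind, axis=1)  # D_i, distance to the mean individual
        mu_d = dist.mean()                             # mu_D
        for i in range(pop_size):
            # Adaptive mutation factor, Eq. (26)
            f_i = f_min + (f_max - f_min) / (1.0 + np.exp(-tau * (dist[i] - mu_d)))
            # Chaotic crossover probability, Eqs. (28)-(29)
            z[i] = 4.0 * z[i] * (1.0 - z[i])
            cr_i = 0.5 * (cr_max - cr_min) * (1.0 + np.sin(2.0 * np.pi * z[i]))
            # Mutation, Eq. (25)
            a, b, c = rng.choice([k for k in range(pop_size) if k != i], 3, replace=False)
            v = np.clip(pop[a] + f_i * (pop[b] - pop[c]), 0.0, 1.0)
            # Crossover, Eq. (27): at least one component (index j_r) comes from v
            mask = rng.random(dim) <= cr_i
            mask[rng.integers(dim)] = True
            u = np.where(mask, v, pop[i])
            # Selection, Eq. (30)
            fu = fitness(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    best = int(np.argmin(fit))                         # Eq. (31)
    return pop[best], fit[best]
```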

3.4 Overall Load Forecasting Workflow

Combining the FRSCA method for feature extraction with the MKELM optimized by the improved DEA, Fig. 1 depicts the overall workflow of short-term electrical load forecasting in this study.

Fig. 1
figure 1

Load forecasting workflow in this study

4 Experimental Design and Results Analysis

4.1 Experimental Data and Platform

To validate the proposed feature selection and forecasting methods, this study obtained load, electricity price, and meteorological data from the open power system data (OPSD) platform [34]. This platform provides power system data from various European countries, which can be used for experiments related to load forecasting and power dispatching. The dataset includes power and meteorological data from European countries collected at intervals of 15, 30, and 60 min, such as actual load, forecasted load, electricity price data, average temperature, direct radiation, and diffuse radiation. For feature selection and load forecasting experiments, this paper utilizes data from the year 2017 for Great Britain (GB) on a 60-min scale, totaling 8760 data points, primarily comprising historical load data and meteorological data. Before feature selection and load forecasting, all data undergo anomaly detection and processing. The approach adopted for all anomalous and missing data in this paper is linear averaging. To reduce algorithm runtime and accelerate convergence, all types of data are normalized to the [0, 1] range.

Based on the above data, this paper constructs two types of load forecasting models. The first is a 1-h-ahead prediction, where each forecast uses actual values as input to predict the average load at the next moment. The second is day-ahead prediction, which forecasts the average load demand over the next 24 h; during day-ahead forecasting, the forecasted values replace the known actual loads in the test data as input for the model. The simulation experiments in this paper are conducted on a laptop with Windows 11 and MATLAB 2016b as the software platform, and a Core i7 2.5 GHz processor, 16 GB RAM, and a 500 GB SSD as the hardware platform.

4.2 Evaluation Metrics

To assess the accuracy and stability of the forecasting methods, this paper employs the following three evaluation metrics:

Root mean square error (RMSE):

$$ {\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} (y_{i} - t_{i} )^{2} } . $$
(32)

Mean absolute percentage error (MAPE):

$$ {\text{MAPE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} |\frac{{y_{i} - t_{i} }}{{t_{i} }}| \times 100\% . $$
(33)

R-squared value (R2_Score):

$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i} (y_{i} - t_{i} )^{2} }}{{\mathop \sum \nolimits_{i} (t_{i} - \overline{t})^{2} }}, $$
(34)

where N is the number of samples, ti is the actual value, yi is the predicted value, and \(\overline{t}\) is the average of the actual values over the test samples. For the first two metrics, smaller values indicate a more accurate forecast; an R2_Score closer to 1 indicates a more accurate forecast. A brief code sketch of the three metrics follows.
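This is a minimal sketch of Eqs. (32)–(34), assuming `y_pred` and `t_true` are NumPy arrays of equal length.

```python
import numpy as np

def rmse(y_pred, t_true):
    return np.sqrt(np.mean((y_pred - t_true) ** 2))            # Eq. (32)

def mape(y_pred, t_true):
    return np.mean(np.abs((y_pred - t_true) / t_true)) * 100   # Eq. (33), in percent

def r2_score(y_pred, t_true):
    ss_res = np.sum((y_pred - t_true) ** 2)
    ss_tot = np.sum((t_true - t_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot                               # Eq. (34)
```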

4.3 Load Feature Selection

In constructing short-term load forecasting models, the input features play a decisive role in the predictive outcome. After preprocessing the data, which includes anomaly detection and normalization, the FRSCA method described in this paper is employed to extract suitable input features. For the experimental data with a 1-h time scale, the load is treated as a time series. If the average load at time t is denoted by Lt, then Lt−1, Lt+1 and Lt−24 can represent the load data at the previous instant, the following instant, and the same instant on the previous day, respectively. To predict Lt, the candidate features selected are [Lt−1, Lt−2,…, Lt−168, Tempt, RadDirectt, RadDifft, isWeekend], where Tempt, RadDirectt, RadDifft and isWeekend represent the temperature, direct radiation, diffuse radiation, and whether it is the weekend at time t, respectively. Here, isWeekend is a binary variable where 1 indicates a weekend and 0 indicates a weekday. All candidate features are normalized to the [0, 1] interval. The objective of feature extraction is to select an appropriate number of features from the many candidates that contribute significantly to the forecasting model.

In this study, the experimental data are divided into four datasets, with the first three parts containing 2184 data points each, and the fourth part containing 2208 data points. Accordingly, the dates corresponding to each dataset are as follows: Dataset 1 corresponds to data from January 1 to March 1, 2017; Dataset 2 corresponds to dates from March 2 to July 1, 2017; Dataset 3 corresponds to dates from July 2 to September 30, 2017; Dataset 4 corresponds to dates from October 1 to December 31, 2017. Phase space reconstruction is performed on each part of the data, followed by the calculation of the fdep for each candidate feature according to the feature selection steps in Sect. 2.2. Table 1 presents the top 20 candidate features and their corresponding fdep values for each dataset.

Table 1 Features selected in the first phase by FRS

The results in Table 1 indicate that the fdep values of the selected features range between 0.6676 and 0.8237, suggesting a strong dependency relationship between these features and the target variable. Across the four datasets, temperature (Tempt) and direct radiation (RadDirectt) are almost consistently the most important features. Specifically, the temperature feature dominates in Datasets 1, 2, and 3, while the direct radiation feature stands out in Datasets 1, 3, and 4. These two factors significantly influence the load, likely because of the use of electrical devices such as air conditioning, especially during periods of high summer temperatures. Moreover, the consistent selection of the periodic features (such as Lt−24, Lt−48, and Lt−168) across the datasets indicates a clear periodicity in the load: the load at the current moment is strongly correlated with the load at the same time on previous days, reflecting the regularity of people’s daily life and work patterns. The inclusion of features with high fuzzy dependency degrees, such as the load one moment before the prediction (Lt−1) and the loads just before and after the periodic lags (such as Lt−23 and Lt−167), further confirms the autocorrelation of the load. This autocorrelation is significant for predicting future load changes from historical data.

When constructing a forecasting model, the number of input features significantly affects its predictive performance. Generally, too few input features may not express the non-linear relationship between inputs and outputs well, leading to suboptimal predictions. Conversely, too many input features can result in time-consuming training and may also decrease prediction accuracy. Therefore, to determine an appropriate number of features for the forecasting model’s input, this study selected a range from 5 to 20 features as model inputs and conducted day-ahead load forecasting experiments on Dataset 1 and Dataset 2. The results, as shown in Fig. 2, indicate that the prediction accuracy is relatively high when the number of input features is between 10 and 12. Both fewer and greater numbers of input features did not yield the best prediction outcomes.

Fig. 2
figure 2

MAPE values under different number of features in dataset1 and dataset 2

Therefore, after selecting preliminary features using the fuzzy rough set method, the correlation coefficient method described in Sect. 2.3 is employed to eliminate redundant features, ensuring the forecasting model has an appropriate number of input features. During the redundancy elimination process, the top 20 features ranked by fuzzy dependency degree were chosen as the initially screened features, with TH2 set to 0.94. Table 2 presents the final input features for each dataset obtained through the two-stage feature selection of the FRSCA method, as well as the redundant features that were eliminated. It can be observed from Table 2 that almost all periodic features were retained, while some features adjacent to the periodic lags were considered redundant and eliminated. For instance, in Dataset 1, Lt−73 and Lt−71 were considered redundant to Lt−72, and Lt−143 and Lt−145 redundant to Lt−144, and were thus removed.

Table 2 The final selected features in each dataset by FRSCA method

Additionally, for comparison purposes, Table 3 presents the features selected by four other feature selection methods (CA, MI, MFSC, and Relief) in Dataset 1 (the selection results for the other datasets are not listed, as the top features are essentially the same as in Dataset 1). To obtain a number of input features similar to that of the FRSCA method, each method selected the top 13 features based on its respective index. Comparing the results in Table 3 with those in Table 2 shows that, except for the MFSC method, which extracted the temperature feature, the other methods failed to effectively extract significant features such as temperature and radiation levels that have a substantial impact on load forecasting.

Table 3 Features selected by other methods in Dataset 1
Table 4 Comparison of the day-ahead load forecasting metric of various algorithms

4.4 Load Forecasting Results and Comparative Analysis

In this paper, all forecasting models are built based on feature selection results. The forecasting experiments are divided into two scenarios: Scenario 1 compares the outcomes of the proposed DE-MKELM method with other forecasting techniques such as traditional MKELM (referred to as MKELM hereafter), RBF network, SVR, MLP, and LSTM under the same feature inputs. It is worth noting that in the traditional MKELM, the kernel function parameters are determined via cross-validation, while the kernel function weights are fixed and averaged. For instance, in this study, four kernel functions are used, each with a weight of 0.25. In Scenario 2, the predictive performance of models is compared and analyzed by using different feature selection methods under the same forecasting technique.

In the experiments, day-ahead load forecasting and one-hour-ahead load forecasting tests are conducted for Datasets 1 to 4. Each forecasting model uses 744 training samples (equivalent to 31 days of data), and the model’s effectiveness is verified by predicting the load for the following day based on the training results. For example, Dataset 1 contains 2184 load data points, but since there are 168 total candidate load features for feature extraction, only 2016 sample pairs can be constructed from this dataset. At this point, the output samples of the first 744 training samples in the dataset correspond to the load between January 8 and February 7, 2017, while the output of the prediction samples corresponds to the load on February 8, 2017. In the DE-MKELM method, the population size N of the DE algorithm is 50, and Gmax is 200.

In Test Scenario 1, the FRSCA method was initially employed for feature extraction from the different datasets. Based on these features, the DE-MKELM method proposed in this study, along with other common forecasting models including MKELM, SVR, RBF, MLP, and LSTM, was applied to construct models for day-ahead and one-hour-ahead forecasting. Each method was independently executed 10 times to ensure the stability and reliability of the results. Figures 3 and 4 display the best single prediction results (i.e., the smallest MAPE values) obtained by each model for the different datasets across these 10 runs. Tables 4 and 5 detail the average prediction errors (MAPE and RMSE) and accuracy (R2_Score) for each method across the 10 runs in each dataset.

Fig. 3
figure 3

Day-ahead load forecasting results by different algorithms across datasets

Fig. 4
figure 4

One hour-ahead load forecasting results by different algorithms across datasets

Table 5 Comparison of the hour-ahead load forecasting metric of various algorithms

As observed from Tables 4 and 5, in day-ahead load forecasting, the proposed DE-MKELM method demonstrates the best performance across all four datasets under the same input features in terms of MAPE, RMSE, and R2_Score. It has the lowest values for MAPE and RMSE and the highest R2_Score, indicating its superior prediction accuracy, minimal error, and enhanced model interpretability. The traditional MKELM method ranks second in forecasting accuracy, demonstrating that the MKELM method has certain advantages in load forecasting; this also indicates the effectiveness of using the DE algorithm to optimize MKELM’s kernel parameters and weights. Furthermore, the performances of the SVR, RBF, and MLP methods are at a similar level, but none of them outperform the DE-MKELM. LSTM demonstrates the weakest performance in day-ahead load forecasting, especially on Dataset 1, with MAPE and RMSE values of 5.5629% and 2463.45, respectively, and an R2_Score of only 0.8089. It is further observed that while there is a significant difference in MAPE and RMSE among the various forecasting methods, the R2_Scores are not vastly different.

In the one-hour-ahead load forecasting, the proposed DE-MKELM similarly shows the best results across all four datasets, and every method’s forecasting indicators surpass those of the day-ahead forecast, which is visually evident in Figs. 3 and 4. This is because day-ahead forecasting must replace the actual values in the test input with each predicted load. The traditional MKELM method still ranks second. The one-hour-ahead forecasting indicators of the other four methods do not differ greatly, but overall the SVR, RBF, and MLP methods slightly outperform LSTM, which may be due to LSTM’s weaker suitability for this type of forecasting task or a need for further hyperparameter tuning.

In the test of Scenario 2, the same forecasting method (DE-MKELM) is applied to different feature selection methods. To ensure the fairness of the experiments, the forecast days for the four datasets are kept consistent with Scenario 1. Figures 5 and 6 present the day-ahead and hour-ahead forecasting results with the smallest MAPE obtained using different feature selection methods for the four datasets. The results indicate that the accuracy of hour-ahead forecasting is generally superior to that of day-ahead forecasting. This is primarily because hour-ahead forecasting involves only single-step predictions, where each prediction directly uses the true values as input, whereas day-ahead forecasting resembles multi-step prediction, requiring the load values obtained from each prediction to be used as input for subsequent forecasts; this can lead to error accumulation and thereby reduce prediction accuracy. Although Figs. 5 and 6 provide an intuitive comparison of forecasting results, to more comprehensively evaluate the performance of the various feature selection methods, Tables 6 and 7 detail the average values of the indicators from ten independent runs.

Fig. 5
figure 5

Day-ahead load forecasting results by different feature selection methods across datasets

Fig. 6
figure 6

Hour-ahead load forecasting results by different feature selection methods across datasets

Table 6 Comparison of day-ahead load forecasting metric of various feature selection methods

The results from day-ahead forecasting presented in Table 6 reveal that the FRSCA method exhibits unparalleled superiority in the MAPE metric across all datasets, demonstrating its high precision in load forecasting. Specifically, its MAPE values range from 1.0545% to 2.6551%, significantly lower than the other methods, particularly the Relief method, which shows the highest MAPE values. The RMSE metric further supports this finding, with FRSCA displaying the lowest errors, thereby confirming its predictive accuracy. FRSCA’s R2_Score consistently remains above 0.97 across all datasets, reflecting its robustness in explaining the variance in load data. In Table 7, for the hour-ahead load forecasting, FRSCA once again exhibits the best performance, especially in Dataset 1, where its R2_Score reaches a high of 0.9972. Despite fluctuations in MAPE and RMSE across different datasets, FRSCA maintains a high R2_Score, indicating its consistency. The Relief method shows improvement in hour-ahead forecasting compared to day-ahead forecasting but still ranks lowest among the evaluated methods.

Table 7 Comparison of hour-ahead load forecasting metric of various feature selection methods

To further validate the superiority of the methods employed in this paper, an analysis of variance (ANOVA) was conducted on the ten runs of each forecasting method, followed by the Tukey–Kramer method [35] for subsequent multiple comparisons. In Scenario 1, six forecasting models are compared, hence the numerator degrees of freedom are df1 = 6 − 1 = 5 and the denominator degrees of freedom are df2 = 6 × 10 − 6 = 54, giving a critical F-value of Fcritical = finv(1 − α, df1, df2) = 2.3861. In Scenario 2, five feature selection methods are compared, hence df1 = 5 − 1 = 4 and df2 = 5 × 10 − 5 = 45, giving Fcritical = finv(1 − α, df1, df2) = 2.5787. In both scenarios, the significance level α is set at 0.05. These critical values can be checked with the inverse F distribution, as sketched below.
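A quick check of the quoted critical values, assuming SciPy is available (the paper uses MATLAB’s finv; scipy.stats.f.ppf is the equivalent inverse F cumulative distribution function).

```python
from scipy.stats import f

alpha = 0.05
print(f.ppf(1 - alpha, 5, 54))   # Scenario 1: df1 = 5, df2 = 54 -> approximately 2.386
print(f.ppf(1 - alpha, 4, 45))   # Scenario 2: df1 = 4, df2 = 45 -> approximately 2.579
```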

In test Scenario 1, the ANOVA results of each dataset are shown in Table 8. It provides the F-Value and p-Value for each dataset in the day-ahead load forecasting and hour-ahead load forecasting under this scenario. Similarly, in Test Scenario 2, the ANOVA test results for each dataset are also given in Table 9.

Table 8 The ANOVA results in test Scenario 1 (α = 0.05, Fcritical = 2.3861)
Table 9 The ANOVA results in test Scenario 2 (α = 0.05, Fcritical = 2.5787)

From Tables 8 and 9, it can be observed that in both test scenarios (Scenario 1 and Scenario 2), whether it is day-ahead forecasting or one-hour-ahead forecasting, the ANOVA results across different datasets show that the F-values are significantly larger than the Fcritical values, and the p-values are much smaller than the significance level α (0.05). This indicates that there are significant differences between different forecasting models and feature selection methods. However, despite the ANOVA results indicating these significant differences, they do not directly imply that the DE-MKELM and FRSCA methods proposed in this paper perform significantly better than the other methods. Therefore, further multiple comparison analyses are needed to clarify this. To begin, we will analyze the multiple comparison results of day-ahead forecasting in Test Scenario 1.

In Figs. 7 and 8, the results of the multiple comparisons for the day-ahead forecasting models using the MAPE and RMSE metrics are illustrated. Using the proposed DE-MKELM as the reference baseline (red dashed line), the other five methods—MKELM, SVR, RBF, MLP, and LSTM—were compared against it, and the mean differences and Tukey–Kramer confidence intervals (CIs) for these five methods were plotted in each sub-figure. If the CI line of a comparison method does not intersect the baseline, the results of the DE-MKELM method differ significantly from those of that method. It can therefore be observed that the DE-MKELM method demonstrates a significant advantage in both MAPE and RMSE compared to the other algorithms, validating the significance and superiority of the proposed method in terms of forecasting accuracy.

Fig. 7
figure 7

ANOVA and multiple comparison results of MAPE metric in day-ahead prediction for various algorithms

Fig. 8
figure 8

ANOVA and multiple comparison results of RMSE metric in day-ahead prediction for various algorithms

In Fig. 9, the multiple comparison of the R2_Score metric obtained from day-ahead predictions using the different forecasting models is also presented. From Table 8, it can be seen that although there are significant differences among the six algorithms across all four datasets (with p < 0.05 and F-Value > 2.3861), the multiple comparisons of the R2_Score metric in Fig. 9 indicate that confidence interval lines intersect the reference baseline in all four datasets. For instance, the confidence interval lines for MKELM, SVR, RBF, and MLP in Dataset 1, as well as MKELM and MLP in Dataset 2, intersect the reference baseline. This suggests that their R2_Score results do not differ significantly from those of the proposed DE-MKELM. This observation is also reflected in Table 4, where the average R2_Scores for MKELM, SVR, RBF, and MLP in Dataset 1 are 0.9947, 0.9639, 0.9717, and 0.9660, respectively, showing no marked difference. In fact, the R2_Score measures the predicted values’ ability to explain the variability of the actual values, rather than simply measuring the gap between the predicted and actual values. A higher R2_Score indicates that the model captures the data trend well, but this does not necessarily mean that the model’s predictions are accurate (i.e., that the error is small). For example, a model may fit the data well with a high R2_Score but still have a large mean absolute percentage error (MAPE) and root mean square error (RMSE). Conversely, another model may have a small prediction error but a low R2_Score because it does not capture all the variability in the data as effectively.

Fig. 9
figure 9

ANOVA and multiple comparison results of R2_Score metric in day-ahead prediction for various algorithms

To enhance the understanding of the efficiency of the methods proposed in this study, this paper discusses in detail the multiple comparison outcomes for the hour-ahead predictions in Scenario 1 and the two types of predictions in Scenario 2, with the corresponding illustrations given in Figs. 10, 11, 12, 13, 14 and 15. It is noteworthy that, since the R2_Score primarily evaluates the fit of variability between predicted and actual values, its ANOVA results are not discussed in this analysis; the exploration of algorithm performance is conducted solely on the MAPE and RMSE metrics.

Fig. 10
figure 10

ANOVA and multiple comparison results of MAPE in hour-ahead prediction for various algorithms

Fig. 11
figure 11

ANOVA and multiple comparison results of RMSE in hour-ahead prediction for various algorithms

Fig. 12
figure 12

ANOVA and multiple comparison results of MAPE in day-ahead prediction for various feature selection methods

Fig. 13
figure 13

ANOVA and multiple comparison results of RMSE in day-ahead prediction for various feature selection methods

Fig. 14
figure 14

ANOVA and multiple comparison results of MAPE in hour-ahead prediction for various feature selection methods

Fig. 15
figure 15

ANOVA and multiple comparison results of RMSE in hour-ahead prediction for various feature selection methods

Figures 10 and 11 present the ANOVA and multiple comparison results for the forecasting algorithms in hour-ahead prediction in Scenario 1, based on the MAPE and RMSE metrics. These results indicate that hour-ahead predictions exhibit even more significant differences than day-ahead load forecasting, as evidenced by lower p-values and higher F-values. The multiple comparisons across the various datasets underscore the significant advantages of the proposed DE-MKELM algorithm over the five reference algorithms. Additionally, Figs. 12, 13, 14 and 15 detail the ANOVA results for the MAPE and RMSE of day-ahead and hour-ahead predictions under Scenario 2 with different feature selection methods, offering insights into the role of feature selection strategies in improving the accuracy of forecasting models.

The results in Figs. 12, 13, 14 and 15 indicate that, in the comparison of feature selection algorithms, the tests across all datasets show small p-values and large F-values, signifying statistically significant differences between the outcomes of the feature selection methods. The multiple comparison results, for both day-ahead and hour-ahead predictions, consistently demonstrate that the FRSCA method proposed in this paper significantly outperforms the other four feature selection methods across all datasets. Moreover, in the predictions of Scenario 2, the confidence intervals for the compared feature selection methods are shorter, particularly in Datasets 1 and 2. The length of a confidence interval is commonly used to measure the uncertainty of prediction results: shorter confidence intervals imply more concentrated prediction outcomes, indicating more consistent and stable model predictions. This phenomenon therefore suggests that, under the same input feature conditions, the variance among the ten prediction results of the proposed DE-MKELM method is smaller, further proving the strong stability and reliability of the proposed DE-MKELM method.

5 Conclusion

This study proposes a short-term electric load forecasting method that integrates FRS theory with an MKELM optimized by a DEA (DE-MKELM). The introduction of FRS theory offers an effective feature selection tool that enhances feature relevance and reduces computational complexity through a two-stage filtering process. The proposed DE-MKELM constructs a robust forecasting model, with a multi-kernel structure and a parameter optimization strategy that adapts well to feature variations.

Simulation experiments utilized actual electric load data from the OPSD platform for validation. Results indicate that the proposed DE-MKELM method excels in short-term electric load forecasting, showing significant advantages over common prediction methods in MAPE, RMSE, and R2_Score metrics. The FRSCA method captures non-linear relationships and interactions between features, extracting periodic, short-term trends, temperature, and radiation features while eliminating redundant features. Results confirm that features extracted using the proposed method achieve superior prediction accuracy. ANOVA and multiple comparisons further validate the statistical significance and superiority of the proposed methods.

Future research may optimize feature selection strategies and prediction models, including the introduction of Deep Learning (DL) methods and predictions for complex electric power system data, encompassing distributed generation and renewable energy integration. These efforts aim to provide robust technical support for the stable operation and intelligent management of power systems.