Introduction

Time series forecasting (TSF) is the process of analyzing time series data with statistical and modeling techniques to make predictions and inform strategic decision-making. TSF plays a vital role in various domains, especially financial management1,2, social networks3,4,5, medical science6,7,8, and industrial engineering9,10. Given this widespread application, there is a growing consensus that enhancing the accuracy and interpretability of TSF is of great importance.

Conventional techniques for time series analysis assume a linear relationship between past and future values for prediction. This class of models is represented by linear regression approaches such as ARIMA-based models11, which have shown good results for short-term prediction. However, their performance deteriorates significantly when the parameters are not selected properly, and ARIMA-based models are not suitable for time series with weak periodic characteristics6,8. Chen et al.5 propose a dynamic linear model that extracts systematic time-series dynamics and volatility features to achieve more accurate prediction, with some success. Today, deep learning-based models have achieved great success, especially in long-term time series prediction tasks, where they significantly outperform conventional linear models. Artificial neural networks (ANNs) have been powerful tools for TSF by virtue of their universal approximation and nonlinear modeling capabilities12,13. Recurrent neural networks (RNNs)14 use internal memory cells to handle temporal data and have been employed to model historical and future states15,16,17. However, the vanishing gradient problem limits the performance of RNNs. To overcome this problem, long short-term memory (LSTM) introduces “gates” that determine which information to remember and which to forget, so as to retain the long-term memory of historical data. Ma et al.18 combine LSTM and bidirectional LSTM networks for transportation prediction. Bandara et al.19 propose a decomposition-based unified network architecture (LSTM-MSNet) to predict multiple seasonal time series. LSTM-based models, however, are limited in their ability to effectively utilize the latest data and to accurately model long sequences and cycles in time series. In addition, the performance of LSTM for long-term prediction is hindered by the amplification of small errors inherent in the model. Transformer-based time series models are currently a popular research direction, and their modeling capability far exceeds that of traditional neural networks. Their inherent advantages in processing and predicting long sequences make them perform excellently in most temporal tasks20,21. However, their \(O(n^2)\) complexity leads to an explosive increase in memory usage on long sequences, so current research mostly focuses on improving algorithmic efficiency and reducing complexity22,23,24.

Fuzzy systems are characterized by universal approximation capability and outstanding interpretability, providing an effective paradigm for handling uncertain data, representing latent knowledge, and exposing the inference process25. Some works attempt to enhance deep learning-based models by embedding fuzzy set theory, such as the fuzzy deep convolutional neural network26, the deep fuzzy echo state network27, and the fuzzy recurrent neural network28. For time series forecasting in particular, related works integrate the fuzzy system with LSTM to combine the advantages of fuzzy logic and deep learning. Li et al.29 propose a Type-2 fuzzy LSTM neural network for traffic volume prediction. Tang et al.30 propose a granular time series forecasting model by integrating trend fuzzy granules with an LSTM network. The innovations of these models lie mainly in taking fuzzy information as input data and training the network’s parameters with a fuzzy system. However, the overall structure of LSTM remains unchanged, and the interpretability of the fuzzy system is weakened to some extent.

The extraction of fuzzy rules from training data is a crucial component in modeling a fuzzy system. The Wang–Mendel (WM) model is a powerful tool for extracting fuzzy rules directly, using only one pass over the training data31,32. However, its effectiveness can be greatly degraded by the excessive generation of fuzzy rules. To overcome this issue, an improved WM model utilizing fuzzy c-means was proposed33, but determining the number of clusters remains a challenging task. Zhai et al.34 propose an online WM fuzzy inference model that adaptively acquires fuzzy rules from training data. However, its performance can be limited by redundant rules and by the absence of rules for regions not covered by the training data.

Taking all the above observations into consideration, this paper proposes a fuzzy inference-based LSTM for time series forecasting, which enhances the accuracy and interpretability of LSTM by embedding a fuzzy system. To improve the computational efficiency and completeness of the WM model, a fuzzy rule base construction method based on the WM model is proposed. Then, a fuzzy prediction model based on the improved WM model is constructed. Finally, the fuzzy inference-based LSTM is proposed to carry out prediction by integrating fuzzy prediction fusion, a strengthening memory layer, and a parameter segmentation sharing strategy into the LSTM network. In summary, the main contributions of this work are as follows:

  1. (i)

    A fast and complete fuzzy rule construction method based on the WM model is proposed, which enhances the computational efficiency and completeness of the WM model through fuzzy rule simplification and complement strategies.

  2. (ii)

    A strengthening memory layer is constructed by integrating the current output with the cell state, which strengthens the long-term memory and alleviates the gradient dispersion problem of LSTM.

  3. (iii)

    A parameter segmentation sharing strategy that divides the overall output layer into different parts is proposed, which balances processing efficiency and architecture discrimination.

  4. (iv)

    A fuzzy inference-based LSTM embedding a fuzzy system is proposed, which enhances the accuracy and interpretability of LSTM for long-term time series prediction.

  5. (v)

    Extensive experiments demonstrate the superior performance of the proposed method in comparison with related models.

Prerequisites

This paper focuses on improving the interpretability and accuracy of deep neural networks based on a fuzzy inference model for the time series prediction problem. This section introduces the two related methods, LSTM and the WM model.

Long Short-Term Memory Neural Network (LSTM)

RNNs have achieved good performance in processing and learning time series information, but they cannot successfully learn long-term dependencies due to the gradient explosion or vanishing gradient problems. LSTM is an extension of the RNN that introduces “gate” cells to retain and learn long-term dependencies. An LSTM network can capture important features from the inputs and store this information over a long period of time, and has therefore achieved good results in long-term forecasting. In general, the critical components of the LSTM architecture consist of three gates: the forget, input, and output gates, denoted by f, i, and o, respectively. The calculation procedure for each gate is described as follows:

(1) Forget Gate. Determine what information needs to be retained in the memory cell with the help of sigmoid function. The output is expressed as follows:

$$\begin{aligned} f_t = \sigma (W_{fx}\cdot x_t+W_{fh} \cdot h_{t-1}+b_f) \end{aligned}$$
(1)

where \(x_t\) and \(h_{t-1}\) represent input and hidden state at time step t and \(t-1\), respectively. W represents weight matrices, \(b_f\) represents a constant bias, and \(\sigma (\cdot )\) represents sigmoid function.

(2) Input Gate. Determine whether the new information should be saved to the memory cell by the sigmoid layer and tanh layer. The outputs of the two layers are computed in the following form:

$$\begin{aligned} i_t = \sigma (W_{ix}\cdot x_t+W_{ih} \cdot h_{t-1}+b_i) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{c}_t = \textrm{tanh}(W_{cx}\cdot x_t+W_{ch} \cdot h_{t-1}+b_c) \end{aligned}$$
(3)

The update of the memory cell is achieved by the combination of these two layers, where the current memory is obtained by retaining previous information and introducing new cell state information. The mathematical equation is expressed in the following form:

$$\begin{aligned} c_t=f_t \circ c_{t-1}+i_t \circ \tilde{c}_t \end{aligned}$$
(4)

where \(c_{t}\) represents cell state at time step t, \(\circ\) denotes the Hadamard product.

(3) Output Gate. Determine what part of the memory contributes to the current output and map the output between \(-1\) and 1 by the tanh function. The outputs can be computed by the following equations:

$$\begin{aligned} o_t = \sigma (W_{ox}\cdot x_t+W_{oh} \cdot h_{t-1}+b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t=o_t \cdot \textrm{tanh}(c_t) \end{aligned}$$
(6)
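To make the gate computations concrete, the following minimal NumPy sketch implements one LSTM recurrence step following Eqs. (1)–(6); the weight dictionary keys, shapes, and helper names are illustrative assumptions rather than the notation of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM recurrence step; W maps gate names to weight matrices, b to biases."""
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])       # Eq. (1): forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])       # Eq. (2): input gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])   # Eq. (3): candidate state
    c_t = f_t * c_prev + i_t * c_tilde                             # Eq. (4): memory update (Hadamard products)
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])       # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)                                       # Eq. (6): hidden state
    return h_t, c_t
```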

Wang–Mendel model

The WM model is a simple and powerful tool for generating a fuzzy rule base from sample data. However, its effectiveness can be greatly degraded when the amount of data is huge: each training sample generates a fuzzy rule, so the rule extraction strategy is not efficient enough. Improving the rule generation mechanism is therefore crucial, and a fast and complete fuzzy rule base construction method based on the WM model will be proposed. A simplification strategy for redundant rules and conflict rules is proposed to simplify the fuzzy rule base, and a complement strategy is proposed to complete it.

Fuzzy rules extraction. Given the time series, define the lengths of the antecedents and consequents of the fuzzy rules, and define several fuzzy subsets for each antecedent variable in order to extract input-output sample pairs. Each feature of an input-output sample pair is assigned to the fuzzy set with the highest membership degree; these membership degrees are used to compute the weight of the fuzzy rule, and finally an unorganized fuzzy rule base is generated.

Fuzzy rule arrangement. When the sample size is large, redundant rules are easily generated. To solve this problem, before adding a rule to the rule base, first check whether its antecedents already exist in the rule base. If they do not, add the rule; otherwise, keep only the rule with the highest weight.

Fuzzy rule-based prediction. A center-average defuzzification inference machine is used to organize rules with the same antecedents in the fuzzy rule base and to obtain the consequent of the rules for fuzzy inference, yielding the final fuzzy rule inference base.

The drawback of this model is that the generated fuzzy rule base lacks completeness and robustness, which results in low model accuracy. Therefore, to improve accuracy, the construction method of the fuzzy rule inference system needs to be optimized so that a complete system can be built quickly.

The proposed fuzzy prediction model

The construction of the fuzzy rule base is crucial for a fuzzy rule-based prediction model. The fuzzy rule base constructed with the WM model may contain redundant rules and may lack corresponding rules for new samples that fall into fuzzy regions not covered by the training data. To improve the computational efficiency and completeness of the WM model, a fast and complete fuzzy rule base construction method based on the WM model is proposed, and prediction is then performed based on this rule base. The framework of the proposed fuzzy prediction model is shown in Fig. 1. In what follows, we explain the detailed steps of the proposed model.

Figure 1. Framework of the proposed fuzzy prediction model.

Fuzzy rules extraction

Given the time series \(T=\{x_{1},x_{2},\ldots , x_{n}\}\), each input-output sample pair for training can be constructed as \(\{x_{i},x_{i+1},\ldots , x_{i+h-1}, y_{i}\}\), \(i=1, 2, \ldots , n-h\), where \(\{x_{i},x_{i+1}, \ldots , x_{i+h-1}\}\) is the input sample, h is the length of the input sample, and \(y_{i}=x_{i+h}\) is the output sample. The domain of discourse is divided into q regions, and the triangular fuzzy sets \(A_{1}, A_{2},\ldots , A_{q}\) are defined on these regions, as shown in Fig. 2.

Figure 2. Triangular fuzzy sets.

Each feature of an input-output sample pair is assigned to the defined fuzzy set in which it has the highest membership degree, i.e., \(x_{i}\) is fuzzified into \(A_{1,i}\) with membership degree \(U_{1,i}\). The fuzzy rules can then be extracted using the WM method as follows:

$$\begin{aligned} {{\textbf {Rule}}}\ R_i: {{\textbf {IF}}}\ x_{i}\ is \ A_{1,i} \ \textrm{and} \ \cdots \ \textrm{and} \ x_{i+h-1} \ is \ A_{h,i}, {{\textbf {THEN}}} \ y_i \ is \ A_{y,i} \end{aligned}$$
(7)

where \(A_{j,i}\) is the jth antecedent and \(A_{y,i}\) is the consequent. Rules generated from the training data are called data-generated rules, and the fuzzy rule base can be constructed and denoted as \(R=\{R_1, R_2, \ldots , R_{n}\}\).
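To illustrate this extraction step, the sketch below builds data-generated rules from a toy series; the evenly spaced triangular sets, the window length h, the number of regions q, and the toy series itself are illustrative assumptions.

```python
import numpy as np

def triangular_memberships(x, centers):
    """Membership degrees of a scalar x in q evenly spaced triangular fuzzy sets."""
    width = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs(x - np.asarray(centers)) / width)

def extract_rule(sample_in, sample_out, centers):
    """One WM data-generated rule: antecedent/consequent set indices and the rule weight."""
    ante, weight = [], 1.0
    for x in sample_in:
        mu = triangular_memberships(x, centers)
        j = int(np.argmax(mu))            # fuzzy set with the highest membership degree
        ante.append(j)
        weight *= float(mu[j])            # rule weight as the product of memberships, cf. Eq. (8)
    cons = int(np.argmax(triangular_memberships(sample_out, centers)))
    return tuple(ante), cons, weight

# Sliding-window input-output pairs from a toy series T (antecedent length h, q regions).
T = np.sin(np.linspace(0, 6, 50))
h, q = 3, 7
centers = np.linspace(T.min(), T.max(), q)
rules = [extract_rule(T[i:i + h], T[i + h], centers) for i in range(len(T) - h)]
```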

Fuzzy rules simplification

When the sample set is massive, a large number of fuzzy rules are generated, many of which share the same characteristics, namely redundant rules and conflict rules. Redundant rules are rules with the same antecedents and consequent, and conflict rules are rules with the same antecedents but different consequents. To simplify the fuzzy rule base, the following simplification strategy for redundant rules and conflict rules is proposed:

(1) Redundant rules simplification. Find a group of data-generated rules that have the same antecedents and consequents, keep only one fuzzy rule among them, and delete the rest of the group from the fuzzy rule base.

(2) Conflict rules simplification. Find a group of data-generated rules that have the same antecedents but different consequents, and integrate the information of all fuzzy rules in the group to generate a new fuzzy rule. Delete the group from the fuzzy rule base and add the new fuzzy rule.

The process of conflict rules simplification is explained as follows. Assume the group found is \(R_{1}^{\prime }, R_{2}^{\prime }, \ldots , R_{m}^{\prime }\); the fuzzy rule \(R_i^{\prime }\) can be expressed as:

$$\begin{aligned} {{\textbf {Rule}}}\ R_i^{\prime }: {{\textbf {IF}}}\ x_{i1}\ \textrm{is} \ A_{i1} \ \textrm{and} \ \cdots \ \textrm{and} \ x_{ih} \ \textrm{is} \ A_{ih}, {{\textbf {THEN}}} \ y_i \ \textrm{is} \ A_{y_{i}} \end{aligned}$$

The weight of each fuzzy rule \(R_i^{\prime }\) can be computed by the product of membership function values for each antecedent:

$$\begin{aligned} W_i= \Pi _{j=1}^{h} U_{j,i} \end{aligned}$$
(8)

where \(U_{j,i}\) is the membership degree of \(x_{ij}\) to \(A_{ij}\). Then the crisp value \(\hat{y}\) can be obtained by using the center-average defuzzification mechanism:

$$\begin{aligned} \hat{y}=\frac{\sum _{i=1}^{m} W_i\cdot \bar{y}_{i}}{\sum _{i=1}^{m}W_i} \end{aligned}$$
(9)

where \(\bar{y}_{i}\) is the central value of fuzzy set \(A_{y_i}\). Assuming that \(A_{\hat{y}}\) is the fuzzy set on which \(\hat{y}\) achieves the maximum membership, the new fuzzy rule is generated as follows:

$$\begin{aligned} {{\textbf {Rule}}}\ R: {{\textbf {IF}}}\ x_{i1}\ {\textrm{is}} \ A_{i1} \ {\textrm{and}} \ \cdots \ {\textrm{and}} \ x_{ih} \ {\textrm{is}} \ A_{ih}, {{\textbf {THEN}}} \, {{\hat{y}}_{i}} \, {\text{is}} \, {{A}_{\hat{y}}} \end{aligned}$$
(10)
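The simplification of redundant and conflict rules can be sketched as follows, reusing the rule representation of the previous sketch; because the triangular sets of Fig. 2 are evenly spaced, the set on which \(\hat{y}\) achieves the maximum membership is simply the one whose center is nearest to \(\hat{y}\).

```python
from collections import defaultdict
import numpy as np

def simplify_rules(rules, centers):
    """Merge redundant and conflict data-generated rules (Eqs. 8-10).

    `rules` holds (antecedent_indices, consequent_index, weight) triples; the result
    maps each distinct antecedent combination to a single consequent index."""
    centers = np.asarray(centers)
    groups = defaultdict(list)
    for ante, cons, w in rules:
        groups[ante].append((cons, w))
    merged = {}
    for ante, members in groups.items():
        ws = np.array([w for _, w in members])
        ys = np.array([centers[c] for c, _ in members])          # central values of the consequent sets
        y_hat = float(np.sum(ws * ys) / np.sum(ws))              # Eq. (9): center-average defuzzification
        merged[ante] = int(np.argmin(np.abs(centers - y_hat)))   # Eq. (10): nearest-center fuzzy set
    return merged
```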

Fuzzy rules complement

The fuzzy rules are extracted from the fuzzy regions that contain sample data, thus the data-generated fuzzy rule base is in general not complete. To extrapolate the data-generated fuzzy rule base over the regions not covered by these obtained rules, the fuzzy rule base should be complemented to cover the whole domain of discourse. Especially for the forecasting problem, a complete fuzzy rule base is crucial because the rules should be well-defined at all samples in the domain of discourse. To complement the fuzzy rule base, the complement strategy is proposed as the following three steps.

Step 1) For each combination of antecedents that does not appear in the fuzzy rule base, find the groups of data-generated fuzzy rules that differ from the combination in exactly i antecedents, and call each such group the i-group. Determine the first non-empty group, i.e., the t-group.

Step 2) For all fuzzy rules in t-group, compute:

$$\begin{aligned} \hat{y}=\frac{\sum _{i=1}^{n_{t}} \bar{y}^{i}}{n_{t}} \end{aligned}$$
(11)

where \(n_{t}\) is the number of fuzzy rules in the t-group, and \(\bar{y}^{i}\) is the central value of the fuzzy set forming the consequent of the ith fuzzy rule in the t-group.

Step 3) Find the fuzzy set \(A_{\hat{y}}\) on which \(\hat{y}\) achieves the maximum membership. Assuming that the combination of antecedents is “\(x_{i1}\ \textrm{is} \ A_{i1} \ \textrm{and} \ \cdots \ \textrm{and}\) \(\ x_{ih} \ \textrm{is} \ A_{ih}\)”, the extrapolating rule is generated as:

$$\begin{aligned} {{\textbf {Rule}}}\ R: {{\textbf {IF}}}\ x_{i1}\ {\textrm{is}} \ A_{i1} \ {\textrm{and}} \ \cdots \ {\textrm{and}} \ x_{ih} \ {\textrm{is}} \ {A}_{ih}, {{\textbf {THEN}}} \, {\hat{y}} \, {\text{is}} \, {{A}_{\hat{y}}} \end{aligned}$$
(12)

The process is repeated until all the extrapolating rules are constructed. The complete fuzzy rule base can be obtained by integrating the extrapolating rules and data-generated rules.
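A direct, brute-force sketch of the complement strategy is given below; it enumerates all \(q^h\) antecedent combinations, so it is only meant to illustrate Steps 1-3 for small q and h, and it assumes the `merged` rule dictionary and `centers` from the previous sketches.

```python
from itertools import product
import numpy as np

def complement_rules(merged, centers, h, q):
    """Extrapolate rules for antecedent combinations not covered by the data (Eqs. 11-12)."""
    centers = np.asarray(centers)
    complete = dict(merged)
    for ante in product(range(q), repeat=h):        # every possible antecedent combination
        if ante in complete:
            continue
        for t in range(1, h + 1):                   # smallest t with a non-empty t-group
            group = [cons for a, cons in merged.items()
                     if sum(ai != bi for ai, bi in zip(a, ante)) == t]
            if group:
                y_hat = float(np.mean(centers[group]))                    # Eq. (11)
                complete[ante] = int(np.argmin(np.abs(centers - y_hat)))  # Eq. (12)
                break
    return complete
```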

Fuzzy rule-based prediction

Let \(\{x_{n-h+1}, x_{n-h+2}, \ldots , x_{n}\}\) be the testing sample, where each feature is fuzzified into a fuzzy set \(A_{i}^{\prime }\). The antecedent of the fuzzy rule is obtained as “\(x_{n-h+1}\ \textrm{is} \ A_{1}^{\prime } \ \textrm{and} \ \cdots \ \textrm{and} \ x_{n} \ \textrm{is} \ A_{h}^{\prime }\)”, and the matching fuzzy rule can be retrieved from the fuzzy rule base as:

$$\begin{aligned} {{\textbf {Rule}}}\ R: {{\textbf {IF}}}\ x_{n-h+1}\ \textrm{is} \ A_{1}^{\prime } \ \textrm{and} \ \cdots \ \textrm{and} \ x_{n} \ \textrm{is} \ A_{h}^{\prime }, {{\textbf {THEN}}} \ y \ \textrm{is} \ A_{y}^{\prime } \end{aligned}$$
(13)

The predicted value is obtained as \(\hat{y}=y^{\prime }\), where \(y^{\prime }\) is the center of the fuzzy set \(A_{y}^{\prime }\).
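Under the same assumptions as the previous sketches (the `triangular_memberships` helper, the complete rule dictionary, and the set centers), the prediction step amounts to a single rule lookup:

```python
import numpy as np

def fuzzy_predict(test_window, rule_base, centers):
    """Fuzzify the latest h observations, look up the matching rule (Eq. 13),
    and return the center of its consequent fuzzy set as the prediction."""
    ante = tuple(int(np.argmax(triangular_memberships(x, centers))) for x in test_window)
    # the complement strategy guarantees that every antecedent combination has a rule
    return float(np.asarray(centers)[rule_base[ante]])
```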

In this section, an improved Wang–Mendel model for the rapid construction of fuzzy inference systems has been proposed. It remedies, in a simple way, the incompleteness of the fuzzy rule inference base that the original Wang–Mendel model may produce, so that a complete fuzzy rule inference base is built without much extra time or computational overhead.

The addition of the fuzzy prediction module affects the computational efficiency of the model. In the experiments, we improve the computational efficiency of the fuzzy prediction module as much as possible through the following measures.

1) The data in the input part of the experiment is fixed. To reduce the calculation cost of the fuzzy rule module, the construction of the fuzzy rule base is performed offline in advance.

2) For the data in the prediction part of the experiment, a branch-and-bound search algorithm is used to reduce the computational cost of looking up the corresponding rules in the fuzzy prediction inference base.

Fuzzy inference-based LSTM for time series prediction

In this section, the fuzzy inference-based LSTM (FLSTM) for time series forecasting is proposed. The proposed method incorporates fuzzy prediction fusion, a strengthening memory layer, and a parameter segment sharing strategy into the LSTM network. The fuzzy prediction fusion module combines the fuzzy prediction with the three gates of LSTM to enhance the fuzzy reasoning capacity of the network. The strengthening memory layer integrates the hidden state and the cell state to strengthen long-term memory. The parameter segment sharing strategy divides the overall output layer into different parts to balance processing efficiency and architecture discrimination. The proposed forecasting model is shown in Fig. 3 and described in detail in the following sections.

Figure 3. Framework of FLSTM model.

Fuzzy prediction fusion

The fuzzy prediction model is embedded in the LSTM to enhance the network’s reasoning capability and interpretability. Fuzzy rules capture the dynamic characteristics of data change, and the reasoning relationship between the latest information and the historical information is extracted in the form of rules. The fuzzy prediction model can take full advantage of the latest information to predict future behavior. Therefore, combining the LSTM with the fuzzy prediction model can effectively overcome the former’s limited utilization of the latest data.

LSTM utilizes gate cells to control the information flow in the recurrent computation. Therefore, the input gate, forget gate, and output gate are combined with the fuzzy prediction to produce new outputs, which integrates the fuzzy prediction information into the recurrent computation. The mathematical expressions are as follows:

$$\begin{aligned} f^{(f)}_t = \sigma (W_{fx}\cdot x_t+W_{fh} \cdot h_{t-1}+W_{ff} \cdot r_{t}+b_f) \end{aligned}$$
(14)
$$\begin{aligned} i^{(f)}_t = \sigma (W_{ix}\cdot x_t+W_{ih} \cdot h_{t-1}+W_{if} \cdot r_{t}+b_i) \end{aligned}$$
(15)
$$\begin{aligned} o^{(f)}_t = \sigma (W_{ox}\cdot x_t+W_{oh} \cdot h_{t-1}+W_{of} \cdot r_{t}+b_o) \end{aligned}$$
(16)

where \(r_{t}\) is the output of the fuzzy prediction model at time step t, and \(W_{ff}, W_{if}, W_{of}\) are the weight matrices of \(r_{t}\) for the forget gate, input gate, and output gate, respectively.

After the model is trained, these weights represent the strengths of the fuzzy rules in the different gates, so the proposed input, forget, and output gates make the results more interpretable. Meanwhile, fusing the fuzzy prediction information into the recurrent process speeds up the convergence of training.
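A compact PyTorch sketch of the fused gates is given below; concatenating \([x_t, h_{t-1}, r_t]\) and applying one linear layer per gate is algebraically equivalent to the separate weight matrices in Eqs. (14)-(16). The assumption that \(r_t\) is a single scalar prediction per step, and all dimension names, are illustrative.

```python
import torch
import torch.nn as nn

class FuzzyFusionGates(nn.Module):
    """Forget/input/output gates with the fuzzy prediction r_t fused in (Eqs. 14-16)."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        # one linear map per gate over the concatenation [x_t, h_{t-1}, r_t]
        self.W_f = nn.Linear(d_in + d_hid + 1, d_hid)
        self.W_i = nn.Linear(d_in + d_hid + 1, d_hid)
        self.W_o = nn.Linear(d_in + d_hid + 1, d_hid)

    def forward(self, x_t, h_prev, r_t):
        z = torch.cat([x_t, h_prev, r_t], dim=-1)
        return (torch.sigmoid(self.W_f(z)),   # f_t^(f), Eq. (14)
                torch.sigmoid(self.W_i(z)),   # i_t^(f), Eq. (15)
                torch.sigmoid(self.W_o(z)))   # o_t^(f), Eq. (16)
```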

Strengthening memory layer

LSTM can learn long-term dependencies through its deliberate design, whose critical component is the memory cell. To strengthen the long-term memory and alleviate the gradient dispersion problem of LSTM, the output needs to be determined by both the current output and the cell state, so the strengthening memory layer is proposed. In this layer, the current output and the cell state are first combined into a new unit. Then, a Conv1d convolution followed by the tanh function is used to extract more effective features and form the new memory cell. Finally, the output is generated by adding the current and new cell states, computed as follows:

$$\begin{aligned} ch_t=h_t+c_t \end{aligned}$$
(17)
$$\begin{aligned} s_t=\textrm{tanh}({ \mathrm Conv1d}([ch_t])) \end{aligned}$$
(18)
$$\begin{aligned} \hat{h}_t=ch_t+s_t \end{aligned}$$
(19)

Owing to the addition of the new state, the latest information is strengthened, and the additional features allow more information to be preserved. The two kinds of feature information are combined by summation, which strengthens the impact of the new state on the final result and makes the results more comprehensive.
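A possible PyTorch sketch of the strengthening memory layer is shown below; the kernel size and the treatment of the 64 hidden units as Conv1d channels (as in the experimental settings) are assumptions, since the paper only fixes the number of input and output channels.

```python
import torch
import torch.nn as nn

class StrengtheningMemoryLayer(nn.Module):
    """Sketch of Eqs. (17)-(19): fuse hidden and cell states, refine with Conv1d + tanh."""
    def __init__(self, d_hid=64):
        super().__init__()
        self.conv = nn.Conv1d(d_hid, d_hid, kernel_size=3, padding=1)

    def forward(self, h_t, c_t):
        ch_t = h_t + c_t                                               # Eq. (17)
        s_t = torch.tanh(self.conv(ch_t.unsqueeze(-1))).squeeze(-1)    # Eq. (18), hidden units as channels
        return ch_t + s_t                                              # Eq. (19): strengthened output
```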

Parameter segment sharing strategy

Parameter sharing is a common way to control the number of model parameters and thereby improve efficiency: it reduces the number of parameters the model has to learn, which makes processing more efficient. However, it also results in coupled optimization among different candidates, making architectures less discriminative. Therefore, a parameter segment sharing strategy is proposed for LSTM to achieve a better trade-off between processing efficiency and architecture discrimination. Let the prediction length be L and the number of shared parameters be s; then \(k=L/s\) output layers are constructed for prediction. Different output layers capture temporal features from different time periods, which improves architecture discrimination, while each output layer with s shared parameters preserves processing efficiency. The output layer can be expressed as:

$$\begin{aligned} y_{t}=W_{yk}\cdot \hat{h}_t+b_{y} \end{aligned}$$
(20)

where \(y_{t}\) is the forecast result, \(W_{yk}\) is the weight matrix of the kth output layer, \(\hat{h}_t\) is the output of the strengthening memory layer, and \(b_y\) is the bias.
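A sketch of the segment-shared output heads is given below; splitting the L forecast steps into \(k = L/s\) segments, each with its own linear head shared by the s steps inside it, is one reading of the strategy, and the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SegmentSharedOutput(nn.Module):
    """k = L/s output heads; all forecast steps in the same segment share one head (Eq. 20)."""
    def __init__(self, d_hid, pred_len, segment_len):
        super().__init__()
        assert pred_len % segment_len == 0
        self.segment_len = segment_len
        self.heads = nn.ModuleList(
            [nn.Linear(d_hid, 1) for _ in range(pred_len // segment_len)]
        )

    def forward(self, h_hat_t, step):
        """`step` indexes the forecast horizon; `h_hat_t` is the strengthened state."""
        head = self.heads[step // self.segment_len]
        return head(h_hat_t)          # Eq. (20): y_t = W_yk · h_hat_t + b_y
```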

FLSTM model

FLSTM is based on the LSTM model and integrates the fuzzy system to leverage the advantages of both fuzzy logic and deep learning. FLSTM combines fuzzy prediction fusion, the strengthening memory layer, and the parameter segment sharing strategy to enhance the accuracy and interpretability of LSTM for long-term time series prediction. First, the proposed fuzzy prediction model is used to obtain the fuzzy rule-based prediction value, which is fused into the input, forget, and output gates of LSTM. Then, the strengthening memory layer sums the hidden state and the cell state to form a new state and extracts more effective features using convolution and tanh functions; the new state and its extracted features are added to generate the strengthened memory state. Finally, the parameter segment sharing strategy can be flexibly adjusted for different datasets and for different prediction cycles and lengths, which improves the model’s ability to extract periodic features from time series data while effectively managing the growth of network parameters. Algorithm 1 shows the details of the FLSTM model.

Algorithm 1. Fuzzy inference-based LSTM (FLSTM).

Experimental study

To verify the prediction performance of FLSTM, a comparison with twenty-two prediction methods on seven collected real-world datasets is conducted. The twenty-two time series prediction methods selected for the comparative experiments include three classical prediction methods, ARIMA11, SVR35, and naive36; six deep learning-based prediction methods, GRU37, DRNN38, LSTM39, Reformer22, LogSparse self-attention23, and Efficient attention40; seven LSTM-based fuzzy inference methods, FD-LSTM41, FIS-LSTM42, SEIT2FNN43, RIT2NFS-WB44, MclT2FIS-UM45, MclT2TIS-US45, and eIT2FNN-LSTM46; an LSTM-based fuzzy Gaussian prediction method, LFIGLSTM30; a fuzzy Gaussian-based fuzzy inference prediction method, LFIGFIS47; a fuzzy prediction method, FPFTS48; a hybrid method, MLP-Arima8; and a nonlinear autoregressive neural network, NAR49. The seven real-world datasets cover crucial indicators in electric power deployment, air quality assessment, daily numbers of Covid-19 cases, monthly sunspot numbers, daily maximum temperatures, abalone age, and fuel consumption, i.e., Electricity Transformer Temperature (ETT)50, PM2.551, daily Covid-19 cases52, monthly sunspot numbers53, daily maximum temperatures54, abalone age51, and miles per gallon51. To evaluate the prediction effectiveness of the proposed method, six performance indexes, MSE, MAE, RMSE, SMAPE, MAPE, and MASE, are adopted8,40,48. For fairness, the selection of prediction lengths is consistent with the original papers of the compared models for the different datasets, and the results of the compared models are taken from the reports in those papers.

Experiment 1: Electricity Transformer Temperature time series

These time series are collected from the Electricity Transformer Temperature (ETT) dataset50, shown in Fig. 4, where \(\textrm{ETTh}_{1}\) and \(\textrm{ETTh}_{2}\) contain 2 years of 1-hour-level data from two separated counties in China, and \(\textrm{ETTm}_{1}\) and \(\textrm{ETTm}_{2}\) are the 15-minute-level versions of the same data. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the batch size is set to 32, the learning rate to 0.001, the training epochs to 100, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64.

Figure 4. Illustration of Electricity Transformer Temperature (ETT) time series.

For the \(\textrm{ETTh}_{1}\) and \(\textrm{ETTh}_{2}\) time series, the prediction lengths are set to 3, 6, 12, 18, 24, 36, 48, and 168. For the \(\textrm{ETTm}_{1}\) and \(\textrm{ETTm}_{2}\) time series, the prediction lengths are set to 4, 8, 12, 16, 24, 32, 48, 96, and 288. The prediction performance of ARIMA11, GRU37, DRNN38, LSTM39, FD-LSTM41, FIS42, Reformer22, LogTrans23, Efficient-att40, and the proposed method with different prediction lengths on the four time series is listed in Tables 1, 2, 3 and 4. The best results are highlighted in boldface and the winning counts are listed in the last column.

From Tables 1 and 2, we can see that FLSTM achieves better results than LSTM, decreasing MSE by 19.9% (at 48) and 29.0% (at 168) on average; this reveals that FLSTM significantly improves the performance of LSTM. In comparison with ARIMA, GRU, DRNN, Reformer, and LogTrans, FLSTM outperforms these methods across all datasets. FLSTM beats Efficient-att in winning counts, i.e., \(14>3\) and \(15>3\), and surpasses Efficient-att at longer lengths (\(\ge 36\)). From Tables 3 and 4, we can see that FLSTM achieves better results than LSTM, decreasing MSE by 28.9% (at 96) and 32.4% (at 288) on average. In comparison with ARIMA, GRU, DRNN, LSTM, FD-LSTM, FIS, and Reformer, FLSTM outperforms these methods across all datasets. FLSTM beats LogTrans and Efficient-att in winning counts, i.e., \(12>8\) and \(12>1\) for \(\textrm{ETTm}_{1}\), and \(12>4\) and \(12>4\) for \(\textrm{ETTm}_{2}\). The experiments show the success of FLSTM in enhancing prediction performance in the long-term prediction problem.

Table 1 Time series forecasting results on \(\textrm{ETTh}_1\) dataset.
Table 2 Time series forecasting results on \(\textrm{ETTh}_2\) datasets.
Table 3 Time series forecasting results on \(\textrm{ETTm}_1\) datasets.
Table 4 Time series forecasting results on \(\textrm{ETTm}_2\) datasets.

Experiment 2: PM2.5 time series

Currently, research on PM2.5 data has generated great enthusiasm, and more and more deep learning-based models have been proposed and applied to long-term PM2.5 prediction55,56. Therefore, we include time series prediction experiments on PM2.5 data. These time series are collected from the PM2.5 dataset51, shown in Fig. 5, where BeijingPM and ShanghaiPM are the PM2.5 data of Beijing and Shanghai in China from 2010 to 2015, containing 50387 and 51892 observations, respectively. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the batch size is set to 32, the learning rate to 0.001, the training epochs to 100, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction lengths are set to 200, 400, and 600. Tables 5 and 6 summarize the evaluation results of ARIMA11, LSTM39, FPFTS48, and FLSTM with the three long-term prediction lengths. The best results are highlighted in boldface and the winning counts are listed in the last row.

Figure 5. Illustration of PM2.5 time series.

Table 5 demonstrates that FLSTM outperforms the other methods on the PM2.5 time series of Beijing in terms of all evaluation metrics, except that FPFTS has the smallest RMSE when the prediction length is 600. The proposed method surpasses FPFTS in winning counts, i.e., \(5>1\). In comparison with LSTM, the proposed method achieves an RMSE decrease of 7.0% (at 200), 8.0% (at 400), and 8.1% (at 600), which demonstrates better prediction performance. From Table 6, we can see that FLSTM outperforms ARIMA, LSTM, and WM on the two evaluation metrics for all prediction lengths on the PM2.5 time series of Shanghai. FLSTM surpasses FPFTS in winning counts, i.e., \(4>2\). In comparison with LSTM, FLSTM achieves an RMSE decrease of 38.2% (at 200), 26.6% (at 400), and 16.7% (at 600). The experiment shows the success of FLSTM in improving prediction capacity for long-term prediction.

Table 5 Time series forecasting results on PM2.5 time series of Beijing.
Table 6 Time series forecasting results on PM2.5 time series of Shanghai.

Experiment 3: Daily number of Covid-19 cases time series

This time series is collected from the daily number of Covid-19 cases database maintained by the organization Our World In Data (OWID)52, and contains the daily number of cases worldwide until April 25th, 2021, shown in Fig. 6. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the batch size is set to 32, the learning rate to 0.001, the training epochs to 100, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction lengths are set to 7, 14, and 28, the same as in the literature8. The prediction performance of ARIMA(2,0,4)(0,1,2), MLP(14,5,1), MLP-Arima8, and FLSTM for short (7 days), medium (14 days), and long (28 days) prediction lengths is listed in Table 7. The best results are highlighted in boldface and the winning counts are listed in the last row.

Figure 6. Illustration of the daily number of Covid-19 cases time series.

Table 7 demonstrates that FLSTM outperforms the other methods in MASE and SMAPE for all prediction lengths on the daily number of Covid-19 cases time series. In comparison with the state-of-the-art method MLP-Arima8, FLSTM achieves a MASE decrease of 6.6% (at 7), 79.5% (at 14), and 19.1% (at 28), which demonstrates better prediction performance. Table 7 also shows that FLSTM surpasses the comparative methods in all winning counts. The experiment shows the success of FLSTM in improving prediction capacity for different prediction lengths.

Table 7 Time series forecasting results on the daily number of Covid-19 cases time series.

Experiment 4: Monthly sunspot numbers time series

Figure 7. Illustration of the Zuerich monthly sunspot numbers time series.

This time series is collected from sunspot data, shown in Fig. 7, where SUNSPOT53 contains the Zuerich monthly sunspot numbers from 1749 to 1983, with 2819 observations. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the batch size is set to 32, the learning rate to 0.001, the training epochs to 100, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction lengths are set to 1, 55, 110, and 165. Table 8 summarizes the evaluation results of LFIGLSTM30, LFIGFIS47, LSTM39, NAR49, ARIMA11, SVR35, naive36, and FLSTM with the four prediction lengths. The best results are highlighted in boldface and the winning counts are listed in the last column.

Table 8 demonstrates that FLSTM outperforms the other methods in RMSE, MAPE, and MAE for all prediction lengths on the monthly sunspot numbers time series. In comparison with the state-of-the-art method LFIGLSTM, FLSTM achieves an RMSE decrease of 85.2% (at 1), 50.5% (at 55), 34.8% (at 110), and 27.2% (at 165), which demonstrates better prediction performance. Table 8 also shows that FLSTM surpasses the comparative methods in all winning counts. The experiment shows that FLSTM has advantages over classical prediction models, deep learning prediction models, and hybrid prediction models in both short-term and long-term prediction tasks.

Table 8 Time series forecasting results on monthly sunspot numbers.

Experiment 5: Daily maximum temperatures time series

Figure 8. Illustration of the Melbourne daily maximum temperatures time series.

This time series is collected from temperature data, shown in Fig. 8, where Tmax54 contains the daily maximum temperatures in Melbourne from 1981 to 1990, with 3649 observations. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the batch size is set to 32, the learning rate to 0.001, the training epochs to 100, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction lengths are set to 1, 178, 356, and 534. Table 9 summarizes the evaluation results of LFIGLSTM30, LFIGFIS47, LSTM39, NAR49, ARIMA11, SVR35, naive36, and FLSTM with the four prediction lengths. The best results are highlighted in boldface and the winning counts are listed in the last column.

Table 9 demonstrates that FLSTM outperforms the other methods on the daily maximum temperatures in terms of all evaluation metrics, except that LFIGLSTM has the smallest MAE when the prediction lengths are 356 and 534. In comparison with the state-of-the-art method LFIGLSTM, FLSTM achieves an RMSE decrease of 18.9% (at 1), 13.8% (at 178), 15.4% (at 356), and 9.5% (at 534). The experiment shows that FLSTM has significant advantages over these prediction models in both short-term and long-term prediction tasks.

Table 9 Time series forecasting results on maximum temperatures.

Experiment 6: Abalone age time series

Figure 9. Illustration of abalone age time series.

The abalone age (ABALONE) time series51 is collected from the UCI machine learning repository, shown in Fig. 9, and includes 4177 observations. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the learning rate is set to 0.001, the training epochs to 200, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction length is set to 835. Table 10 summarizes the evaluation results of SEIT2FNN43, RIT2NFS-WB44, MclT2FIS-UM45, MclT2TIS-US45, eIT2FNN-LSTM46, and FLSTM. The best result is highlighted in boldface.

Table 10 demonstrates that FLSTM outperforms the other methods on the abalone age prediction problem in terms of RMSE. In comparison with the state-of-the-art method eIT2FNN-LSTM, FLSTM achieves an RMSE decrease of 4.5%. The experiment shows that FLSTM has significant advantages over these prediction models.

Table 10 Time series forecasting results on abalone age prediction.

Experiment 7: Miles-Per-Gallon time series

Figure 10. Illustration of Miles-Per-Gallon time series.

The Miles-Per-Gallon (MPG) time series51 is collected from the UCI machine learning repository, shown in Fig. 10, and includes 392 observations. Experimental parameters are set as follows: the parameters are updated with the Adam optimizer, the learning rate is set to 0.001, the training epochs to 200, the number of experiment repetitions to 6, the dimension of the hidden layer to 64, and the input and output channels of the Conv1d in the strengthening memory layer to 64. The prediction length is set to 120. Table 11 summarizes the evaluation results of SEIT2FNN43, RIT2NFS-WB44, MclT2FIS-UM45, MclT2TIS-US45, eIT2FNN-LSTM46, and FLSTM. The best result is highlighted in boldface.

Table 11 demonstrates that FLSTM outperforms the other methods on the Miles-Per-Gallon prediction problem in terms of RMSE. In comparison with the state-of-the-art method eIT2FNN-LSTM, FLSTM achieves an RMSE decrease of 9.7%. The experiment shows that FLSTM has significant advantages over these prediction models.

Table 11 Time series forecasting results on Miles-Per-Gallon prediction.

Ablation study

To demonstrate the respective roles of the different components in the proposed method, namely the fuzzy prediction fusion (FPF), the strengthening memory layer (SML), and the parameter segment sharing (PSS) strategy, an ablation study on the \(\textrm{ETTh}_1\) dataset is carried out. For a finer analysis, the experimental results for different combinations of LSTM, FPF, SML, and PSS are shown in Tables 12 and 13 for different prediction lengths.

The results presented in Tables 12 and 13 reveal that the proposed FLSTM outperforms all other combinations of LSTM, FPF, SML, and PSS for short-term and long-term predictions in terms of MSE and MAE. The combinations with different components all improve the accuracy of LSTM, which verifies the respective roles of FPF, SML, and PSS. Although some combinations with two of the components also reach the best results of the proposed method, such as LSTM+SML+PSS at short-term prediction lengths, the performance drops whenever one component is removed from the proposed method at long-term prediction lengths. This is attributed to the fact that each component has a positive impact on prediction capacity. The proposed method gathers the benefits of the three components and achieves the best performance for all prediction lengths.

Table 12 Ablation study on the \(\textrm{ETTh}_1\) dataset for short-term prediction.
Table 13 Ablation study on the \(\textrm{ETTh}_1\) dataset for long-term prediction.

Ethics declarations

No experiments on humans and/or animals were involved in this study.

Conclusion

LSTM-based models have yielded great success in the time series forecasting research field, yet these methods suffer from general drawbacks such as accumulated error, diminishing temporal correlation, and lacking interpretability. This research designs a time series prediction model by integrating the linear Wang–Mendel fuzzy inference prediction method with the LSTM network, which makes the model parameters more scientific and interpretable and improves performance in short-term time series prediction tasks. This study also aims to solve the problem of LSTM’s poor performance in long-term time series prediction tasks. We strengthen the long-term memory with the strengthening memory layer and balance the processing efficiency and structural discrimination of the model with the parameter segmentation sharing strategy, which alleviates LSTM’s poor long-term prediction performance caused by the gradient dispersion problem.

Seven publicly available time series are used to compare the prediction performance of the proposed method with twenty-two methods, including three classical prediction methods ARIMA, SVR, and naive; six deep learning-based prediction methods GRU, DRNN, LSTM, Reformer, LogSparse self-attention, and Efficient attention; seven LSTM-based fuzzy inference methods FD-LSTM, FIS-LSTM, SEIT2FNN, RIT2NFS-WB, MclT2FIS-UM, MclT2TIS-US, and eIT2FNN-LSTM; an LSTM-based fuzzy Gaussian prediction method LFIGLSTM; a fuzzy Gaussian-based fuzzy inference prediction method LFIGFIS; a fuzzy prediction method FPFTS; a hybrid method MLP-Arima; and a nonlinear autoregressive neural network NAR. In comparison with the classical prediction methods, FLSTM outperforms them across all datasets. In comparison with the hybrid method, FLSTM acquires better prediction performance for all prediction lengths. In comparison with the deep learning-based prediction methods, FLSTM beats them in winning counts. In comparison with the fuzzy prediction method, FLSTM outperforms it in terms of winning counts. The experiments show the success of FLSTM in improving prediction capacity for long-term prediction. FLSTM has disadvantages in computational complexity: it can only predict one step at a time, so the time cost grows as the prediction length increases, and the fixed fuzzy rule generation mechanism limits the flexibility of prediction. These limitations also provide ideas for future research.

Future research will include the following: (1) supporting multi-step prediction in a single pass; (2) providing fuzzy reasoning with different cycle lengths; (3) extending the LSTM network to more complex data; (4) applying the proposed method to other appealing directions.