Abstract
Traditional recurrent neural networks (RNNs) are essential for processing time-series data. However, they function as read-only models, lacking the ability to directly modify the data they learn from. In this study, we introduce the corrector long short-term memory (cLSTM), a Read & Write LSTM architecture that not only learns from the data but also dynamically adjusts it when necessary. The cLSTM model leverages two key components: (a) predicting LSTM’s cell states using Seasonal Autoregressive Integrated Moving Average (SARIMA) and (b) refining the training data based on discrepancies between actual and forecasted cell states. Our empirical validation demonstrates that cLSTM surpasses read-only LSTM models in forecasting accuracy across the Numenta Anomaly Benchmark (NAB) and M4 Competition datasets. Additionally, cLSTM exhibits superior performance in anomaly detection compared to hierarchical temporal memory (HTM) models.
1 Introduction
Recurrent neural networks (RNNs) form the foundation of sequence learning. In a diverse array of machine learning applications that necessitate sequence modeling, variations of RNNs, such as long short-term memory (LSTM) and gated recurrent units (GRU), have demonstrated their efficacy in capturing long-term dependencies. LSTM and GRU have found success in applications ranging from natural language processing [1] and text classification [2] to speech recognition [3] and forecasting [4, 5]. In the realm of forecasting, RNNs have emerged as formidable contenders against traditional statistical methods [6, 7], particularly highlighted by their impressive performance in winning the M4 competition [8].
At the heart of the RNN’s functionality lies its ability to propagate information from past observations to future ones through internal states. This capability has enabled RNNs to excel in capturing nonlinear patterns within data [9]. Long short-term memory (LSTM), a sophisticated variant of RNN, employs gate mechanisms [10], enhancing the network’s aptitude for encoding long-term dependencies. The LSTM cell generates two states: a hidden state and a cell state. The cell state represents a cumulative memory of the LSTM network across multiple time steps, making it a repository for preserving long-term information [11]. We posit that understanding the dynamics of cell states (CSD) could furnish valuable insights into the characteristics of the learned data. Consequently, we investigate the feasibility of utilizing CSD to identify learning process problems within the training data. In contrast to conventional RNNs, which essentially possess read-only access to data, we introduce a novel RNN variant endowed with both read-and-write privileges.
Existing literature reveals a schism between researchers focusing on improving algorithmic learning capacity and those dedicated to enhancing data quality. Some researchers have leveraged the model’s feedback to rectify data discrepancies (e.g., employing prediction errors [12]). However, these approaches generally treat model learning and data preprocessing as separate pursuits.
To bridge this gap, we introduce a novel paradigm known as read & write machine learning (RW-ML). RW-ML augments traditional read-only models with the capability to not only learn from data but also dynamically modify it when required. This paradigm shift opens up new avenues for enhancing the adaptability and performance of machine learning models, particularly in dynamic and evolving environments. We propose a fresh variant of recurrent neural networks (RNNs) called the corrector long short-term memory (cLSTM). The principal objective of cLSTM is to seamlessly integrate data preprocessing into the learning process. By harnessing the cell and hidden states inherent in LSTM models, we hypothesize that these states furnish valuable insights for identifying learning process aberrations (see: Fig. 1). Such insights empower the model to dynamically adjust the data, optimizing the learning process and consequently refining predictive outcomes. Through astute utilization of LSTM’s cell and hidden states, cLSTM optimizes data processing and learning, culminating in heightened predictive accuracy.
Extensive experiments on benchmark datasets, including the Numenta Anomaly Benchmark (NAB) and M4 competition dataset, validate cLSTM’s effectiveness in forecasting and anomaly detection tasks, demonstrating its superiority over traditional read-only LSTM models and hierarchical temporal memory, respectively.
The main contributions of the paper are summarized as follows:
1. Introduce the read & write machine learning paradigm.
2. Utilization of cell states for detecting learning problems.
3. Introduction of cLSTM, a RW-ML model leveraging internal states for dynamic data adjustment.
4. Empirical validation through experiments on NAB and M4 competition datasets.
5. Superiority of cLSTM over traditional LSTM and hierarchical temporal memory in forecasting and anomaly detection tasks, respectively.
2 Background
In the landscape of machine learning research, a distinct disparity has existed, wherein the emphasis on enhancing models has often overshadowed the crucial importance of data collection and data quality considerations [13]. This discrepancy is palpable in the industry, where a disproportionate allocation of 90% of machine learning research efforts is directed toward refining algorithms, leaving a mere 10% for data preparation. Some argue that this distribution should be inverted [14]. Scholarly inquiries into data preparation have given rise to various approaches applied prior to the learning process. These methodologies are often categorized as anomaly detection [12, 15,16,17], denoising [18], and concept drifts [19].
Many of these endeavors rely on prediction errors. For instance, [12] and [20] employ LSTM models to predict future time steps and identify substantial deviations from these predictions. The discrepancy between observed and predicted values is compared to a threshold, and an input is flagged as faulty if the discrepancy exceeds it. A similar approach has been extended to collective anomaly detection [21]. Moreover, [22] harnesses the hidden states of the hierarchical temporal memory algorithm to compute deviations. [23] introduced AOSMA-LSTM, an LSTM that integrates an enhanced version of the Aquila optimizer (AO) algorithm with the search mechanisms of the slime mold algorithm (SMA).
Recent research has explored the use of transformers in time-series forecasting, with a focus on improving both accuracy and efficiency. [24] and [25] both investigated the application of transformer-like models in time-series data, with [24] concentrating on next-frame prediction and [25] on solar power generation data. In [26], the authors proposed Informer, an efficient transformer-based model for long sequence time-series forecasting. In [27], the authors introduced Scaleformer, a multi-scale refining transformer, which iteratively refines forecasted time series at multiple scales, achieving significant performance improvements. Moreover, [28] introduced ResInformer, a novel transformer-based approach tailored for forecasting PM2.5 concentration in major Chinese cities. It leverages an improved Informer architecture with attention distillation and residual block-inspired structures. Through extensive evaluation using 98 months of air quality index datasets, ResInformer outperformed Informer, showcasing its potential for accurate and efficient air pollution forecasting in urban environments. Similarly, [29] introduced ETSformer, a time-series transformer architecture that incorporates exponential smoothing to enhance forecasting. Additionally, [30] proposed state-denoised recurrent neural networks, aiming to denoise sequential data for improved modeling. A series of transformer-based models has been proposed for time-series forecasting, each with its unique features. [31] introduces Earthformer, a space-time transformer for Earth system forecasting. Furthermore, [32] enhances the performance of the transformer for long-term series forecasting with FEDformer, combining the transformer with the seasonal-trend decomposition method. [33] focuses on multivariate time-series forecasting, proposing Crossformer, a transformer-based model that captures both cross-time and cross-dimension dependencies.
Recent developments have signaled a shift toward data-centric AI [34], where the primary objective is to utilize machine learning to enhance data quality, ultimately leading to improved model accuracy. Early research efforts have explored hybrid approaches aiming to leverage the intrinsic interplay between data and predictive performance. For instance, [35] and [36] proposed a hybrid model integrating an LSTM network with the Seasonal Autoregressive Integrated Moving Average (SARIMA) model. Similarly, [12, 15, 20, 37, 38] utilized LSTM, bidirectional LSTM (BiLSTM), and LSTM autoencoders as preprocessing steps to detect anomalies in the data. These methods rely on discrepancies between predicted and observed values, quantified as prediction errors. Variants of recurrent neural networks (RNNs) like delayed LSTM (dLSTM) [39] have also harnessed predictive errors from normal data to uncover anomalies. Autoencoder-based techniques [38, 40] have also been employed for anomaly detection. Similarly, [41] introduced an attention-based model named ACL-SA, which combines Convolutional Neural Networks (CNNs) with long short-term memory (LSTM) networks for enhancing data in text classification tasks. Numerous methodologies have demonstrated that neural networks display reduced sensitivity to minor levels of noise [19]. These studies suggest that noise can even contribute to improved generalization and convergence [42]. However, this principle does not universally apply to all forms of noise, such as non-white noise in input data [43] or inaccuracies in labels [44]. In scenarios where an anomaly is detected, various correction techniques involve substituting faulty data with predicted sequences, combining the two [45], or employing generated sequences (e.g., TAD-GAN [17]). It is essential to recognize that faults are often categorized as anomalies, missing data, or noise. 
However, beyond these classifications, other types of errors, including those identified by the learning process itself, can profoundly impact the learning process. A recent approach [46] advocates for detecting prediction errors during the learning process. This strategy allows data to be processed using the model’s feedback, opening up promising avenues for data-centric AI and its potential to further enhance model performance.
2.1 Long short-term memory
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) specifically designed to capture and model long-term interactions within sequential data. In contrast to feed-forward neural networks, RNNs possess recurrent connections that enable them to learn from sequential input, making them adept at handling time-dependent patterns. However, traditional RNNs are prone to the issue of vanishing and exploding gradients [47], which hinders their ability to effectively learn long-term dependencies.
To overcome this limitation, LSTM incorporates a gating mechanism [10], which involves the addition of three gates to the standard RNN architecture: input, output, and forget gates. These gates are realized through specific mathematical transformations (Eqs. 1b to 1d), resulting in the creation of two essential states: hidden states, denoted as h (Eq. 1g), and cell states, denoted as c (Eq. 1f). These states enable the LSTM network to retain relevant information from previous time steps and seamlessly combine it with the current input.
As data is fed into the LSTM network, the input gate, implemented with a sigmoid function, determines which information is essential and should be retained. Simultaneously, the forget gate, also using a sigmoid activation, decides which information from the previous state should be discarded. The output gate, yet another sigmoid function, regulates which parts of the combined information should be exposed to the subsequent layers of the network. Through this gating mechanism, LSTM effectively manages the flow of information, allowing it to capture long-term dependencies and learn intricate temporal patterns within sequential data.
At each time step t, an LSTM cell takes an input \(x_t\) along with the previous hidden state \(h_{t-1}\) and other internal cell data to compute the next hidden and memory states through feed-forward and recurrent connections. Consequently, the LSTM cell generates both a hidden state (\(h_t\)) and a cell state (\(c_t\)) at each time step. The hidden state \(h_t\) can serve as the final output of the network or be passed to another LSTM cell if additional sequences need to be learned. Typically, the last hidden vector is fed through a linear, fully connected layer to produce predictions in sequence learning tasks.
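The gating described above can be illustrated with a minimal sketch of one LSTM step for a single hidden unit. It assumes the standard LSTM formulation with scalar input and states; the function name `lstm_step` and the weight layout `w` are hypothetical and are not taken from the authors' implementation.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell with scalar input and states.

    `w` maps each of "i", "f", "o", "g" to a (w_x, w_h, b) triple.
    This layout is an illustrative assumption, not the paper's code.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre("i"))        # input gate: what to keep from the current step
    f_t = sigmoid(pre("f"))        # forget gate: what to keep from the previous state
    o_t = sigmoid(pre("o"))        # output gate: what to expose as the hidden state
    g_t = math.tanh(pre("g"))      # candidate cell state
    c_t = f_t * c_prev + i_t * g_t # new cell state
    h_t = o_t * math.tanh(c_t)     # new hidden state
    return h_t, c_t
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one, which makes the role of the forget gate easy to see.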
3 Cell states dynamics analysis
The fundamental premise underlying the proposed method, cLSTM, revolves around the concept of cell state dynamics (CSD). In conventional LSTM networks, data-related information is captured and stored within its internal states. These states are then utilized to predict future values, under the assumption that the information contained is accurate and pertinent. However, input data may contain observations that can mislead the model’s learning and subsequent predictions. This scenario arises when the training process attempts to comprehend intricate data components while incorporating a subset of observations that might not align with the model’s requirements. This situation can have adverse effects on the model, potentially leading to suboptimal data modeling outcomes.
CSD pertain to alterations in the attributes of cell states over time [48]. In an LSTM architecture, the dimensionality of cell states corresponds to the LSTM size. In simpler terms, at each time step, a cell produces a cell state vector of a length equivalent to the number of hidden units. These cell states can be organized into a concatenated signal (\(c_1,...,c_T\)). In [49], the authors used a similar approach for interpretability. The changes over time of the cell states or hidden states are activation signals of the model. When an LSTM network with N hidden units processes a time series, N values are produced at each time step, resulting in a concatenated vector highly representative of the seasonality and information captured from the input sequence and previous sequences.
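As a minimal sketch of this concatenation, assuming the per-step cell states are available as plain Python lists (the helper name `concatenate_cell_states` is hypothetical), the vectors can be flattened into one signal whose natural period equals the number of hidden units:

```python
def concatenate_cell_states(cell_states):
    """Flatten per-step cell state vectors (c_1, ..., c_T) into one signal.

    Returns the flattened signal and the number of hidden units N, which
    is the period at which the signal repeats its unit ordering.
    """
    n_units = len(cell_states[0])
    if any(len(c) != n_units for c in cell_states):
        raise ValueError("all cell state vectors must have the same length")
    signal = [value for c in cell_states for value in c]
    return signal, n_units
```

For T time steps and N hidden units the resulting signal has length T × N.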
Figure 2 illustrates the cell states learned from a time-series sequence using both LSTM and cLSTM. This depiction is corroborated by the corresponding time series shown in Fig. 2. Notably, the cell states displayed in Fig. 2 exhibit minimal variation outside the delineated gray interval. In contrast, Fig. 2 depicts partially altered time-series data.
Cell states serve as representations of processed information extending from the start of a sequence up to the current time step. Over the course of time, these states dynamically evolve in response to input data and its influence on the learning process. Nevertheless, instances where the input deviates unexpectedly or lacks relevance to the model’s objective prompt the LSTM to adapt its learning behavior. Consequently, this adaptation triggers changes in the cell states, as highlighted in Fig. 3. By scrutinizing these cell states, it becomes feasible to identify instances where the learning process deviates from anticipated behavior, thereby facilitating the detection of issues in the learning process.
4 Corrector LSTM
Incorporating data preprocessing tasks into machine learning processes enhances the accuracy and usability of analytical systems [50]. In addition to this, aligning data preprocessing with the learned model can result in accuracy gains. Neglecting model feedback in a preprocessing stage may lead to erroneous changes to the data [51]. To address this concern, we propose a solution that involves preprocessing the data in the learning process. We introduce corrector long short-term memory (cLSTM). The core principle underlying cLSTM revolves around integrating preprocessing into the learning process itself, recognizing the crucial influence of model feedback on effective data preprocessing. The cLSTM framework involves two key stages: detection and correction, as illustrated in Fig. 4.
4.1 Detection component
The detection component (DC) uses cell state dynamics produced by LSTM to identify abnormal learning behaviors. This component analyzes cell states as follows:
Let \( \mathcal {S} =(x_1,x_2,\cdots ,x_T) \) be a sequence drawn from a distribution \(\mathcal {D}_{data}\). At each iteration, an observation \(x_t\) along with a hidden state from the previous observation \(h_{t-1}\) is processed by an LSTM cell, which uses that to update all its units through a forward pass and computes the error vector for all its weights through a backward pass. Finally, it produces a new cell state \(c_t\), which allows us to get the new hidden states \(h_t\). For a sequence of length k, the model produces k cell states over the range \([t-k, t]\) which can be represented as follows:
Where \(f_p\) and \(i_p\) are the forget and input gates, respectively. \(f_p\) determines what information should be kept from the previous time step, while \(i_p\) decides which information should be kept from the current time step. \(\hat{C_p}\) determines the new cell state candidates.
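Assuming the standard LSTM cell update, which matches the symbols defined above, the cell states over the window can be sketched as follows. The previous cell state \(c_{p-1}\), the element-wise product \(\odot\), and the indexing over \(p\) are assumptions made for this illustration.

```latex
c_p = f_p \odot c_{p-1} + i_p \odot \hat{C}_p, \qquad p \in [t-k,\, t]
```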
At a given epoch \(e_{i}\), a concatenation of the cell states (i.e., Equation 2) of all time steps is used to train a Seasonal ARIMA model (SARIMA). During epoch \(e_{i+1}\), SARIMA forecasts the values of the cell states at each time step while LSTM is learning. These forecasts are compared to the cell states produced by LSTM in the same epoch. The comparison is performed using the Euclidean distance to quantify the difference between the forecasted and the actual cell states. A threshold \(\eta \) is employed to detect changes in the cell states: if the distance exceeds \(\eta \), cLSTM identifies an issue in the learning process, prompting a modification to the corresponding input (see: Fig. 5). The choice of SARIMA is due to the stationary nature and high seasonality of the cell states. It is important to let the model train for a number of epochs, so the data is well represented in the cell states.
4.1.1 Seasonal ARIMA
Seasonal ARIMA (SARIMA) is an adaptive ARIMA model used when the time-series exhibits seasonal variation. ARIMA is defined using (p,d,q) parameters, also called the ARIMA order. d is the level of differencing, p is the autoregressive order, and q is the moving average order [52]. The ARIMA model is defined in Eq. 3.
where \(z_t\) is the differenced series, the constant is denoted by \( \delta \), \(\phi _i\) is an autoregressive operator, \(a_t\) is a random shock corresponding to time period t, and \(\theta _i\) is a moving average operator.
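For reference, the ARIMA model is commonly written in the following Box-Jenkins form, using the symbols defined above with \(a_t\) as the random shock at time t. The sign convention on the moving average terms is an assumption, since conventions differ between texts.

```latex
z_t = \delta + \sum_{i=1}^{p} \phi_i \, z_{t-i} + a_t - \sum_{i=1}^{q} \theta_i \, a_{t-i}
```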
SARIMA adds to ARIMA a seasonal order \((\textit{P},\textit{D},\textit{Q})_s\), which corresponds to the seasonal autoregressive order (P), the seasonal differencing order (D), and the seasonal moving average order (Q). The variable s indicates the length of the seasonal period. For example, a sequence with a seasonal period of 20 observations would have \(s=20\).
The SARIMA order selection process is automated using the auto_arima package, which conducts a grid search over a range of parameters (\( p \), \( d \), \( q \)) for the non-seasonal component and (\( P \), \( D \), \( Q \)) for the seasonal component. By exploring various combinations of these parameters, the package identifies the optimal SARIMA model for the given time-series data. Additionally, the parameter \( m \) (the seasonal period, denoted \( s \) above) is set to the size of the LSTM, representing the length of the LSTM cell states produced at each timestamp. This automated approach streamlines the SARIMA order selection process, enabling efficient and data-driven forecasting.
4.1.2 Similarity measure
cLSTM detects problems in the learning process by predicting the values of the cell states and comparing them to the observed ones. Therefore, it is necessary to measure the similarity between these two signals. To this end, the Euclidean point-by-point mapping approach is used (Eq. 4).
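The comparison can be sketched as below, assuming the forecasted and observed cell states are equal-length sequences of numbers. The helper names `euclidean_distance` and `needs_correction` are hypothetical; only the use of the Euclidean distance and of a detection threshold \(\eta\) comes from the text.

```python
import math


def euclidean_distance(forecast, actual):
    """Point-by-point Euclidean distance between two cell state signals."""
    if len(forecast) != len(actual):
        raise ValueError("signals must have the same length")
    return math.sqrt(sum((s - c) ** 2 for s, c in zip(forecast, actual)))


def needs_correction(forecast, actual, eta):
    """Flag a learning problem when the distance exceeds the threshold eta."""
    return euclidean_distance(forecast, actual) > eta
```

For example, `needs_correction([0, 0], [3, 4], eta=1.2)` flags the input because the distance of 5 exceeds the threshold.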
If an erroneous behavior occurs at time step t, depending on the threshold, cLSTM can trigger the correction component (see: Fig. 4). Similarly to constant error flow [53], a constant data error (CDE) flows gradually over time. CDE can be represented as follows:
Where \( \epsilon _{p} \) represents the CDE caused by the learned weights related to disrupting inputs from the previous iteration.
The detection threshold in cLSTM serves as a criterion for determining whether the disparity between the estimated cell states and the actual cell states generated by the LSTM cell is significant. If the calculated distance surpasses this threshold, indicating substantial dissimilarity, a correction process is initiated. During this phase, the model iterates through various adjustments to the input data, prompting the LSTM to generate new cell state signals for comparison with the estimated ones. This iterative correction process continues until the distance between the estimated and actual cell states falls below the correction threshold, signifying successful learning. Upon meeting this criterion, the model proceeds to process the next input.
4.2 Correction component
The Correction Component (CC) reconstructs each input associated with issues detected by the detection component (DC) in the learning process. The model states are preserved at each time step. Subsequently, the model is loaded prior to modifying the data.
In the data correction process, the correction component initially modifies the data using an update value parameter, denoted as \( \alpha \), such as \( \alpha = 0.1 \), selected based on the dataset’s scaling. If the scaling function used is standardization, employing the default update value of 0.1 can be a good choice. However, if the normalization method employed is Min–Max scaling, we preferentially reduce the update value to 0.01, as an example. The correction component monitors the LSTM model’s cell states following the data adjustment. Through an iterative process, the correction component continually updates the input using \( \alpha \) in the same direction if the cell states align with the estimated states until the desired correction level is reached. This ensures the data conforms more accurately to the learning problem, enhancing the model’s ability to capture underlying patterns and relationships.
Once the data has been altered, the model is retrained on it, and the model states are once again preserved. This cycle is repeated until the disparity between the SARIMA forecast and the updated cell states becomes less than a designated reconstruction threshold \(\delta \). The corresponding objective function can be defined in Eq. 6.
Where \(S_t\) is the SARIMA forecast and \(C_t\) is the actual cell states at time step \(\textit{t}\).
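The iterative correction can be illustrated with the following sketch for a scalar input. It rests on several assumptions that the text does not specify: `cell_state_fn` is a hypothetical stand-in for retraining the model on the modified input and reading back its cell states, the search direction is flipped whenever a step does not reduce the distance, and the best input found so far is returned when the patience budget runs out.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def correct_input(x, forecast, cell_state_fn, alpha=0.1, delta=0.2,
                  patience=5, max_steps=1000):
    """Nudge x by +/- alpha until the cell states lie within delta of the forecast.

    Returns (corrected_x, converged). `cell_state_fn` is hypothetical: it
    maps an input value to the cell state vector the model would produce.
    """
    best_x = x
    best_d = distance(forecast, cell_state_fn(x))
    direction = 1.0
    stalls = 0
    for _ in range(max_steps):
        if best_d < delta:
            return best_x, True          # correction threshold reached
        if stalls >= patience:
            break                        # early stopping: no progress
        candidate = best_x + direction * alpha
        d = distance(forecast, cell_state_fn(candidate))
        if d < best_d:
            best_x, best_d, stalls = candidate, d, 0   # keep moving this way
        else:
            direction = -direction       # wrong direction, try the other one
            stalls += 1
    return best_x, best_d < delta
```

In this toy form the defaults mirror the values reported in the text (update value 0.1, correction threshold 0.2, early stopping after five unsuccessful iterations), but the real procedure retrains the LSTM after each modification rather than calling a fixed function.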
We use \( \eta \) and \( \delta \) as detection and correction thresholds (see: Fig. 4), and \( \delta < \eta \). The choice of the detection threshold \( \eta \) and correction threshold \( \delta \) strongly depends on the scaling method used. In the context of LSTM models, the cell states are influenced by the input data and the weights learned during training. The scaling method applied to the input data affects the values of the cell states because the input data directly influences the cell state values through the input gates, forget gates, and output gates of the LSTM units. For instance, if the input data is standardized, the values fed into the LSTM units are within a certain range, typically around 0 with a spread determined by the standard deviation. As the LSTM units process the input sequence and update their cell states over time, the values of the cell states tend to stay within a similar range, which in our experiments appears to be between -1 and 1 due to our choice of standardization as the scaling method.
Once Eq. 6 is solved, we can confirm that \(\hat{X_t}\) does not affect the cell states behavior. Figure 6 shows cell states before and after the reconstruction.
5 Empirical study
We conduct a comparative analysis between cLSTM and LSTM using multiple time series. Our experimental setup involves utilizing univariate time series as training data. To predict multi-time step labels during the learning process, we leverage multi-time step samples. Upon completion of the training, the outcomes encompass the network’s weights and the refined time series. Given that the core objective of cLSTM is to enhance data quality, we also assess the algorithm’s proficiency in identifying and rectifying anomalies using the Numenta Anomaly Benchmark. Lastly, we provide insights into the computational overhead.
5.1 Research questions
We delineate the following pivotal research questions as the focal points of our study:
1. Does cLSTM perform better than LSTM in time series used in time-series forecasting research?
2. Does cLSTM perform better than LSTM in time series used for anomaly detection research?
3. Can cLSTM be used for anomaly detection?
6 Experimental setup
LSTM is the prime model for comparison with cLSTM. Thus, our study is exclusively centered on the head-to-head analysis of cLSTM and LSTM. Both models, LSTM and cLSTM, were evaluated using the same datasets and a fixed seed. Moreover, for the anomaly detection task, cLSTM has been compared to hierarchical temporal memory networks.
6.1 Datasets
We used an extensive dataset containing 1051 time-series, representing various applications. Within this dataset, 995 time-series were sourced from the renowned Macro M4 competition datasets. The Macro M4 dataset comprises six subsets, of which we chose five containing 199 time-series each. Additionally, we incorporated 55 time-series from Numenta Anomaly Benchmark (NAB) dataset. The NAB dataset features different types of time-series from various areas [54]. These time-series come from different sources and include different levels of anomalies. Each time-series in the NAB dataset is labeled accordingly. NAB’s categories include Artificial data Without anomalies (AWt), Artificial data With anomalies, realAdExchange (EX), realAWSCloudwatch (AWS), realKnownCause (RKC), realTraffic (Traffic), and realTweets (Tweets). For time-series exceeding 1000 data points, we used the first 500 points to manage computation.
6.2 Models’ parameters
For implementation, we used LSTM developed with PyTorch. The model input sequences are of length 1, generating cell state vectors of length 12 for each unit (given the LSTM size of 12). The model consists of a single layer, and training occurred over 50 epochs. Cell state dynamics were predicted at each timestamp, resulting in a SARIMA forecast length of 12 steps. The training parameters for the cLSTM and LSTM models are compared in Table 1.
In determining the threshold for the disparity between real and predicted cell states, our methodology involves a fine-tuning process. We systematically explore threshold values ranging from 0.6 to 1.4 in increments of 0.2 and evaluate the model’s performance across multiple experiments to identify the optimal threshold. This iterative approach allows us to select the threshold value that maximizes the model’s efficacy in learning problem detection. Additionally, the choice of threshold depends on the range of values exhibited by the cell states. Therefore, it primarily relies on the LSTM parameters and the scaling applied to the input data.
6.3 Evaluation
To ensure an equitable assessment between LSTM and cLSTM, a fixed initial configuration was applied across experiments involving the NAB dataset and each subset of the M4 datasets, for various learning problems detection thresholds. Experiments were conducted using five detection thresholds (\(\eta \)) at 0.6, 0.8, 1, 1.2, and 1.4. The correction threshold (\(\delta \)) was consistently set at 0.2. No fine-tuning of \(\delta \) was carried out, and its impact was not analyzed in this context. Convergence was determined by an early stopping criterion of five iterations, triggered when the model could not reach a distance below \(\delta \). During testing, an 88% training and 12% testing holdout was used. Prior to training, a standardization step was applied. For consistency and fair evaluation, we trained both LSTM and cLSTM models using PyTorch implementations with 50 epochs and a fixed seed.
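A minimal sketch of the holdout and standardization steps is given below. It assumes a chronological split and that the mean and standard deviation are computed on the training portion only; the text does not state either choice, and the helper names are hypothetical.

```python
import statistics


def holdout_split(series, train_fraction=0.88):
    """Chronological split: the first 88% for training, the rest for testing."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def standardize(train, test):
    """Scale both parts with the training mean and standard deviation."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    if std == 0:
        raise ValueError("training series is constant; cannot standardize")

    def scale(values):
        return [(v - mean) / std for v in values]

    return scale(train), scale(test)
```

Fitting the scaling on the training part alone avoids leaking information from the test portion into training, which is the usual reason for this design.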
6.3.1 Metrics
Our models were assessed using the mean absolute scaled error (MASE) [55], ideal for time-series with diverse scales. We also employed a combination of mean squared error (MSE) and the Diebold Mariano statistical test [56] to discern Wins, Draws, and Losses. A Draw implies comparable performance between the two models.
Incorporating both mean squared error (MSE) and mean absolute scaled error (MASE) is crucial to comprehensively assess learning performance. MSE effectively identifies significant predictive errors, particularly those stemming from outliers, by giving more weight to larger errors. This allows us to evaluate how well models handle substantial deviations from actual values. On the other hand, MASE provides a comprehensive performance measure that takes into account the scale of the data, facilitating a fair comparison across various forecasting methods. By carefully utilizing both MSE and MASE, we gain a complete understanding of model performance across datasets with varying levels of quality.
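Both metrics can be sketched in a few lines. The MASE below follows the common definition, in which the forecast MAE is scaled by the in-sample MAE of a naive forecast; the `season` argument (1 for the non-seasonal naive forecast) and the function names are assumptions for illustration.

```python
def mse(actual, predicted):
    """Mean squared error; squaring gives more weight to large errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train, season=1):
    """Mean absolute scaled error relative to an in-sample naive forecast."""
    if len(train) <= season:
        raise ValueError("training series is too short for this season")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")
    return mae / scale
```

A MASE below 1 means the forecast beats the naive in-sample baseline on average, which is what makes the metric comparable across series with different scales.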
6.4 Experimental results
The outcomes of this study are grouped into three sections, each corresponding to one of the three research questions. The experiments encompass two distinct scenarios for forecasting: using the M4 competition and NAB datasets.
6.4.1 cLSTM vs LSTM on M4
An architecture based on LSTM emerged victorious in the M4 competition [8]. This prompted us to conduct an in-depth comparison between cLSTM and LSTM, using datasets that were part of the same competition. The results are presented in Tables 2 and 3, showcasing the Wins for various thresholds, both collectively and individually. The findings underscore cLSTM’s superiority in the Daily, Weekly, and Monthly datasets. This pattern is further highlighted in Fig. 7. Conversely, LSTM demonstrated its prowess in the Hourly and Yearly datasets. This divergence could be attributed to the granularity and distinct patterns within each dataset. These trends persisted across different detection thresholds, with a threshold value of 1.2 proving optimal on average. Furthermore, the increased variability within the Hourly dataset might contribute to the heightened detection of false positive corrections. Conversely, the smaller size of the Yearly sequences limits cLSTM’s exposure to training data, possibly leading to less significant cell states dynamics. An example of an Hourly time series is depicted in Fig. 8. Additionally, fine-tuning the detection threshold \(\eta \) emerges as a pivotal consideration for cLSTM, as it substantiates correction precision (see: Table 3).
For a comprehensive evaluation of cLSTM’s Wins and Losses, we computed average MASE values in both scenarios (see Tables 4 and 5). The MASE values are consistent with the Wins and Losses, except for Weekly, where LSTM’s MASE values are consistently lower. Given the number of Wins in Table 3, we deduce that cLSTM does not elevate model performance beyond LSTM. However, cLSTM’s corrective potential enabled it to mitigate errors that influenced overall MSE across these datasets. Notably, the standard deviation of LSTM’s MASE is higher in Win cases compared to cLSTM’s in Loss cases. Additionally, the findings suggest that cLSTM fares better in datasets where LSTM’s MASE values are higher. In essence, cLSTM outperforms LSTM in datasets with outliers. Figure 9 illustrates corrections applied to the original time series (left) and the forecasting outcomes (right). Another instance in Fig. 8 demonstrates the impact of a high number of corrections on forecasts.
Regarding Loss cases, Figs. 10 and 11 exhibit a similar trend, which could be attributed to the prevalence of false positives in such instances. Notably, Fig. 10 showcases cLSTM’s superior aptitude in predicting seasonality. Figure 12 portrays a step function time series. Here, cLSTM slightly adjusted the training data, with a detrimental effect on accuracy. Although we integrated early stopping for situations where cLSTM fails to identify the correct value, further enhancements are necessary to prevent substantial corrections in the wrong direction (refer to Fig. 13).
During training of the cLSTM model, we monitored the training loss, which decreased consistently over epochs and converged robustly immediately after the corrections were applied at epoch 50, the epoch chosen in this study (see Fig. 14). This suggests that the model effectively grasped the underlying patterns of the training data. We also observed a strong correlation between the validation loss and the MSE on the validation dataset. Lower MSE corresponded to lower validation loss, indicating accurate predictions by the cLSTM model on validation data. Conversely, higher MSE values led to higher validation loss, highlighting significant deviations from the ground truth. This correspondence underscores cLSTM’s ability to minimize forecasting errors.
6.4.2 cLSTM vs LSTM on NAB
In pursuit of improved data quality, the logical course involves addressing anomalies. This section examines the performance of cLSTM and LSTM on data featuring anomalies, exemplified by the Numenta Anomaly Benchmark (NAB) dataset. Table 6 shows that cLSTM wins on 26 time series, ties on 24, and loses on six. Figure 15 provides a glimpse of corrected anomalous data (left) and its associated forecast (right). Furthermore, Table 7 lays out MASE values for cLSTM and LSTM. cLSTM reduces average errors in three out of seven datasets, matches LSTM in two, and lags in two. Despite cLSTM’s higher count of Wins in the Tweets category (8 out of 10), MASE values hint at cLSTM’s underperformance compared to LSTM. This could be attributed to cLSTM’s rectification of larger errors, which lowered the MSE values without substantially altering the model’s global performance.
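The Win, tie, and Loss counts reported for Table 6 can be thought of as a per-series tally. The sketch below assumes that a Win means a strictly lower per-series error (for example, MSE) for cLSTM and that a tie means equal errors up to a small tolerance; the criterion actually used in the study may differ (for instance, it may rely on a significance test). The function name `tally_outcomes` and the tolerance `tol` are hypothetical.

```python
def tally_outcomes(clstm_errors, lstm_errors, tol=1e-9):
    """Count (wins, ties, losses) for cLSTM against LSTM, one entry per series.

    Assumption: lower error is better, and errors closer than `tol` are a tie.
    """
    if len(clstm_errors) != len(lstm_errors):
        raise ValueError("both error lists must cover the same series")

    wins = ties = losses = 0
    for c_err, l_err in zip(clstm_errors, lstm_errors):
        if abs(c_err - l_err) <= tol:
            ties += 1      # both models are effectively equal on this series
        elif c_err < l_err:
            wins += 1      # cLSTM has the lower error
        else:
            losses += 1    # LSTM has the lower error
    return wins, ties, losses


if __name__ == "__main__":
    # Toy per-series errors, not values from the paper.
    print(tally_outcomes([0.5, 1.0, 2.0], [0.7, 1.0, 1.5]))
```

Because such a tally ignores the size of each difference, a model can collect many Wins and still have a higher average MASE, which is the situation described for the Tweets category.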
6.4.3 Anomaly detection
Hierarchical temporal memory (HTM) networks, as indicated in [22], are robust in identifying anomalies within the Numenta Anomaly Benchmark (NAB) dataset. We conducted a comparative assessment of HTM and cLSTM, as detailed in Table 8. The anomaly labels and HTM outcomes were sourced from the resource cited in footnote 2. We applied a threshold of 0.85 to the raw HTM score, deeming data points surpassing this threshold to be anomalies. Within this framework, true positives (TP) are the detected data points that are labeled as anomalies in NAB, whereas false positives (FP) are the detected data points that are not annotated as anomalies in NAB. The results show that cLSTM surpasses HTM in detecting true positives. Both HTM and cLSTM exhibit a relatively high number of FP instances. Because cLSTM is designed primarily to improve learning, it also flags errors associated with the model itself, in addition to anomalies. An illustrative depiction of HTM and cLSTM detection outcomes is presented in Fig. 16, in which the original and corrected time series are represented by the black and red lines, respectively. A black dot marks a data point flagged as an anomaly by HTM, and a red cross marks a data point corrected by cLSTM. Notably, Fig. 16 shows that cLSTM, in contrast to HTM, does not alter data solely based on statistical significance (e.g., outliers). Instead, cLSTM takes into account the extent to which these data points influence cell state behavior. For instance, within the interval 50–100, there is a peak in the time series that HTM identifies as an anomaly but cLSTM does not reconstruct. In general, cLSTM exhibits a strong capability to detect and rectify outliers and various anomalies.
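The TP and FP counting described above can be sketched as follows, assuming a list of raw anomaly scores and a parallel list of Boolean NAB labels. The function name `count_tp_fp` and the data layout are assumptions made for illustration; only the 0.85 threshold and the TP/FP definitions come from the text.

```python
def count_tp_fp(scores, labels, threshold=0.85):
    """Count true and false positives among points whose score exceeds `threshold`.

    scores -- raw anomaly scores, one per data point
    labels -- True where the point is annotated as an anomaly, else False
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    tp = fp = 0
    for score, is_anomaly in zip(scores, labels):
        if score > threshold:      # the point is flagged as an anomaly
            if is_anomaly:
                tp += 1            # flagged and annotated as an anomaly
            else:
                fp += 1            # flagged but not annotated
    return tp, fp


if __name__ == "__main__":
    # Toy scores and labels, not values from the NAB results.
    print(count_tp_fp([0.90, 0.20, 0.95, 0.85], [True, False, False, True]))
```

A score exactly equal to the threshold is not flagged here, following the wording that points must surpass the threshold.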
6.4.4 Computational cost
In this investigation, we calculated the computational time, particularly considering the computational requirements of the detection and correction components, which involve training and forecasting using SARIMA. As a result, an extra computational overhead is incurred in addition to the time taken for LSTM learning (refer to Table 9). These experiments were conducted on a machine equipped with an i7 processor featuring 6 CPU cores @ 2.60 GHz and 16 GB of RAM.
cLSTM incorporates a meta-learner during the detection process, which makes it slower than LSTM. Table 9 reports the average additional time required by cLSTM over LSTM, as observed during the experiments conducted in this paper.
7 Limitations and future works
One of the main limitations of cLSTM, rooted in its read & write machine learning (RW-ML) framework, is the computational overhead associated with its reliance on a meta-learner during the detection process. This component is computationally expensive as it requires fine-tuning to find the optimal orders. Future research could explore alternative methods, such as incorporating gating mechanisms directly into the LSTM cell architecture, to mitigate this computational burden and replace the meta-learner. Additionally, optimizing the correction component by integrating optimization search algorithms could enhance the efficiency of data correction processes and prevent the model from making incorrect corrections. Furthermore, investigating the model’s generalization capability across diverse datasets, enhancing the interpretability of cell states dynamics (CSD), and adapting cLSTM to dynamic environments are all promising avenues for future research. Improvements in these areas could further enhance the applicability and democratization of machine learning within the RW-ML paradigm.
8 Conclusions
Recurrent neural networks often face challenges due to data quality issues that impact their learning process. To bolster their resilience, the common approach is to apply data preprocessing techniques. Alternatively, one can adjust the data during the learning phase to address emerging problems. In our research, we introduce a novel approach named corrector long short-term memory (cLSTM), which leverages the cell states within the LSTM model. Cell state dynamics enable cLSTM to recognize learning issues and adapt the input data accordingly, with the help of a meta-learner that predicts these states. We verified the effectiveness of cLSTM through comprehensive comparisons with the standard LSTM on the M4 and NAB datasets. The outcomes indicate that cLSTM enhances the learning process by addressing learning issues that appear in the cell state dynamics. In these comparisons, cLSTM outperforms LSTM most clearly when LSTM struggles with errors. Moreover, comparisons between HTM and cLSTM on the NAB dataset highlight cLSTM’s potential to improve data quality, especially in identifying anomalies. To conclude, our findings highlight cLSTM’s ability to enhance forecasting accuracy, potentially transforming the learning paradigm from read-only to read-and-write models. An interesting direction for future exploration is to investigate whether similar improvements can be applied to neural network architectures other than LSTM.
Data availability
All data are fully available online.
Code availability
References
Yin W, Kann K, Yu M, Schütze H (2017) Comparative study of CNN and RNN for natural language processing. CoRR arXiv:1702.01923
Zhou C, Sun C, Liu Z, Lau FCM (2015) A C-LSTM neural network for text classification. CoRR arXiv:1511.08630
Graves A, Mohamed A-r, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, IEEE, pp 6645–6649
Zaytar MA, El Amrani C (2016) Sequence to sequence weather forecasting with long short-term memory recurrent neural networks. Int J Comput Appl 143(11):7–11
Siami-Namini S, Tavakoli N, Namin AS (2019) The performance of LSTM and BiLSTM in forecasting time series. In: 2019 IEEE international conference on big data (Big Data), IEEE, pp 3285–3292
Zheng J, Huang M (2020) Traffic flow forecast through time series analysis based on deep learning. IEEE Access 8:82562–82570. https://doi.org/10.1109/ACCESS.2020.2990738
Praveen Kumar B, Hariharan K, Shanmugam R, Shriram S, Sridhar J (2022) Enabling internet of things in road traffic forecasting with deep learning models. J Intell Fuzzy Syst 43(5):6265–6276. https://doi.org/10.3233/JIFS-220230
Makridakis S, Spiliotis E, Assimakopoulos V (2018) The M4 competition: results, findings, conclusion and way forward. Int J Forecast 34(4):802–808
Khashei M, Bijari M (2011) A novel hybridization of artificial neural networks and ARIMA models for time series forecasting. Appl Soft Comput 11(2):2664–2675. https://doi.org/10.1016/j.asoc.2010.10.015
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Ming Y, Cao S, Zhang R, Li Z, Chen Y, Song Y, Qu H (2017) Understanding hidden memories of recurrent neural networks. In: 2017 IEEE conference on visual analytics science and technology (VAST), pp 13–24. https://doi.org/10.1109/VAST.2017.8585721
Malhotra P, Vig L, Shroff GM, Agarwal P (2015) Long short term memory networks for anomaly detection in time series. In: The European symposium on artificial neural networks
Li P, Rao X, Blase J, Zhang Y, Chu X, Zhang C (2019) CleanML: a benchmark for joint data cleaning and machine learning [experiments and analysis]. CoRR arXiv:1904.09483
Stonebraker M, Rezig EK (2019) Machine learning and big data: what is important? IEEE Data Eng Bull 42:3–7
Tran KP, Nguyen HD, Thomassey S (2019) Anomaly detection using long short term memory networks and its applications in supply chain management. IFAC-PapersOnLine 52(13):2408–2412. https://doi.org/10.1016/j.ifacol.2019.11.567
Zhang R, Zou Q (2018) Time series prediction and anomaly detection of light curve using LSTM neural network. J Phys Conf Ser 1061:012012. https://doi.org/10.1088/1742-6596/1061/1/012012
Geiger A, Liu D, Alnegheimish S, Cuesta-Infante A, Veeramachaneni K (2020) TadGAN: time series anomaly detection using generative adversarial networks. arXiv preprint arXiv:2009.07769
Mozer MC, Kazakov D, Lindsey RV (2018) State-denoised recurrent neural networks. ArXiv arXiv:1805.08394
De Sa C, Feldman M, Ré C, Olukotun K (2017) Understanding and optimizing asynchronous low-precision stochastic gradient descent. SIGARCH Comput Archit News 45(2):561–574. https://doi.org/10.1145/3140659.3080248
Hundman K, Constantinou V, Laporte C, Colwell I, Söderström T (2018) Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. CoRR arXiv:1802.04431
Bontemps L, Cao VL, McDermott J, Le-Khac N (2017) Collective anomaly detection based on long short term memory recurrent neural network. CoRR arXiv:1703.09752
Ahmad S, Lavin A, Purdy S, Agha Z (2017) Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262:134–147. https://doi.org/10.1016/j.neucom.2017.04.070
Al-Qaness MAA, Ewees AA, Thanh HV, AlRassas AM, Dahou A, Elaziz MA (2023) Predicting CO\(_2\) trapping in deep saline aquifers using optimized long short-term memory. Environ Sci Pollut Res Int 30(12):33780–33794. https://doi.org/10.1007/s11356-022-24326-5
Cholakov R, Kolev T (2021) Transformers predicting the future. applying attention in next-frame and time series forecasting. CoRR arXiv:2108.08224
Kim N, Lee H, Lee J, Lee B (2021) Transformer based prediction method for solar power generation data. In: 2021 International conference on information and communication technology convergence (ICTC), pp 7–9. https://doi.org/10.1109/ICTC52510.2021.9620897
Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W (2020) Informer: beyond efficient transformer for long sequence time-series forecasting. In: AAAI conference on artificial intelligence. https://api.semanticscholar.org/CorpusID:229156802
Shabani A, Abdi A, Meng L, Sylvain T (2023) Scaleformer: iterative multi-scale refining transformers for time series forecasting
Al-Qaness MAA, Dahou A, Ewees AA, Abualigah L, Huai J, Abd Elaziz M, Helmi AM (2023) ResInformer: residual transformer-based artificial time-series forecasting model for PM2.5 concentration in three major Chinese cities. Mathematics 11(2):476
Woo G, Liu C, Sahoo D, Kumar A, Hoi SCH (2022) Etsformer: exponential smoothing transformers for time-series forecasting. CoRR arXiv:2202.01381
Mozer MC, Kazakov D, Lindsey RV (2018) State-denoised recurrent neural networks. CoRR arXiv:1805.08394
Gao Z, Shi X, Wang H, Zhu Y, Wang Y, Li M, Yeung D-Y (2022) Earthformer: exploring space-time transformers for earth system forecasting. ArXiv
Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R (2022) Fedformer: frequency enhanced decomposed transformer for long-term series forecasting. ArXiv
Zhang Y, Yan J (2023) Crossformer: transformer utilizing cross-dimension dependency for multivariate time series forecasting. In: International conference on learning representations. https://api.semanticscholar.org/CorpusID:259298223
Ng A (2021) Data-centric AI competition
Kumar B, Sunil, Yadav N (2023) A novel hybrid model combining SARMA and LSTM for time series forecasting. Appl Soft Comput 134:110019. https://doi.org/10.1016/j.asoc.2023.110019
Xue S, Chen H, Zheng X (2022) Detection and quantification of anomalies in communication networks based on LSTM-ARIMA combined model. Int J Mach Learn Cybern 13(10):3159–3172
Jeong S, Ferguson M, Law K (2019) Sensor data reconstruction and anomaly detection using bidirectional recurrent neural network. SPIE, Bellingham, p 25
Nguyen HD, Tran KP, Thomassey S, Hamad M (2021) Forecasting and anomaly detection approaches using LSTM and LSTM autoencoder techniques with the applications in supply chain management. Int J Inf Manag 57:102282. https://doi.org/10.1016/j.ijinfomgt.2020.102282
Maya S, Ueno K, Nishikawa T (2019) DLSTM: a new approach for anomaly detection using deep learning with delayed prediction. Int J Data Sci Anal 8:137–164. https://doi.org/10.1007/s41060-019-00186-0
Laptev N, Yosinski J, Li LE, Smyl S (2017) Time-series extreme event forecasting with neural networks at Uber. In: International conference on machine learning, vol 34, pp 1–5
Kamyab M, Liu G, Adjeisah M (2021) Attention-based CNN and Bi-LSTM model based on TF-IDF and GloVe word embedding for sentiment analysis. Appl Sci 11(23):11255. https://doi.org/10.3390/app112311255
Jim K, Horne B, Giles C (1994) Effects of noise on convergence and generalization in recurrent networks. In: Tesauro G, Touretzky D, Leen T (eds) Advances in neural information processing systems, vol 7. MIT Press, Cambridge
Krishnan S, Wang J, Wu E, Franklin MJ, Goldberg K (2016) ActiveClean: interactive data cleaning for statistical modeling. Proc VLDB Endow 9(12):948–959. https://doi.org/10.14778/2994509.2994514
Frenay B, Verleysen M (2014) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869. https://doi.org/10.1109/TNNLS.2013.2292894
Oliveira JRD, Lima ERD, Almeida LMD, Wanner L (2021) Improving sensor data quality with predictive models, pp 735–740. https://doi.org/10.1109/WF-IoT51360.2021.9595020
Baptista A, Baghoussi Y, Soares C, Mendes-Moreira J, Arantes M (2021) Pastprop-RNN: improved predictions of the future by correcting the past. CoRR arXiv:2106.13881
Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst 6(02):107–116
Strobelt H, Gehrmann S, Huber B, Pfister H, Rush AM (2016) Visual analysis of hidden state dynamics in recurrent neural networks. CoRR arXiv:1606.07461
Strobelt H, Gehrmann S, Pfister H, Rush AM (2017) LSTMVis: a tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Trans Vis Comput Graph 24(1):667–676
Dong X, Rekatsinas T (2018) Data integration and machine learning: a natural synergy. Proc VLDB Endow 11:2094–2097. https://doi.org/10.14778/3229863.3229876
Whang SE, Roh Y, Song H, Lee J-G (2021) Data collection and quality challenges in deep learning: a data-centric AI perspective. arXiv preprint arXiv:2112.06409
Bowerman BL, O’Connell RT (1993) Forecasting and time series: an applied approach. Duxbury Press, New York
Graves A (2012) Long short-term memory. Springer, Berlin, Heidelberg, pp 37–45
Lavin A, Ahmad S (2015) Evaluating real-time anomaly detection algorithms–the Numenta anomaly benchmark. CoRR arXiv:1510.03336
Hyndman RJ, Koehler AB (2006) Another look at measures of forecast accuracy. Int J Forecast 22(4):679–688. https://doi.org/10.1016/j.ijforecast.2006.03.001
Diebold FX, Mariano RS (2002) Comparing predictive accuracy. J Bus Econ Stat 20(1):134–144
Acknowledgements
This work has been partially funded by the SONAE IM LAB@FEUP, under a research Ph.D. project funded by Inovretail, by projects AISym4Med (101095387) supported by Horizon Europe Cluster 1: Health, ConnectedHealth (n. 46858), supported by Competitiveness and Internationalisation Operational Programme (POCI) and Lisbon Regional Operational Programme (LISBOA 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and NextGenAI - Center for Responsible AI (2022-C05i0102-02), supported by IAPMEI, as well as by FCT plurianual funding for 2020–2023 of LIACC (UIDB/00027/2020_UIDP/00027/2020). We also acknowledge the contribution of Hugo Lopes from Inovretail to this work.
Funding
Open access funding provided by FCT|FCCN (b-on).
Author information
Contributions
The authors contributed equally to this work, except for the coding, which was done by the first author.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Baghoussi, Y., Soares, C. & Mendes-Moreira, J. Corrector LSTM: built-in training data correction for improved time-series forecasting. Neural Comput & Applic 36, 16213–16231 (2024). https://doi.org/10.1007/s00521-024-09962-x
DOI: https://doi.org/10.1007/s00521-024-09962-x