1 Introduction

Recurrent neural networks (RNNs) form the foundation of sequence learning. In a diverse array of machine learning applications that necessitate sequence modeling, variations of RNNs, such as long short-term memory (LSTM) and gated recurrent units (GRU), have demonstrated their efficacy in capturing long-term dependencies. LSTM and GRU have found success in applications ranging from natural language processing [1] and text classification [2] to speech recognition [3] and forecasting ([4, 5]). In the realm of forecasting, RNNs have emerged as formidable contenders against traditional statistical methods [6, 7], particularly highlighted by their impressive performance in winning the M4 competition [8].

At the heart of the RNN’s functionality lies its ability to propagate information from past observations to future ones through internal states. This capability has enabled RNNs to excel in capturing nonlinear patterns within data [9]. Long short-term memory (LSTM), a sophisticated variant of RNN, employs gate mechanisms [10], enhancing the network’s aptitude for encoding long-term dependencies. The LSTM cell generates two states: a hidden state and a cell state. The cell state represents a cumulative memory of the LSTM network across multiple time steps, making it a repository for preserving long-term information [11]. We posit that understanding the dynamics of cell states (CSD) could furnish valuable insights into the characteristics of the learned data. Consequently, we investigate the feasibility of utilizing CSD to identify learning process problems within the training data. In contrast to conventional RNNs, which essentially possess read-only access to data, we introduce a novel RNN variant endowed with both read-and-write privileges.

Existing literature reveals a schism between researchers focusing on improving algorithmic learning capacity and those dedicated to enhancing data quality. Some researchers have leveraged the model’s feedback to rectify data discrepancies (e.g., employing prediction errors [12]). However, these approaches generally treat model learning and data preprocessing as separate pursuits.

To bridge this gap, we introduce a novel paradigm known as read & write machine learning (RW-ML). RW-ML augments traditional read-only models with the capability to not only learn from data but also dynamically modify it when required. This paradigm shift opens up new avenues for enhancing the adaptability and performance of machine learning models, particularly in dynamic and evolving environments. We propose a fresh variant of recurrent neural networks (RNNs) called the corrector long short-term memory (cLSTM). The principal objective of cLSTM is to seamlessly integrate data preprocessing into the learning process. By harnessing the cell and hidden states inherent in LSTM models, we hypothesize that these states furnish valuable insights for identifying learning process aberrations (see: Fig. 1). Such insights empower the model to dynamically adjust the data, optimizing the learning process and consequently refining predictive outcomes. Through astute utilization of LSTM’s cell and hidden states, cLSTM optimizes data processing and learning, culminating in heightened predictive accuracy.

Fig. 1 Cell states (top) and time series (bottom) for an LSTM of size 12

Extensive experiments on benchmark datasets, including the Numenta Anomaly Benchmark (NAB) and M4 competition dataset, validate cLSTM’s effectiveness in forecasting and anomaly detection tasks, demonstrating its superiority over traditional read-only LSTM models and hierarchical temporal memory, respectively.

The main contributions of the paper are summarized as follows:

  1. Introduction of the read & write machine learning (RW-ML) paradigm.
  2. Utilization of cell states for detecting learning problems.
  3. Introduction of cLSTM, an RW-ML model leveraging internal states for dynamic data adjustment.
  4. Empirical validation through experiments on the NAB and M4 competition datasets.
  5. Demonstrated superiority of cLSTM over traditional LSTM and hierarchical temporal memory in forecasting and anomaly detection tasks, respectively.

2 Background

In the landscape of machine learning research, a distinct disparity has existed, wherein the emphasis on enhancing models has often overshadowed the crucial importance of data collection and data quality considerations [13]. This discrepancy is palpable in the industry, where a disproportionate allocation of 90% of machine learning research efforts is directed toward refining algorithms, leaving a mere 10% for data preparation. Some argue that this distribution should be inverted [14]. Scholarly inquiries into data preparation have given rise to various approaches applied prior to the learning process. These methodologies are often categorized as anomaly detection [12, 15,16,17], denoising [18], and concept drifts [19].

Many of these endeavors rely on prediction errors. For instance, [12] and [20] employ LSTM models to predict future time steps and identify substantial deviations from these predictions. Discrepancies between observed and predicted values are compared to a threshold, and an input is flagged as faulty if the disparity exceeds it. A similar approach has been extended to collective anomaly detection [21]. Moreover, [22] harnesses the hidden states of the hierarchical temporal memory algorithm to compute deviations. [23] introduced AOSMA-LSTM, an LSTM that integrates an enhanced version of the Aquila optimizer (AO) algorithm with the search mechanisms of the slime mold algorithm (SMA).

Recent research has explored the use of transformers in time-series forecasting, with a focus on improving both accuracy and efficiency. [24] and [25] both investigated the application of transformer-like models in time-series data, with [24] concentrating on next-frame prediction and [25] on solar power generation data. In [26], the authors proposed Informer, an efficient transformer-based model for long sequence time-series forecasting. In [27], the authors introduced Scaleformer, a multi-scale refining transformer, which iteratively refines forecasted time series at multiple scales, achieving significant performance improvements. Moreover, [28] introduced ResInformer, a novel transformer-based approach tailored for forecasting PM2.5 concentration in major Chinese cities. It leverages an improved Informer architecture with attention distillation and residual block-inspired structures. Through extensive evaluation using 98 months of air quality index datasets, ResInformer outperformed Informer, showcasing its potential for accurate and efficient air pollution forecasting in urban environments. Similarly, [29] introduced ETSformer, a time-series transformer architecture that incorporates exponential smoothing to enhance forecasting. Additionally, [30] proposed state-denoised recurrent neural networks, aiming to denoise sequential data for improved modeling. A series of transformer-based models has been proposed for time-series forecasting, each with its unique features. [31] introduces Earthformer, a space-time transformer for Earth system forecasting. Furthermore, [32] enhances the performance of the transformer for long-term series forecasting with FEDformer, combining the transformer with the seasonal-trend decomposition method. [33] focuses on multivariate time-series forecasting, proposing Crossformer, a transformer-based model that captures both cross-time and cross-dimension dependencies.

Recent developments have signaled a shift toward data-centric AI [34], where the primary objective is to utilize machine learning to enhance data quality, ultimately leading to improved model accuracy. Early research efforts have explored hybrid approaches aiming to leverage the intrinsic interplay between data and predictive performance. For instance, [35] and [36] proposed a hybrid model integrating an LSTM network with the Seasonal Autoregressive Integrated Moving Average (SARIMA) model. Similarly, [12, 15, 20, 37, 38] utilized LSTM, bidirectional LSTM (BiLSTM), and LSTM autoencoders as preprocessing steps to detect anomalies in the data. These methods rely on discrepancies between predicted and observed values, quantified as prediction errors. Variants of recurrent neural networks (RNNs) like delayed LSTM (dLSTM) [39] have also harnessed predictive errors from normal data to uncover anomalies. Autoencoder-based techniques [38, 40] have also been employed for anomaly detection. Similarly, [41] introduced an attention-based model named ACL-SA, which combines Convolutional Neural Networks (CNNs) with long short-term memory (LSTM) networks for enhancing data in text classification tasks. Numerous methodologies have demonstrated that neural networks display reduced sensitivity to minor levels of noise [19]. These studies suggest that noise can even contribute to improved generalization and convergence [42]. However, this principle does not universally apply to all forms of noise, such as non-white noise in input data [43] or inaccuracies in labels [44]. In scenarios where an anomaly is detected, various correction techniques involve substituting faulty data with predicted sequences, combining the two [45], or employing generated sequences (e.g., TAD-GAN [17]). It is essential to recognize that faults are often categorized as anomalies, missing data, or noise. 
However, beyond these classifications, other types of errors, including those identified by the learning process itself, can profoundly impact the learning process. A recent approach [46] advocates for detecting prediction errors during the learning process. This strategy allows data to be processed using the model’s feedback, opening up promising avenues for data-centric AI and its potential to further enhance model performance.

2.1 Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) specifically designed to capture and model long-term interactions within sequential data. In contrast to feed-forward neural networks, RNNs possess recurrent connections that enable them to learn from sequential input, making them adept at handling time-dependent patterns. However, traditional RNNs are prone to vanishing and exploding gradients [47], which hinders their ability to learn long-term dependencies effectively.

To overcome this limitation, LSTM incorporates a gating mechanism [10], which involves the addition of three gates to the standard RNN architecture: input, output, and forget gates. These gates are realized through specific mathematical transformations (Eqs. 1a to 1c), resulting in the creation of two essential states: hidden states, denoted as h (Eq. 1f), and cell states, denoted as c (Eq. 1e). These states enable the LSTM network to retain relevant information from previous time steps and seamlessly combine it with the current input.

As data is fed into the LSTM network, the input gate, implemented with a sigmoid function, determines which information is essential and should be retained. Simultaneously, the forget gate, also using a sigmoid activation, decides which information from the previous state should be discarded. The output gate, yet another sigmoid function, regulates which parts of the combined information should be exposed to the subsequent layers of the network. Through this gating mechanism, LSTM effectively manages the flow of information, allowing it to capture long-term dependencies and learn intricate temporal patterns within sequential data.

At each time step t, an LSTM cell takes an input \(x_t\) along with the previous hidden state \(h_{t-1}\) and other internal cell data to compute the next hidden and memory states through feed-forward and recurrent connections. Consequently, the LSTM cell generates both a hidden state (\(h_t\)) and a cell state (\(c_t\)) at each time step. The hidden state \(h_t\) can serve as the final output of the network or be passed to another LSTM cell if additional sequences need to be learned. Typically, the last hidden vector is fed through a linear, fully connected layer to produce predictions in sequence learning tasks.

$$\begin{aligned} i_t&= \sigma (W_i \cdot h_{t-1} + V_i \cdot x_t + b_i) \end{aligned}$$
(1a)
$$\begin{aligned} o_t&= \sigma (W_o \cdot h_{t-1} + V_o \cdot x_t + b_o) \end{aligned}$$
(1b)
$$\begin{aligned} f_t&= \sigma (W_f \cdot h_{t-1} + V_f \cdot x_t + b_f) \end{aligned}$$
(1c)
$$\begin{aligned} \hat{C_t}&= tanh (W_c \cdot h_{t-1} + V_c \cdot x_t + b_c) \end{aligned}$$
(1d)
$$\begin{aligned} C_t&= i_t \cdot \hat{C_t} + f_t \cdot C_{t-1} \end{aligned}$$
(1e)
$$\begin{aligned} h_t&= o_t \cdot tanh (C_t) \end{aligned}$$
(1f)
$$\begin{aligned} z_t = h_t \end{aligned}$$
(1g)
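As an illustration, Eqs. 1a to 1g can be sketched as a single forward step in NumPy. This is a minimal sketch, not the implementation used in the experiments: the names `lstm_step` and `init_params` are hypothetical, the weights are random rather than trained, and the products in Eqs. 1e and 1f are read as element-wise, as is standard for LSTM.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Eqs. 1a-1g.

    params maps each of "i", "o", "f", "c" to a tuple (W, V, b):
    W acts on h_{t-1}, V acts on x_t, and b is the bias.
    """
    def affine(name):
        W, V, b = params[name]
        return W @ h_prev + V @ x_t + b

    i_t = sigmoid(affine("i"))        # Eq. 1a, input gate
    o_t = sigmoid(affine("o"))        # Eq. 1b, output gate
    f_t = sigmoid(affine("f"))        # Eq. 1c, forget gate
    c_hat = np.tanh(affine("c"))      # Eq. 1d, candidate cell state
    c_t = i_t * c_hat + f_t * c_prev  # Eq. 1e, element-wise products
    h_t = o_t * np.tanh(c_t)          # Eq. 1f
    return h_t, c_t                   # z_t = h_t (Eq. 1g)


def init_params(n_hidden, n_input, seed=0):
    """Random illustrative weights (assumption: not trained values)."""
    rng = np.random.default_rng(seed)
    return {
        name: (
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            rng.normal(scale=0.1, size=(n_hidden, n_input)),
            np.zeros(n_hidden),
        )
        for name in ("i", "o", "f", "c")
    }
```

Calling `lstm_step(np.array([0.5]), np.zeros(12), np.zeros(12), init_params(12, 1))` returns a hidden state and a cell state of length 12 each, one value per hidden unit.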

3 Cell states dynamics analysis

The fundamental premise underlying the proposed method, cLSTM, revolves around the concept of cell state dynamics (CSD). In conventional LSTM networks, data-related information is captured and stored within its internal states. These states are then utilized to predict future values, under the assumption that the information contained is accurate and pertinent. However, input data may contain observations that can mislead the model’s learning and subsequent predictions. This scenario arises when the training process attempts to comprehend intricate data components while incorporating a subset of observations that might not align with the model’s requirements. This situation can have adverse effects on the model, potentially leading to suboptimal data modeling outcomes.

CSD pertain to alterations in the attributes of cell states over time [48]. In an LSTM architecture, the dimensionality of cell states corresponds to the LSTM size. In simpler terms, at each time step, a cell produces a cell state vector of a length equal to the number of hidden units. These cell states can be organized into a concatenated signal (\(c_1,...,c_T\)). In [49], the authors used a similar approach for interpretability. The changes over time of the cell states or hidden states are activation signals of the model. When an LSTM network with N hidden units processes a time series, N values are produced at each time step, resulting in a concatenated vector highly representative of the seasonality and information captured from the input sequence and previous sequences.
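The concatenation described above can be sketched as follows. The function name `concatenate_cell_states` and the toy values are hypothetical; the sketch assumes the per-step cell states are already available as a (T, N) array.

```python
import numpy as np


def concatenate_cell_states(cell_states):
    """Flatten per-step cell state vectors (c_1, ..., c_T) into one signal.

    cell_states: array-like of shape (T, N), one row per time step.
    Returns a 1-D array of length T * N, ordered step by step.
    """
    states = np.asarray(cell_states, dtype=float)
    if states.ndim != 2:
        raise ValueError("expected shape (T, N)")
    return states.reshape(-1)


# Toy example: T = 3 time steps, N = 12 hidden units (placeholder values).
T, N = 3, 12
toy_states = np.tanh(np.arange(T * N, dtype=float).reshape(T, N) / 10.0)
signal = concatenate_cell_states(toy_states)
```

With N values per step, the resulting signal repeats with a period of N, which is one way to read the seasonality mentioned above.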

Fig. 2 Plot-Top compares the LSTM cell states and cLSTM cell states. Plot-bottom compares the original time series and time series produced by cLSTM. The gray interval on top shows the cell states of its corresponding data in the gray interval at the bottom

Fig. 3 Plot-Top compares the LSTM cell states and cLSTM cell states. Plot-bottom compares the original time series and time series produced by cLSTM

Figure 2 (top) illustrates the cell states learned from a time-series sequence using both LSTM and cLSTM. This depiction is corroborated by the corresponding time series shown in Fig. 2 (bottom). Notably, the cell states displayed in Fig. 2 (top) exhibit minimal variation outside the delineated gray interval. In contrast, Fig. 2 (bottom) depicts partially altered time-series data.

Cell states serve as representations of processed information extending from the start of a sequence up to the current time step. Over the course of time, these states dynamically evolve in response to input data and its influence on the learning process. Nevertheless, instances where the input deviates unexpectedly or lacks relevance to the model’s objective prompt the LSTM to adapt its learning behavior. Consequently, this adaptation triggers changes in the cell states, as highlighted in Fig. 3. By scrutinizing these cell states, it becomes feasible to identify instances where the learning process deviates from anticipated behavior, thereby facilitating the detection of issues in the learning process.

4 Corrector LSTM

Incorporating data preprocessing tasks into machine learning processes enhances the accuracy and usability of analytical systems [50]. In addition, aligning data preprocessing with the learned model can result in accuracy gains. Neglecting model feedback in a preprocessing stage may lead to erroneous changes to the data [51]. To address this concern, we propose a solution that involves preprocessing the data in the learning process. We introduce corrector long short-term memory (cLSTM; see Footnote 1). The core principle underlying cLSTM revolves around integrating preprocessing into the learning process itself, recognizing the crucial influence of model feedback on effective data preprocessing. The cLSTM framework involves two key stages: detection and correction, as illustrated in Fig. 4.

Fig. 4 cLSTM: An LSTM that iteratively detects issues in the learning process and corrects the corresponding inputs accordingly

4.1 Detection component

The detection component (DC) uses cell state dynamics produced by LSTM to identify abnormal learning behaviors. This component analyzes cell states as follows:

Let \( \mathcal {S} =(x_1,x_2,\cdots ,x_T) \) be a sequence drawn from a distribution \(\mathcal {D}_{data}\). At each iteration, an observation \(x_t\) along with a hidden state from the previous observation \(h_{t-1}\) is processed by an LSTM cell, which uses that to update all its units through a forward pass and computes the error vector for all its weights through a backward pass. Finally, it produces a new cell state \(c_t\), which allows us to get the new hidden states \(h_t\). For a sequence of length k, the model produces k cell states over the range \([t-k, t]\) which can be represented as follows:

$$\begin{aligned} \overset{t}{\underset{p=t-k}{C}} = i_p \cdot \hat{C_p} + f_p \cdot C_{p-1} \end{aligned}$$
(2)

where \(f_p\) and \(i_p\) are the forget and input gates, respectively. \(f_p\) determines what information should be kept from the previous time step, while \(i_p\) decides which information should be kept from the current time step. \(\hat{C_p}\) denotes the new cell state candidates.

At a given epoch \(e_{i}\), a concatenation of the cell states (i.e., Eq. 2) of all time steps is used to train a Seasonal ARIMA (SARIMA) model. During epoch \(e_{i+1}\), SARIMA forecasts the values of the cell states at each time step while LSTM is learning. These forecasts are compared to the cell states produced by LSTM in the same epoch. The comparison is performed using the Euclidean distance to quantify the difference between the forecasted and the actual cell states. A threshold \(\eta \) is employed to detect changes in the cell states: if the distance exceeds \(\eta \), cLSTM identifies an issue in the learning process, prompting a modification to the corresponding input (see: Fig. 5). SARIMA is chosen because the cell states are stationary and highly seasonal. It is important to let the model train for a number of epochs first, so that the data is well represented in the cell states.

Fig. 5 LSTM cell states of three time steps

4.1.1 Seasonal ARIMA

Seasonal ARIMA (SARIMA) is an adaptive ARIMA model used when the time-series exhibits seasonal variation. ARIMA is defined using (p,d,q) parameters, also called the ARIMA order. d is the level of differencing, p is the autoregressive order, and q is the moving average order [52]. The ARIMA model is defined in Eq. 3.

$$\begin{aligned} \begin{aligned} z_t&= \delta + \phi _1 z_{t-1} + \phi _2 z_{t-2} + \ldots + \phi _p z_{t-p} \\&\quad + a_t - \theta _1 a_{t-1} - \theta _2 a_{t-2} - \ldots - \theta _q a_{t-q} \end{aligned} \end{aligned}$$
(3)

where \(z_t\) is the value of the series at time t after differencing, \( \delta \) is a constant, \(\phi _i\) are the autoregressive coefficients, \(a_t\) is the random shock corresponding to time period t, and \(\theta _i\) are the moving average coefficients.

SARIMA adds to ARIMA a seasonal order \((\textit{P},\textit{D},\textit{Q})_s\), which corresponds to the seasonal autoregressive order (P), the seasonal differencing order (D), and the seasonal moving average order (Q). The variable s indicates the length of the seasonal period. For example, a sequence with a seasonal period of 20 observations would have \(s=20\).
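For reference, the SARIMA model with non-seasonal order (p,d,q) and seasonal order (P,D,Q) with period s is commonly written in backshift-operator form. The expression below is the standard textbook form, given here as a sketch using the sign convention of Eq. 3. The symbols \(y_t\) (the series before differencing) and \(B\) (the backshift operator, \(B y_t = y_{t-1}\)) are assumptions introduced for illustration and do not appear elsewhere in this paper.

$$\begin{aligned} \phi _p(B)\,\Phi _P(B^s)\,(1-B)^d\,(1-B^s)^D\, y_t = \delta + \theta _q(B)\,\Theta _Q(B^s)\, a_t \end{aligned}$$

Here \(\phi _p(B) = 1 - \phi _1 B - \ldots - \phi _p B^p\) and \(\theta _q(B) = 1 - \theta _1 B - \ldots - \theta _q B^q\), while \(\Phi _P\) and \(\Theta _Q\) are the analogous seasonal polynomials in \(B^s\). With \(P = D = Q = 0\) and \(z_t = (1-B)^d y_t\), this expression reduces to Eq. 3.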

The SARIMA order selection process is automated using the auto_arima package, which conducts a grid search over a range of parameters (\( p \), \( d \), \( q \)) for the non-seasonal component and (\( P \), \( D \), \( Q \)) for the seasonal component. By exploring various combinations of these parameters, the package identifies the optimal SARIMA model for the given time-series data. Additionally, the parameter \( m \) is set to the size of the LSTM, representing the length of the LSTM cell states produced at each timestamp. This automated approach streamlines the SARIMA order selection process, enabling efficient and data-driven forecasting.

4.1.2 Similarity measure

cLSTM detects problems in the learning process by predicting the values of the cell states and comparing them to the observed ones. Therefore, it is necessary to measure the similarity between these two signals. To this end, the Euclidean point-by-point mapping approach is used (Eq. 4).

$$\begin{aligned} \textit{d} = \sqrt{\left( {x_1 - x_2 } \right) ^2 + \left( {y_1 - y_2 } \right) ^2 } \end{aligned}$$
(4)
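A minimal sketch of the detection test is given below. It assumes that the distance of Eq. 4 is generalized to vectors of cell states of equal length; the function names are hypothetical, and `eta` stands for the detection threshold \(\eta \).

```python
import numpy as np


def euclidean_distance(forecast, actual):
    """Euclidean distance between forecasted and observed cell states."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if forecast.shape != actual.shape:
        raise ValueError("forecast and actual must have the same shape")
    return float(np.sqrt(np.sum((forecast - actual) ** 2)))


def learning_issue_detected(forecast, actual, eta):
    """Return True when the distance exceeds the detection threshold eta."""
    return euclidean_distance(forecast, actual) > eta
```

For example, `euclidean_distance([0.0, 0.0], [3.0, 4.0])` evaluates to 5.0, which would exceed any of the detection thresholds considered in this paper.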

If an erroneous behavior occurs at time step t, depending on the threshold, cLSTM can trigger the correction component (see: Fig. 4). Similarly to constant error flow [53], a constant data error (CDE) flows gradually over time. CDE can be represented as follows:

$$\begin{aligned} C_{t} = \overset{t}{\underset{p = t-k}{\sum }\ } C_{t,p+1} + \epsilon _{p} \end{aligned}$$
(5)

where \( \epsilon _{p} \) represents the CDE caused by the learned weights related to disrupting inputs from the previous iteration.

The detection threshold in cLSTM serves as a criterion for determining whether the disparity between the estimated cell states and the actual cell states generated by the LSTM cell is significant. If the calculated distance surpasses this threshold, indicating substantial dissimilarity, a correction process is initiated. During this phase, the model iterates through various adjustments to the input data, prompting the LSTM to generate new cell state signals for comparison with the estimated ones. This iterative correction process continues until the distance between the estimated and actual cell states falls below the correction threshold, signifying successful learning. Upon meeting this criterion, the model proceeds to process the next input.

4.2 Correction component

The Correction Component (CC) reconstructs each input associated with issues detected by the detection component (DC) in the learning process. The model states are preserved at each time step. Subsequently, the model is loaded prior to modifying the data.

In the data correction process, the correction component initially modifies the data using an update value parameter, denoted as \( \alpha \) (e.g., \( \alpha = 0.1 \)), selected based on the dataset’s scaling. If the data is standardized, the default update value of 0.1 can be a good choice. However, if Min–Max scaling is used instead, a smaller update value such as 0.01 is preferable. The correction component monitors the LSTM model’s cell states following the data adjustment. Through an iterative process, the correction component keeps updating the input by \( \alpha \) in the same direction as long as the cell states move toward the estimated states, until the desired correction level is reached. This ensures the data conforms more accurately to the learning problem, enhancing the model’s ability to capture underlying patterns and relationships.

Once the data has been altered, the model is retrained on it, and the model states are once again preserved. This cycle is repeated until the disparity between the SARIMA forecast and the updated cell states becomes less than a designated reconstruction threshold \(\delta \). The corresponding objective function can be defined in Eq. 6.

$$\begin{aligned} \min \textit{d}(C_t, S_t) \end{aligned}$$
(6)

where \(S_t\) is the SARIMA forecast and \(C_t\) is the actual cell state at time step \(\textit{t}\).
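The iterative correction can be illustrated with the simplified sketch below, which rests on several assumptions. The callable `cell_state_fn` is a hypothetical stand-in for retraining the model and reading its cell states. Both directions are tried at every step, which simplifies the direction rule described above. The defaults mirror the update value (0.1), correction threshold (0.2), and five-iteration early stopping reported in this paper.

```python
import numpy as np


def correct_input(x, cell_state_fn, forecast, alpha=0.1, delta=0.2, max_iter=5):
    """Nudge a scalar input until the cell-state distance drops below delta.

    cell_state_fn: hypothetical callable mapping an input value to the
        cell state vector the model would produce for it.
    forecast: forecasted cell state vector (S_t in Eq. 6).
    Returns (corrected_x, distance, converged).
    """
    def dist(value):
        return float(np.linalg.norm(cell_state_fn(value) - forecast))

    best_x, best_d = x, dist(x)
    for _ in range(max_iter):
        if best_d < delta:
            break
        # Try one step of size alpha in each direction; keep the better one.
        candidates = [best_x + alpha, best_x - alpha]
        distances = [dist(c) for c in candidates]
        k = int(np.argmin(distances))
        if distances[k] >= best_d:
            break  # no improvement in either direction; stop early
        best_x, best_d = candidates[k], distances[k]
    return best_x, best_d, best_d < delta


# Toy stand-in: 12 cell state values that all depend on the input via tanh.
def toy_cell_states(value):
    return np.tanh(np.full(12, value))


toy_forecast = np.zeros(12)  # placeholder for a SARIMA forecast
x_new, d_new, converged = correct_input(0.35, toy_cell_states, toy_forecast)
```

In this toy setting the input is moved from 0.35 toward 0.05 in three steps, after which the distance falls below the correction threshold. An input that starts too far away (e.g., 2.0) exhausts the five iterations without converging.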

Fig. 6 Cell state dynamics before (left) and after (right) reconstruction

We use \( \eta \) and \( \delta \) as detection and correction thresholds (see: Fig. 4), and \( \delta < \eta \). The choice of the detection threshold \( \eta \) and correction threshold \( \delta \) strongly depends on the scaling method used. In the context of LSTM models, the cell states are influenced by the input data and the weights learned during training. The scaling method applied to the input data affects the values of the cell states because the input data directly influences the cell state values through the input gates, forget gates, and output gates of the LSTM units. For instance, if the input data is standardized, the values fed into the LSTM units are within a certain range, typically around 0 with a spread determined by the standard deviation. As the LSTM units process the input sequence and update their cell states over time, the values of the cell states tend to stay within a similar range, which in our experiments appears to be between -1 and 1 due to our choice of standardization as the scaling method.

Once Eq. 6 is solved, we can confirm that \(\hat{X_t}\) does not affect the cell states behavior. Figure 6 shows cell states before and after the reconstruction.

5 Empirical study

We conduct a comparative analysis between cLSTM and LSTM using multiple time series. Our experimental setup involves utilizing univariate time series as training data. To predict multi-time step labels during the learning process, we leverage multi-time step samples. Upon completion of the training, the outcomes encompass the network’s weights and the refined time series. Given that the core objective of cLSTM is to enhance data quality, we also assess the algorithm’s proficiency in identifying and rectifying anomalies using the Numenta Anomaly Benchmark. Lastly, we provide insights into the computational overhead.

5.1 Research questions

We delineate the following pivotal research questions as the focal points of our study:

  1. Does cLSTM perform better than LSTM in time series used in time-series forecasting research?
  2. Does cLSTM perform better than LSTM in time series used for anomaly detection research?
  3. Can cLSTM be used for anomaly detection?

6 Experimental setup

LSTM is the prime model for comparison with cLSTM. Thus, our study is exclusively centered on the head-to-head analysis of cLSTM and LSTM. Both models were evaluated using the same datasets and a fixed seed. Moreover, for the anomaly detection task, cLSTM was compared to hierarchical temporal memory networks.

6.1 Datasets

We used an extensive dataset containing 1051 time-series, representing various applications. Within this dataset, 995 time-series were sourced from the renowned Macro M4 competition datasets. The Macro M4 dataset comprises six subsets, of which we chose five containing 199 time-series each. Additionally, we incorporated 55 time-series from Numenta Anomaly Benchmark (NAB) dataset. The NAB dataset features different types of time-series from various areas [54]. These time-series come from different sources and include different levels of anomalies. Each time-series in the NAB dataset is labeled accordingly. NAB’s categories include Artificial data Without anomalies (AWt), Artificial data With anomalies, realAdExchange (EX), realAWSCloudwatch (AWS), realKnownCause (RKC), realTraffic (Traffic), and realTweets (Tweets). For time-series exceeding 1000 data points, we used the first 500 points to manage computation.

6.2 Models’ parameters

For implementation, we used LSTM developed with PyTorch. The model input sequences are of length 1, generating cell state vectors of length 12 for each unit (given the LSTM size of 12). The model consists of a single layer, and training occurred over 50 epochs. Cell state dynamics were predicted at each timestamp, resulting in a SARIMA forecast length of 12 steps. The training parameters for the cLSTM and LSTM models are compared in Table 1.

Table 1 Training parameters for cLSTM and LSTM models

In determining the threshold for the disparity between real and predicted cell states, our methodology involves a fine-tuning process. We systematically explore threshold values ranging from 0.6 to 1.4 in increments of 0.2 and evaluate the model’s performance across multiple experiments to identify the optimal threshold. This iterative approach allows us to select the threshold value that maximizes the model’s efficacy in learning problem detection. Additionally, the choice of threshold depends on the range of values exhibited by the cell states. Therefore, it primarily relies on the LSTM parameters and the scaling applied to the input data.
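The threshold sweep described above can be sketched as a small grid search. The callable `evaluate` is hypothetical and is assumed to return a validation error (lower is better) for a given detection threshold. The grid bounds and step follow the values stated above.

```python
def threshold_grid(start=0.6, stop=1.4, step=0.2):
    """Detection thresholds from start to stop inclusive.

    Values are rounded to avoid floating-point drift.
    """
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(n)]


def select_threshold(thresholds, evaluate):
    """Pick the threshold with the lowest score.

    evaluate: hypothetical callable returning a validation error
        (lower is better) for a given detection threshold.
    """
    scores = {eta: evaluate(eta) for eta in thresholds}
    best = min(scores, key=scores.get)
    return best, scores
```

Here `threshold_grid()` yields 0.6, 0.8, 1.0, 1.2, and 1.4. In practice `evaluate` would train and score cLSTM for each value, which the sketch leaves abstract.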

6.3 Evaluation

To ensure an equitable assessment between LSTM and cLSTM, a fixed initial configuration was applied across experiments involving the NAB dataset and each subset of the M4 datasets, for various learning-problem detection thresholds. Experiments were conducted using five detection thresholds (\(\eta \)) at 0.6, 0.8, 1, 1.2, and 1.4. The correction threshold (\(\delta \)) was consistently set at 0.2. No fine-tuning of \(\delta \) was carried out, and its impact was not analyzed in this context. Convergence was determined by an early stopping criterion of five iterations, triggered when the model could not reach a distance below \(\delta \). An 88% training and 12% testing holdout split was used. Prior to training, a standardization step was applied. For consistency and fair evaluation, we trained both LSTM and cLSTM models using PyTorch implementations with 50 epochs and a fixed seed.
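The holdout and standardization steps can be sketched as follows. The sketch assumes a chronological split and assumes that the scaling statistics are computed on the training part only; neither detail is specified above. The function names are hypothetical.

```python
import numpy as np


def holdout_split(series, train_fraction=0.88):
    """Chronological split: the first 88% for training, the rest for testing."""
    series = np.asarray(series, dtype=float)
    cut = int(round(len(series) * train_fraction))
    return series[:cut], series[cut:]


def standardize(train, test):
    """Scale both parts with the training mean and standard deviation."""
    mean, std = train.mean(), train.std()
    if std == 0:
        std = 1.0  # constant series: avoid division by zero
    return (train - mean) / std, (test - mean) / std, (mean, std)
```

For a series of 500 points, this split gives 440 training and 60 testing observations.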

6.3.1 Metrics

Our models were assessed using the mean absolute scaled error (MASE) [55], which is well suited to time-series with diverse scales. We also employed a combination of mean squared error (MSE) and the Diebold-Mariano statistical test [56] to determine Wins, Draws, and Losses. A Draw implies comparable performance between the two models.

Incorporating both mean squared error (MSE) and mean absolute scaled error (MASE) is crucial to comprehensively assess learning performance. MSE effectively identifies significant predictive errors, particularly those stemming from outliers, by giving more weight to larger errors. This allows us to evaluate how well models handle substantial deviations from actual values. On the other hand, MASE provides a comprehensive performance measure that takes into account the scale of the data, facilitating a fair comparison across various forecasting methods. By carefully utilizing both MSE and MASE, we gain a complete understanding of model performance across datasets with varying levels of quality.
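The two metrics can be computed as in the sketch below. The MASE follows its usual definition: the forecast mean absolute error divided by the in-sample mean absolute error of a naive forecast. The seasonal lag `m = 1` is an assumption, since the lag used in the experiments is not stated.

```python
import numpy as np


def mse(actual, predicted):
    """Mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))


def mase(actual, predicted, train, m=1):
    """Forecast MAE scaled by the in-sample MAE of the lag-m naive forecast."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    train = np.asarray(train, dtype=float)
    scale = np.mean(np.abs(train[m:] - train[:-m]))
    if scale == 0:
        raise ValueError("naive in-sample error is zero; MASE is undefined")
    return float(np.mean(np.abs(actual - predicted)) / scale)
```

A MASE below 1 indicates that the forecast is, on average, more accurate than the in-sample naive forecast, independent of the scale of the series.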

6.4 Experimental results

The outcomes of this study are grouped into three sections, each corresponding to one of the three research questions. The experiments encompass two distinct scenarios for forecasting: using the M4 competition and NAB datasets.

6.4.1 cLSTM vs LSTM on M4

An architecture based on LSTM emerged victorious in the M4 competition [8]. This prompted us to conduct an in-depth comparison between cLSTM and LSTM, using datasets that were part of the same competition. The results are presented in Tables 2 and 3, showcasing the Wins for various thresholds, both collectively and individually. The findings underscore cLSTM’s superiority in the Daily, Weekly, and Monthly datasets. This pattern is further highlighted in Fig. 7. Conversely, LSTM demonstrated its prowess in the Hourly and Yearly datasets. This divergence could be attributed to the granularity and distinct patterns within each dataset. These trends persisted across different detection thresholds, with a threshold value of 1.2 proving to be the optimal on average. Furthermore, the increased variability within the Hourly dataset might contribute to the heightened detection of false positive corrections. Conversely, the smaller size of the Yearly sequences limits cLSTM’s exposure to training data, possibly leading to less significant cell states dynamics. An example of an Hourly time series is depicted in Fig. 8. Additionally, fine-tuning the detection threshold \(\eta \) emerges as a pivotal consideration for cLSTM, as it substantiates correction precision (see: Table 3).

Table 2 M4-number of Wins based on MSE for all the thresholds
Table 3 M4-number of Wins based on MSE and Diebold Mariano test by time series
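To make the notion of a Win concrete, the following is a minimal sketch of how a per-series outcome could be derived from a Diebold-Mariano test on squared forecast errors. It assumes a one-step forecast horizon, a normal approximation for the test statistic, and a significance level of 0.05; the function names and the win/tie/loss rule are hypothetical and may differ from the exact procedure used to build the tables.

```python
import math
from typing import Sequence, Tuple


def diebold_mariano(errors_a: Sequence[float],
                    errors_b: Sequence[float]) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) under squared-error loss.

    Assumes a one-step horizon, so no autocovariance correction is applied.
    A negative statistic means model A has the lower average loss.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long error series with at least 2 points")
    # Loss differential between the two models at each time step.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    if var_d == 0:
        raise ValueError("loss differential has zero variance")
    stat = mean_d / math.sqrt(var_d / n)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


def outcome(errors_clstm: Sequence[float], errors_lstm: Sequence[float],
            alpha: float = 0.05) -> str:
    """Hypothetical rule: 'win' or 'loss' only if the difference is significant."""
    stat, p_value = diebold_mariano(errors_clstm, errors_lstm)
    if p_value >= alpha:
        return "tie"
    return "win" if stat < 0 else "loss"


# Toy forecast errors (hypothetical values).
clstm_errors = [0.1, -0.2, 0.1, 0.2, -0.1, 0.15]
lstm_errors = [1.0, -1.2, 0.9, 1.1, -1.0, 1.3]
print(outcome(clstm_errors, lstm_errors))
```

Counting the outcomes over all series of a dataset would then yield the number of Wins, ties, and Losses for that dataset.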
Fig. 7 Number of Wins of cLSTM and LSTM over a total of 1000 time series for the Hourly, Daily, Weekly, Monthly, and Yearly datasets picked from the M4 competition

For a comprehensive evaluation of cLSTM’s Wins and Losses, we computed average MASE values in both scenarios (see Tables 4 and 5). The MASE values are consistent with the Wins and Losses, except for the Weekly dataset, where LSTM’s MASE values are consistently lower. Given the number of Wins in Table 3, we deduce that cLSTM does not raise overall model performance beyond that of LSTM in this case. However, cLSTM’s corrective potential enabled it to mitigate errors that influenced the overall MSE across these datasets. Notably, the standard deviation of LSTM’s MASE in Win cases is higher than that of cLSTM’s MASE in Loss cases. Additionally, the findings suggest that cLSTM fares better in datasets where LSTM’s MASE values are higher. In essence, cLSTM outperforms LSTM in datasets with outliers. The left side of Fig. 9 illustrates the corrections applied to the original time series, while the right side displays the forecasting outcomes. Another instance in Fig. 8 demonstrates the impact of a high number of corrections on forecasts.

Fig. 8 Time-series reconstruction example on M4 (Hourly). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Fig. 9 Time-series reconstruction example on M4 (Monthly). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Table 4 M4-cLSTM and LSTM average and standard deviation of MASE values on win cases
Table 5 M4-cLSTM and LSTM average MASE values on Loss cases

Regarding Loss cases, both Figs. 10 and 11 exhibit a similar trend. This trend could be attributed to the prevalence of false positives in such instances. Notably, Fig. 10 showcases cLSTM’s superior aptitude in predicting seasonality. Figure 12 portrays a step function time series. Here, cLSTM slightly adjusted training data, leading to a detrimental effect on accuracy. Although we integrated early stopping for situations where cLSTM fails to identify the correct value, further enhancements are necessary to prevent substantial corrections in the wrong direction (refer to Fig. 13).

Fig. 10 Time-series reconstruction example on M4 (Daily). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Fig. 11 Time-series reconstruction example on M4 (Daily). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Fig. 12 Time-series reconstruction example on M4 (Daily). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Fig. 13 Time-series reconstruction example on M4 (Daily). The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

During the training of the corrector long short-term memory (cLSTM) model, we monitored the training loss, which decreased consistently over epochs and converged robustly immediately after the corrections were applied at epoch 50, the value chosen in this study (see Fig. 14). This suggests that the model effectively grasped the underlying patterns of the training data. We also observed a strong correlation between the validation loss and the MSE on the validation dataset. Lower MSE corresponded to lower validation loss, indicating accurate predictions by the cLSTM model on validation data. Conversely, higher MSE values led to higher validation loss, highlighting significant deviations from the ground truth. This correspondence underscores cLSTM’s prowess in minimizing forecasting errors.

Fig. 14 Learning curves of LSTM and cLSTM on three different time series

6.4.2 cLSTM vs LSTM on NAB

In pursuit of improved data quality, the logical course involves addressing anomalies. This section examines the performance of cLSTM and LSTM on data featuring anomalies, exemplified by the Numenta Anomaly Benchmark (NAB) dataset. Table 6 shows that cLSTM wins in 26 time series, ties in 24, and loses in six. Figure 15 provides a glimpse of corrected anomalous data (left) and its associated forecast (right). Furthermore, Table 7 lays out MASE values for cLSTM and LSTM. cLSTM reduces average errors in three out of seven datasets, equals LSTM in two, and lags in two. Despite cLSTM’s higher count of Wins in the Tweets category (8 out of 10), the MASE values hint at cLSTM’s underperformance compared to LSTM. This could be attributed to cLSTM’s rectification of larger errors, which lowered the MSE values without substantially altering the model’s global performance.

Table 6 Number of Wins based on MSE and Diebold Mariano test by time series on NAB
Fig. 15 Time-series reconstruction example on NAB. The left plot shows the original (black) and the corrected (red) time series. The right plot shows the forecast results (color figure online)

Table 7 NAB-MASE values of cLSTM and LSTM

6.4.3 Anomaly detection

Hierarchical temporal memory (HTM) networks, as indicated in [22], are robust at identifying anomalies within the Numenta Anomaly Benchmark (NAB) dataset. We conducted a comparative assessment of HTM and cLSTM, as detailed in Table 8. The anomaly labels and HTM outcomes were sourced from the source given in Footnote 2. We applied a threshold of 0.85 to the raw HTM score, deeming data points that surpass this threshold anomalies. Within this framework, true positives (TP) are the detected data points that are labeled as anomalies within NAB. Conversely, false positives (FP) are the detected data points that are not annotated as anomalies in the NAB dataset. The results show that cLSTM surpasses HTM in detecting TPs. Both HTM and cLSTM exhibit a relatively high number of FPs. Given cLSTM’s primary design goal of enhancing learning, the model also identifies other errors associated with the model itself, alongside anomalies. An illustration of the HTM and cLSTM detection outcomes is presented in Fig. 16. In this figure, the original and corrected time series are represented by the black and red lines, respectively. A black dot marks a data point flagged as an anomaly by HTM, whereas a red cross marks a data point corrected by cLSTM. Notably, in Fig. 16, we observe that cLSTM, in contrast to HTM, does not alter data solely based on statistical significance (e.g., outliers). Instead, cLSTM takes into account the extent to which these data points influence cell state behavior. For instance, within the interval 50–100, there is a peak in the time series that is identified as an anomaly by HTM, yet not reconstructed by cLSTM. In general, cLSTM exhibits a strong capability to detect and rectify outliers and various anomalies.
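The TP and FP counting described above can be sketched as follows. The sketch works point-wise, as in the description: a data point is detected when its raw score strictly exceeds 0.85, and it counts as a TP if its index is among the labeled anomalies and as an FP otherwise. The function names and the toy scores and labels are illustrative assumptions; they are not NAB data and do not reproduce NAB’s own scoring scheme.

```python
from typing import Sequence, Set, Tuple


def detect(scores: Sequence[float], threshold: float = 0.85) -> Set[int]:
    """Indices of data points whose raw score strictly exceeds the threshold."""
    return {i for i, score in enumerate(scores) if score > threshold}


def count_tp_fp(detected: Set[int], labeled: Set[int]) -> Tuple[int, int]:
    """Count detections that are labeled anomalies (TP) and those that are not (FP)."""
    true_positives = len(detected & labeled)
    false_positives = len(detected - labeled)
    return true_positives, false_positives


# Toy example (hypothetical raw scores and anomaly labels).
raw_scores = [0.10, 0.90, 0.30, 0.86, 0.85, 0.99]
labeled_anomalies = {1, 2, 5}

detected_points = detect(raw_scores)
tp, fp = count_tp_fp(detected_points, labeled_anomalies)
print(sorted(detected_points), tp, fp)
```

For cLSTM, the set of detected points would be the indices of the corrected data points instead of thresholded scores; the counting step stays the same.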

Fig. 16 cLSTM detection vs HTM detection considering raw score \(> 0.85\). The black and red time series are the original and the corrected time series, respectively. The black dot is the data point that was detected as an anomaly by HTM. The red cross is the data point that has been corrected by cLSTM (color figure online)

Table 8 Scoreboard showing anomaly detection results of HTM and cLSTM on v1.1 of NAB. FP stands for the data points that were detected as errors by HTM and cLSTM, but they were not in the labeled anomalies. TP stands for the data points that were detected as errors by the methods and were labeled as anomalies

6.4.4 Computational cost

In this investigation, we calculated the computational time, particularly considering the computational requirements of the detection and correction components, which involve training and forecasting using SARIMA. As a result, an extra computational overhead is incurred in addition to the time taken for LSTM learning (refer to Table 9). These experiments were conducted on a machine equipped with an i7 processor featuring 6 CPU cores @ 2.60 GHz and 16 GB of RAM.

Table 9 Average of additional time needed for one time-series reconstruction (in seconds)

cLSTM incorporates a meta-learner during the detection process, which leads to a longer computational time than LSTM. Table 9 reports the average additional time required by cLSTM over LSTM, as observed during the experiments conducted in this paper.

7 Limitations and future works

One of the main limitations of cLSTM, rooted in its read & write machine learning (RW-ML) framework, is the computational overhead associated with its reliance on a meta-learner during the detection process. This component is computationally expensive as it requires fine-tuning to find the optimal orders. Future research could explore alternative methods, such as incorporating gating mechanisms directly into the LSTM cell architecture, to mitigate this computational burden and replace the meta-learner. Additionally, optimizing the correction component by integrating optimization search algorithms could enhance the efficiency of data correction processes and prevent the model from making incorrect corrections. Furthermore, investigating the model’s generalization capability across diverse datasets, enhancing the interpretability of cell states dynamics (CSD), and adapting cLSTM to dynamic environments are all promising avenues for future research. Improvements in these areas could further enhance the applicability and democratization of machine learning within the RW-ML paradigm.

8 Conclusions

Recurrent neural networks often face challenges due to data quality issues that impact their learning process. To bolster their resilience and tackle these challenges, the common approach involves using data preprocessing techniques. Alternatively, one can adjust the data during the learning phase to address emerging problems. In our research, we introduce a novel approach named corrector long short-term memory (cLSTM), which leverages the cell states within the LSTM model. The dynamics of the cell states enable cLSTM to recognize learning issues and adapt the input data accordingly. This is facilitated by a meta-learner that predicts these states. We verified the effectiveness of cLSTM by conducting comprehensive comparisons with the standard LSTM using the M4 and NAB datasets. The outcomes demonstrate that cLSTM enhances the learning process by addressing learning issues occurring in the cell states dynamics. Throughout the study, cLSTM outperforms LSTM, especially in improving forecasting accuracy when LSTM struggles with errors. Moreover, comparisons between HTM and cLSTM on the NAB dataset highlight cLSTM’s potential to improve data quality, especially in identifying anomalies. To conclude, our findings highlight cLSTM’s ability to enhance forecasting accuracy, potentially transforming the learning paradigm from read-only to read-and-write models. An interesting direction for future exploration is to investigate whether similar improvements can be applied to other neural network architectures beyond LSTM.