1 Introduction

Time series prediction involves finding patterns in past data and using those patterns to forecast future events. It plays a crucial role in data science for predicting trends, planning for future occurrences, and making decisions based on expected outcomes. Achieving accurate predictions requires a deep understanding of the underlying patterns in the data and of how they may evolve over time.

Time series prediction demands a deep comprehension of the data and its inherent patterns. It is essential to account for seasonality, trend, and noise within the data, along with any external factors that might influence it. The size of the dataset is also important, as it directly affects the accuracy of the predictions: a larger dataset enhances generalization [1], reduces overfitting, and supports statistical significance. However, collecting extensive data is not always feasible, so techniques such as optimization [2], data augmentation, and transfer learning can improve model performance even with limited data.

The applications of time series analysis tools are not limited to engineering research data repositories, where the performance of various engineering devices and tools can be evaluated over time for improvement and accuracy. These approaches are now employed in almost all scientific disciplines, including biomedicine [3,4,5,6,7], finance [8,9,10,11], agriculture, industry [12], and, most importantly, climate change, an evolving area of research [13]. In these fields, time-dependent data on multiple variables can be efficiently managed and assessed for future predictions [14].

It is important to note that, due to the diverse nature of climate change challenges, it is difficult to retrieve detailed information on all attributes and predictors, so time series forecasting often depends on augmentation techniques. For example, in the event of floods [15], wildfires [16], earthquakes [17], and even disasters such as pandemics [18], qualitative data (survivor interviews, audio and visual material) is very limited, and the analysis is challenging because of missing or incomplete data. In this manuscript, we discuss another approach that is comparatively efficient for complex datasets, known as transfer learning [19].

Similarly, machine learning tools have gained recognition in other domains such as finance and business owing to the strengths of novel techniques in terms of data handling and operating characteristics. Over the years, researchers have examined the strengths of artificial intelligence (AI) tools through business case studies [11, 12, 20,21,22]. Likewise, to secure online businesses [23,24,25], researchers have developed algorithms that identify malicious websites, serving as an initial protective measure.

In all the applications discussed above and in general practice, the statistical datasets exhibit three main qualities: (a) dimension, (b) sparsity, and (c) resolution.

The "dimensionality" of a dataset refers to the total number of characteristics and measurements for each object in the dataset. When a dataset has samples with numerous descriptions, known as "high dimensionality," it can become challenging to discern the meaning of the data. This challenge is often referred to as the "curse of dimensionality."

When most of an object’s features are zero, the distribution is highly skewed; in many cases, fewer than 1 percent of the entries are non-zero. AI tools treat such data as sparse [26], emphasizing the scattered nature of the dataset [27].

The third quality relates to the resolution of the data. If the resolution is too fine, a pattern may not be visible or may be buried in noise (see Table 1 for the climate change datasets with noise); if the resolution is too coarse, the pattern may disappear altogether. For instance, the motion of storms and other weather phenomena can be observed through changes in atmospheric pressure on an hourly timescale, whereas on a timescale of months such patterns are no longer discernible.
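To illustrate the effect of resolution, the short Python sketch below aggregates a synthetic hourly pressure series to monthly means, at which point hour-scale storm signatures are averaged away. The series, frequencies, and values are purely illustrative assumptions, not data from the studies cited here.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly atmospheric-pressure series (one year of synthetic data).
idx = pd.date_range("2022-01-01", periods=24 * 365, freq="H")
pressure = pd.Series(1013 + np.random.randn(len(idx)).cumsum() * 0.1, index=idx)

# Hourly resolution: short-lived storms appear as sharp pressure drops.
hourly = pressure

# Monthly resolution: the same series aggregated to month-end means,
# where hour-scale storm signatures are smoothed out.
monthly = pressure.resample("M").mean()

print(hourly.describe())
print(monthly.describe())
```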

Table 1 Summary of data aspects

In particular, as the number of dimensions increases, the data become increasingly sparse in the space they occupy. For classification, this means there may not be enough data objects to build a model that reliably assigns a class to every possible object. For clustering, the notions of distance and density between points, which are crucial to clustering, become less meaningful.

1.1 Types of Datasets

The data of the research problems addressed above can be categorized into (a) ordered data, (b) record data, and (c) graph-based data.

In most cases, data mining relies on record data, a collection of records (data objects). In basic record data (data stored in a table), there are typically no explicit relationships between records or data fields, and every record (object) shares the same set of attributes. Records are usually stored in flat files or relational databases (tables of rows and columns).

In ordered data, the relationships among attributes involve order in time or space. There are four types:

  1. Sequential data, also known as temporal data, is an extension of record data in which each record is associated with a specific time. Consider a dataset of retail transactions that includes both the time and the type of each transaction.

  2. Sequence data comprises a collection of items listed in order, such as a sequence of words or letters. While it resembles sequential data, its positions follow a distinct order rather than being marked with timestamps. For instance, the genetic instructions of plants and animals are represented as nucleotide sequences known as genes. For the analysis of such datasets, readers may consult works such as [34, 36].

  3. Time series data is a special type of sequential data in which each record is a point in a time series, such as economic data based on daily stock prices recorded over a specific interval. In this scenario, time series data could be collected over several months, recording daily prices for various stocks and reflecting their daily fluctuations, which are especially common in developing countries where even grocery prices can vary daily. Useful references in this domain are [36, 37].

  4. Some objects possess spatial attributes, such as location or size, in addition to other attributes. An example of spatial data is weather information (precipitation, temperature, and pressure) collected at various locations worldwide. Useful approaches in this domain are listed in [38, 39].

A clear understanding of the data type is therefore important when selecting a relevant machine learning tool. In the next section, we discuss machine learning approaches for analyzing time series datasets, address the associated challenges with the help of an example, and extend the research idea with the aid of the transfer learning approach.

2 Materials and Methods

With machine learning networks, complex datasets can be explored more efficiently. These networks can further help to address problems such as time series forecasting and risk management.

2.1 Recurrent Neural Networks

Recurrent neural networks (RNNs) differ from basic feed-forward networks. RNNs can model temporal dynamic behaviour by forming a directed graph along a temporal sequence and by using internal memory to process sequences, which makes them well suited to sequential data such as time series, financial data, audio, video, speech, weather records, and other complex sequential problems. RNNs originated in the 1980s, but their full strength has only recently emerged.
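To make the recurrence concrete, the following Python sketch implements the vanilla RNN update \(h_t = \tanh (W_x x_t + W_h h_{t-1} + b)\) directly in NumPy. It is a minimal illustration; the array sizes, variable names, and random weights are our own assumptions rather than part of any cited model.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b, h0=None):
    """Vanilla RNN forward pass: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    states = []
    for x_t in x_seq:                      # step through the temporal sequence
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)                # shape: (seq_len, hidden_size)

# Toy usage: 10 time steps, 3 input features, 5 hidden units.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(10, 3))
W_x, W_h, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
print(rnn_forward(x_seq, W_x, W_h, b).shape)   # (10, 5)
```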

Fig. 1 Recurrent neural network architecture

2.2 Long Short-Term Memory Models

Hochreiter and Schmidhuber [40] introduced Long Short-Term Memory (LSTM) networks in 1997 to address the vanishing gradient problem in traditional RNNs. The LSTM architecture mitigates the issue whereby gradients become extremely small during backpropagation, slowing or halting the learning process.

To resolve the vanishing gradient problem, the LSTM employs a memory cell that can choose to forget or remember information over time. Three gating units control the cell by deciding how much information to forget, how much to remember, and how much new information to add.

Over time, LSTM architecture has gained acceptance as a superior choice compared to traditional RNNs. The specific design of LSTM has been enhanced and successfully applied to various problems in finance, accounting, and other research and technology fields. Its applications extend to tasks such as speech recognition, natural language processing, medical imaging, bio-medicine, smart energy, and other time-series prediction tasks [41, 42]. With numerous extensions and variations tailored to address diverse challenges, the LSTM approach has become one of the most trusted methods for time series data analysis, particularly with datasets featuring higher frequencies and different attributes [43, 44].

2.2.1 Methodology

These networks include a unique set of memory cells that replace the neurons of the hidden layer of an RNN, and the state of these memory cells is central to the model. LSTM models filter information through gate structures to preserve the state of the memory cells and keep it up to date on a regular basis. The gate structure consists of an input gate, an output gate, and a forget gate. Each memory cell contains three sigmoid layers and one \(\tanh \) layer. Figure 2 shows how an LSTM memory cell is put together.

Fig. 2 LSTM model architecture

The forget gate within the LSTM unit decides which information about the cell state is discarded. As shown in Fig. 2, the memory cell takes as inputs the previous output, \(h_{t-1}\), and the current external input, \(x_t\), combined into a single vector, \(\textbf{v} = [h_{t-1}, x_t]\), and passes it through the sigmoid function:

$$\begin{aligned} f_t = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f) \end{aligned}$$
(1)

Here \(W_f\) and \(b_f\) are the forget gate weight matrix and bias, respectively, and \(\sigma \) denotes the sigmoid function. The purpose of the forget gate is to determine how much of the previous cell state \(C_{t-1}\) is retained in the current cell state \(C_t\). Based on \(h_{t-1}\) and \(x_t\), the gate outputs values between 0 and 1, where 1 means fully retained and 0 means fully discarded.

The input gate determines how much of the current network input \(x_t\) is stored in the cell state \(C_t\), preventing irrelevant information from entering the memory cell. A sigmoid layer selects the values to be updated, as expressed in Eq. 2:

$$\begin{aligned} i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i) \end{aligned}$$
(2)

Next, a \(\tanh \) layer creates a candidate vector \(\hat{C_t}\) that controls the amount of newly added information, as in Eq. 3:

$$\begin{aligned} \hat{C_t} = \tanh (W_c \cdot [h_{t-1}, x_t] + b_c) \end{aligned}$$
(3)

The cell state \(C_{t}\) is then updated as shown in Eq. 4:

$$\begin{aligned} C_t = f_t *C_{t-1} + i_t *\hat{C_t} \end{aligned}$$
(4)

The output gate controls how much of the current cell state is exposed as output. A sigmoid layer first decides which information to output; the cell state is then passed through \(\tanh \) and multiplied by the output of the sigmoid layer to produce the result.

$$\begin{aligned} O_t = \sigma (W_o \cdot [h_{t-1}, x_t] + b_o) \end{aligned}$$
(5)

The final value of the cell output is given as follows:

$$\begin{aligned} h_t = O_t *\tanh (C_t) \end{aligned}$$
(6)
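For readers who prefer code to algebra, the sketch below implements one LSTM step following Eqs. (1)–(6) in plain NumPy, with all weight matrices acting on the concatenated vector \([h_{t-1}, x_t]\). The layer sizes and random weights in the toy usage are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step following Eqs. (1)-(6); weights act on v = [h_{t-1}, x_t]."""
    v = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ v + b_f)                 # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ v + b_i)                 # input gate, Eq. (2)
    C_hat = np.tanh(W_c @ v + b_c)               # candidate state, Eq. (3)
    C_t = f_t * C_prev + i_t * C_hat             # cell-state update, Eq. (4)
    o_t = sigmoid(W_o @ v + b_o)                 # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)                     # hidden output, Eq. (6)
    return h_t, C_t

# Toy usage: 4 input features, 8 hidden units.
rng = np.random.default_rng(1)
n_in, n_hid = 4, 8
shape = (n_hid, n_hid + n_in)
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_hid)
h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_step(rng.normal(size=n_in), h, C, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o)
print(h.shape, C.shape)   # (8,) (8,)
```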

2.3 Recursive Long Short-Term Memory Models

Recursive LSTMs offer distinct advantages over alternative methods in time series prediction. Primarily, they excel in capturing long-term dependencies in data, making them well-suited for forecasting future events. Moreover, their capacity to learn from their own predictions contributes to continual improvement in accuracy. Lastly, they effectively capture patterns across multiple time steps, a crucial aspect in the realm of time series prediction.

Table 2 Comparison of RNN variants

3 Results and Discussion

3.1 Case Study

Although LSTM has many applications, here we illustrate its significance with a stock price prediction example. In this case study, we employ recursive LSTM models on historical Microsoft Corporation (MSFT) stock data. The process involves data preprocessing, model architecture design, and training for future stock price predictions.

The dataset, sourced from Yahoo Finance (1986-03-14 to 2022-10-07), undergoes normalization, noise removal, and transformation. The model architecture is then specified in terms of layers, neurons, optimizer, and loss function. Training involves feeding the data to the network, adjusting the weights, and evaluating accuracy on a test set.
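As a concrete illustration, the following Python sketch performs this kind of preprocessing. It assumes the yfinance package for downloading the MSFT closing prices, scikit-learn's MinMaxScaler for normalization, and a 30-day sliding window; the choice of closing prices and the window length are our assumptions, not details reported for the case study.

```python
import numpy as np
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

# Download MSFT daily prices for the period used in the case study.
data = yf.download("MSFT", start="1986-03-14", end="2022-10-07")
close = data["Close"].to_numpy().reshape(-1, 1)

# Scale prices to [0, 1]; LSTMs train more stably on normalized inputs.
scaler = MinMaxScaler()
close_scaled = scaler.fit_transform(close)

# Build sliding windows: each sample uses the previous `window` days
# to predict the next day's closing price.
window = 30                                    # assumed window length
X, y, dates = [], [], []
for t in range(window, len(close_scaled)):
    X.append(close_scaled[t - window:t, 0])
    y.append(close_scaled[t, 0])
    dates.append(data.index[t])
X, y = np.array(X)[..., np.newaxis], np.array(y)   # X: (samples, window, 1)
```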

The dataset is split chronologically into training, validation, and test sets at the 80th and 90th percentiles of the timeline, marked ‘q80’ and ‘q90,’ giving roughly 80%, 10%, and 10% of the observations, respectively. Figure 3 visually represents this split, with colors distinguishing the three sets, and the legend identifies each set in the plot.
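Continuing the sketch above, the chronological split at the ‘q80’ and ‘q90’ markers might be implemented as follows (variable names are carried over from the previous snippet).

```python
# Chronological split at the 80th and 90th percentiles of the timeline,
# matching the 'q80' / 'q90' markers in Fig. 3.
q80 = int(len(X) * 0.80)
q90 = int(len(X) * 0.90)

X_train, y_train, dates_train = X[:q80], y[:q80], dates[:q80]
X_val,   y_val,   dates_val   = X[q80:q90], y[q80:q90], dates[q80:q90]
X_test,  y_test,  dates_test  = X[q90:], y[q90:], dates[q90:]
```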

This case study demonstrates a systematic approach to recursive LSTM models for stock price prediction, highlighting key steps from preprocessing to model evaluation. Figure 3 provides a concise visual of the dataset split, crucial for assessing model performance.
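A minimal Keras sketch of such a model is shown below, continuing the variables from the earlier snippets. The layer sizes, optimizer, number of epochs, and batch size are illustrative choices rather than the configuration that produced Figs. 4, 5, and 6.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative architecture: one LSTM layer followed by a small dense head.
model = tf.keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse", metrics=["mae"])

# Fit on the training windows and monitor the validation split.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32, verbose=0)

# One-step-ahead accuracy on the held-out test windows.
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"test MSE={test_loss:.5f}, test MAE={test_mae:.5f}")
```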

Fig. 3 Splitting of the data for modeling

Figure 4 compares the model’s training predictions with the actual target values for the training set, making it possible to visually assess the model’s performance on the training data.

Fig. 4 Model training result

Fig. 5 Model validation result

The trained model is then used to generate predictions, which are flattened into a 1D array of predicted values. Figure 5 presents the model validation results.

Fig. 6 Model test results

The test results are presented in Fig. 6.

A composite plot then compares the model’s predicted values with the true target values across all three data sets, making it possible to visually assess the model’s overall performance.

Fig. 7 Composite results for the detailed features of the model

Results are presented in Figs. 7 and 8, showing the composite and recursive prediction outcomes. The recursive prediction process generates forecasts with the trained LSTM model over the validation and testing periods. Predicted values are stored in the ‘recursive predictions’ list and plotted against the target dates. The method is recursive: predictions are updated iteratively by replacing the last element in each input window with the previously predicted value.
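A sketch of such a recursive loop is given below, continuing the variable names from the earlier snippets; it illustrates the general technique rather than the exact procedure behind Fig. 8.

```python
import numpy as np

# Start from the last training window and roll forward over the
# validation and test dates, feeding each prediction back into the window.
recursive_predictions = []
recursive_dates = list(dates_val) + list(dates_test)

last_window = X_train[-1].copy()                     # shape: (window, 1)
for _ in recursive_dates:
    next_pred = model.predict(last_window[np.newaxis, ...], verbose=0)[0, 0]
    recursive_predictions.append(next_pred)
    last_window = np.roll(last_window, -1, axis=0)   # slide the window forward
    last_window[-1, 0] = next_pred                   # replace the last element

# Map the scaled predictions back to the price scale for plotting.
recursive_prices = scaler.inverse_transform(
    np.array(recursive_predictions).reshape(-1, 1)).ravel()
```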

Fig. 8 Recursive prediction results

3.2 Applications to Support Transfer Learning

Developing layers, networks, and classifiers poses significant challenges within the field of machine learning. A substantial subbranch is dedicated to the “reuse” of developed classifiers, often involving the transfer of knowledge gained from training on one dataset to a new problem. These steps have shown promise in developing models in a cost-effective manner.

Over time, several transfer learning approaches have been developed and successfully applied to complex problems [45,46,47]. Building on the work in [48], an algorithm can be designed for the transfer learning approach, utilizing LSTM, to analyze datasets from multiple sources, whether financial or climate-related. For example, researchers [49] used transfer learning with a weighted combination of the available predictors to guarantee convergence to the best weighted predictor, focusing on an online transfer learning framework for improved temperature predictions in residential buildings. Similarly, for flood management, researchers proposed transfer learning models [50] for better forecasting. Another fascinating application is proposed in [51], where transfer learning and LSTM were combined in a bidirectional manner to address missing data problems in building energy.
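A hypothetical Keras sketch of this idea is shown below: an LSTM is pre-trained on a large source series (for instance, the MSFT windows from the case study), its recurrent layer is frozen, and a new head is fine-tuned on a smaller target series. The layer names, sizes, and the placeholder arrays X_source, y_source, X_target, and y_target are our assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_base(window):
    """Source model: an LSTM feature extractor plus a task-specific head."""
    return tf.keras.Sequential([
        layers.Input(shape=(window, 1)),
        layers.LSTM(64, name="shared_lstm"),
        layers.Dense(1, name="source_head"),
    ])

source_model = build_base(window=30)
source_model.compile(optimizer="adam", loss="mse")
# source_model.fit(X_source, y_source, epochs=50)   # large source dataset (placeholder)

# Transfer: reuse the pre-trained LSTM weights, freeze them, add a new head.
shared_lstm = source_model.get_layer("shared_lstm")
shared_lstm.trainable = False
target_model = tf.keras.Sequential([
    layers.Input(shape=(30, 1)),
    shared_lstm,
    layers.Dense(16, activation="relu"),
    layers.Dense(1, name="target_head"),
])
target_model.compile(optimizer="adam", loss="mse")
# target_model.fit(X_target, y_target, epochs=20)   # small target dataset (placeholder)
# Optionally unfreeze shared_lstm and fine-tune with a low learning rate afterwards.
```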

For more complex situations, such as cross-domain knowledge transfer [52] and diverse data sources [53, 54], an improved algorithm can provide a robust framework for efficient analysis and model adaptation. A schematic description (Fig. 9) is presented below to illustrate the steps.

The methods reviewed and their implementations described in this work provide readers with insights to develop advanced algorithms.

Fig. 9 Research approach motivated by works [12, 19, 48], inspired by factors and predictors of climate change [55], utilizing the LSTM algorithm, and bridging transfer learning to forecast challenging research questions

4 Conclusions

Time series forecasting tools have advanced in response to the requirements of current research strategies and pathways. The challenges of time series analysis are not limited to stochastic perturbations; the data are also strongly influenced by underlying sources and stressors. Smart programming tools can memorize patterns and trends and exploit them when forecasting the fate of the open problems under consideration. In this manuscript, we have provided a comprehensive overview of machine learning tools for time series data analysis. Transfer learning approaches have proved promising in this domain, and we conclude that LSTM and transfer learning benefit each other. LSTM networks can be pre-trained on a large dataset for a specific task; the learned representations or weights can then be transferred and fine-tuned on a smaller dataset for a related task. This transfer of knowledge from the pre-trained LSTM to the target task can improve performance, especially when the target dataset is limited. Conversely, transfer learning gives LSTM networks a way to shift knowledge from one task to improve performance on another: by transferring knowledge from the source task to the target task, the LSTM benefits from the general features learned on the source. These tools can help not only with time series data in business and finance but also in emerging research areas such as climate change and public health, fields that face significant challenges in data management and processing, where such tools can play a crucial role.