1 Introduction

Over the past two decades, interest in time-series classification has grown steadily, and an increasing number of scholars have joined the research. With the advent of the 5G era, big data have become closely tied to everyday life, and time-series data are found everywhere, especially in the medical, industrial, and meteorological fields [1,2,3,4]. In finance, accurate and effective time-series analysis can help investors avoid risk and design profitable strategies, which has drawn more and more researchers to financial time series (FTS) analysis. FTS analysis mines the volatility trends of financial products from historical data to guide rational investment. However, because the financial market is affected by many different factors, mining data on financial products is quite challenging. Many FTS analysis methods have been proposed in recent years. This research usually falls into three groups: methods based on traditional statistical analysis, prediction-based methods, and reinforcement-learning-based methods. Studying future price trends from the historical data of financial products has become an increasingly popular direction, especially FTS classification.

Time-series classification (TSC) is a critical problem in time-series data mining. TSC assigns unlabelled time series to categories learnt from series with known labels, so it can be regarded as supervised learning on time series. Unlike traditional classification, which considers only numerical attributes, TSC must also consider the order relationship between adjacent samples in the series. The same holds for financial time series classification. Moreover, compared with other types of time series, financial time series are complex, highly noisy, dynamic, and nonlinear, which makes them harder to classify than conventional data [5]. Choosing appropriate methods and models that learn the characteristics of financial time series well enough to achieve high classification accuracy is therefore difficult. Since 2015, hundreds of TSC algorithms have been proposed [6]. Traditional TSC methods based on sequence distance have been shown to achieve the best classification performance in most fields. Besides distance-based methods, feature-based methods classify well when good features are available. However, it is challenging to design features that capture the inherent properties of an FTS. Although distance-based and feature-based methods are used in many studies, both are too computationally expensive for many practical applications [7]. As researchers have applied deep learning (DL) to TSC, an increasing number of DL-based TSC methods have been proposed, especially with the emergence of deep architectures such as residual neural networks and convolutional neural networks (CNN).
These methods are applied to image, text, and audio data and can also be used to process and analyse time-series data. A multivariate long short-term memory fully convolutional network (LSTM-FCN) was proposed for TSC, which further improved classification accuracy by improving the structure of the fully convolutional block [8]. Jiang addressed the problem of imbalanced time series in industrial applications by proposing a novel anomaly detection approach based on generative adversarial networks (GAN) [9]. Deng proposed the Imputation Balanced GAN (IB-GAN), a novel method that joins data augmentation and classification in a one-step process via an imputation-balancing approach; empirical experiments show significant performance gains against state-of-the-art GAN baselines [10, 11]. Li proposed MAD-GAN, an unsupervised multivariate anomaly detection method based on GAN that uses long short-term memory recurrent neural networks (LSTM-RNN) as the base models (namely, the generator and discriminator) in the GAN framework to capture the temporal correlation of time-series distributions. The experimental results showed that MAD-GAN is effective [12].

Inspired by the success of DL in the image field, where generative adversarial networks (GAN) have achieved remarkable results in generating high-quality images, we explore a DL framework for multivariate FTS classification. The model uses a convolutional long short-term memory network (ConvLSTM) as the generator to learn the distribution of the data and a multilayer perceptron (MLP) as the discriminator to judge whether the generator's output is real or fake. The two networks are trained against each other like a two-player game. We roughly divide stock price movements into three categories (short, neutral, and long). These categories are ordered in the same way as, for instance, a customer's rating of a movie as do-not-bother, only-if-you-must, good, very-good, or run-to-see. Such ratings have a natural order, which distinguishes ordinal regression from general multiclass classification [13]. The idea of ordinal regression has been added to machine learning in many studies. For example, Crammer and Singer [14] used multiple thresholds in an online perceptron algorithm to perform ordinal regression, which is similar in spirit to our multi-classification of stock price trends [13]. There are two situations we want to avoid in the model's predictions: first, predicting a short as a long, which causes a significant loss in actual trading, and second, predicting a long as a short, which causes us to miss the opportunity to profit from the transaction. Therefore, we modified the discriminator of the GAN to address these two cases. The detailed steps are presented in Sect. 3.3. We evaluated the performance of our model on three publicly available stock datasets against several classic comparison methods.
The experimental results show that the ordinal regression GAN for financial volatility trends (ORGAN-FVT) model proposed in this study classifies the three FTS datasets (TSLA, PAICC, and MSFT) significantly better than its competitors, especially after the model is optimised by adding the idea of ordinal regression to the classification results. Moreover, the model is of practical significance because it offers a better reference value for trading and requires less preprocessing.

We summarise our contributions as follows:

  • We propose an effective GAN-based volatility trend multi-classification model for multivariate FTS based on stock data with multiple technical indicators. To the best of our knowledge, such a generative multi-classification model has not been studied much in existing research.

  • We improve the generator of GAN by adopting ConvLSTM to efficiently capture temporal dependencies and exploit ordinal regression in the discriminator to achieve better multi-classification performance.

  • Based on experimental comparison on three real-world stock datasets with state-of-the-art methods, the proposed ORGAN-FVT model outperforms its competitors and leads to a practical performance in financial predictions.

This paper is organised as follows: Sect. 2 reviews the relevant research work. Section 3 introduces the architecture of the proposed ORGAN-FVT model in detail. Section 4 presents the experimental settings and results. Finally, we present our conclusions in Sect. 5.

2 Related Works

The task of time series classification (TSC) is to train a classifier on a dataset so that it maps an input series to a probability distribution over labels. TSC is an important and challenging problem in data mining. Unlike traditional classification, TSC must consider not only the numerical relationship between different attributes but also the order relationship between the points of the series. Chronologically, we can divide TSC methods into early (traditional) methods and deep learning-based methods.

Traditional TSC methods have performed well in many scenarios. One example is model-based classification, such as the fitted regression model proposed by Zhang et al. [15,16,17]. These methods first fit a specific model to each time series and then measure the similarity between the models to perform TSC; they classify well when the series fit the model closely. Prefix-based early time series classification was first proposed in [18]. The idea is to learn the minimum prefix length (MPL) of the time series from a large amount of training data and then classify the test set. Classification methods based on local features, such as the shapelet method proposed by Ye and Keogh in 2009 [19], classify an entire sequence by calculating the similarity between the features of different sub-sequences. TSC algorithms based on global features treat the entire time series as one feature and classify it by calculating the similarity between entire series; the distance-based K-nearest neighbour classifier, for example, has proven to perform well. Moreover, an increasing number of studies have demonstrated that dynamic time warping is the best method for measuring sequence distance in most fields [20,21,22]. The key to classification based on local features is to find local features that clearly separate the classes [23]. Such classifiers are more efficient because the sub-sequences that reflect local features are much shorter than the entire time series.

Financial time series classification methods based on deep learning have been extensively studied. Since deep neural networks (DNN) transformed computer vision, DL-based classification has gradually been applied to time series. For example, Michael et al. took the lead in applying RNN to time series classification [24, 25]. Owing to their robust feature extraction capabilities, deep convolutional neural networks (DCNN) have been applied to time series classification and combined with other models to achieve good results. Yi et al. adapted the convolutional neural network (CNN) and proposed the MC-DCNN model to improve classification performance [26, 27]. In addition, stacked denoising autoencoders (SDAE) have been used for pre-training before model training, which helps the model learn the latent features of the time series. For the multi-classification of price fluctuation trends, financial time-series classification (FTC) has significant value for investment managers and has therefore attracted much attention in the past few decades. Kim and Han [28] proposed a feature selection method based on a genetic algorithm (GA) combined with a neural network (NN) to select useful features for predicting stock price trends. Teixeira et al. [29] used technical indicators, which are common in technical analysis, to represent financial data and fed them into a classification model for FTC. Durán-Rosal et al. [30] used turning points based on piecewise linear regression to segment the target sequence and then used an NN to predict these points.

Many DL methods have been applied to time series classification. The DCNN proposed by Krizhevsky achieved great success in computer vision, particularly in image recognition [31], and GAN have achieved remarkable success in generating high-quality images. The application scenarios of GAN have developed rapidly and now cover images, text, and time series. GAN are increasingly studied for data generation, anomaly detection, time-series prediction, and classification. Goodfellow et al. first proposed GAN to generate high-quality pictures [32]. Later, Zhan used an improved GAN with LSTM to predict satellite images [33], providing important resources for weather forecasting. The network can effectively capture the evolution of the weather system, which helps forecasters predict the weather accurately. Recently, an increasing number of studies have applied GAN to FTS, and research on price trend fluctuation prediction is of great practical value. Zhang et al. applied GAN to stock price prediction [34], using GAN to capture the distribution of actual stock data, and achieved good results compared with existing DL methods. Feng proposed a method based on adversarial training to improve neural network prediction models [35]. The idea is to add perturbations that simulate the randomness of price variables, which enhances the model's generalisation ability when predicting whether the stock price will rise or fall. The results show that the model performs better than existing methods. Given the complex, highly noisy, dynamic, nonlinear, non-parametric, and chaotic characteristics of FTS, whether GAN can successfully learn the distribution of FTS price trends remains a significant challenge.

Previously, some researchers have used GAN to augment data to improve classification performance or to forecast FTS prices. Others have used adversarial learning to predict price falls and rises. However, few studies have addressed price trend prediction for financial time series, and even fewer have addressed DL-based multi-classification of price trends. Feng explored a model with adversarial training, but it was aimed at binary classification. The distribution of trends in the multi-classification problem is unbalanced, especially when there are only a few sudden drop and rise samples, and whether a model can handle this is a considerable challenge. Inspired by previous research, we use GAN to conduct a multi-classification study of price movements in FTS. Given the characteristics of FTS, the challenge is to let the GAN learn the price trend distribution of the original data so as to achieve better end-to-end classification. The three-class study of FTS price trends is more challenging than binary classification, but it has considerable reference value for stock trading.

Fig. 1 The architecture of ORGAN-FVT

3 Methodology

3.1 Proposed ORGAN-FVT Method

The characteristics of FTS make it difficult to study, but the research has considerable market value. Learning the trend distribution of the data and predicting price fluctuations is therefore worth studying, and many researchers have proposed algorithms for these problems. This study approaches the problem from another perspective, with a generative adversarial network. GAN is a framework that trains two models against each other, as in a zero-sum game. In the adversarial process, the generator acts as a cheater that generates data similar to the actual data, while the discriminator acts as a judge that distinguishes between the actual and generated data. They can reach an ideal point, the Nash equilibrium, at which the discriminator cannot distinguish between the two types of data. At this point, the generator has learnt the data distribution of the original input. Based on this principle, we propose a new GAN architecture for end-to-end three-class classification of stock closing price trends. Price and transaction volume are undoubtedly significant for the three-class prediction of the closing price. In addition, technical indicators calculated from price and transaction volume are widely used as input variables in previous studies. Papers [36,37,38] show that many fund managers and investors recognise these 11 technical indicators based on price and trading volume, and that they are commonly used in the stock market as signals of future market trends. Kim first used these indicators with support vector machines for financial time series forecasting in 2003, and Yakup Kara [36] later used them in his article. A variety of technical indicators are available. Some are effective in trending markets, and others perform better in non-trending or cyclical markets [39]. We selected these 11 technical indicators for model training based on previous research.

The closing price is an indicator commonly recognised by market participants, and it contains useful information. We selected the daily data of multiple stocks over recent decades, combined with 11 indicators, to classify price trends as short, neutral, or long. The 11 indicators for one trading day are indicators = {‘Close’, ‘High’, ‘Low’, ‘Open’, ‘RSI’, ‘ADX’, ‘CCI’, ‘FASTD’, ‘SLOWD’, ‘WILLER’, ‘SMA’} [40]. These 11 indicators have proved valuable in previous research, such as technical analysis and mean regression. Therefore, they can be used as the input features of stock data for a three-class study of price fluctuation trends. Our input is \( X = \lbrace x_1,x_2,...,x_t \rbrace \), which is composed of daily stock data for t days. Each \(x_i\) is a vector composed of the 11 indicators. We feed X into the generator to obtain \( \hat{C}_{t+1} \) as fake data, recorded as \( X_{fake} \). Simultaneously, we record \( C_{t+1} \) as real data, denoted \( X_{real} \). In the generator, we take the output of ConvLSTM and pass it through a fully connected layer with the softmax activation function to produce a probability matrix over the three classes short, neutral, and long, which is defined as follows:

$$\begin{aligned} C_{t+1} = [\alpha ,\beta ,\gamma ],(\alpha +\beta +\gamma =1). \end{aligned}$$
(1)

A detailed structure description is shown in Fig. 1. In the ORGAN-FVT model, both the generator and discriminator try to optimise a value function until they reach an equilibrium point, called the Nash equilibrium. Therefore, we can define our value function V(G, D) as follows:

$$\begin{aligned} \mathop {\min }\limits _G \mathop {\max }\limits _D V(G,D) = E[\log D(X_{real})] + E [\log (1-D(X_{fake}))]. \end{aligned}$$
(2)

where \( X_{real} \) denotes the real input of the discriminator, and \( X_{fake} \) denotes the fake input produced by the generator. \(D(X_{real})\) is the discriminator's output for the real input, and \(D(X_{fake})\) is its output for the fake input. A detailed description is provided in Sect. 3.3. To calculate the error between the predicted probability matrix and the one-hot encoded label, we use the cross-entropy loss function. Given two probability distributions p and q, the cross-entropy of p expressed by q is defined as follows:

$$\begin{aligned} H(p,q)=- \sum _{i=1}^n p(x_i) \log q(x_i). \end{aligned}$$
(3)

where p represents the actual label, q represents the predicted label, n is the number of categories, and i indexes the categories. We obtain the predicted probability matrix \(\hat{C}_{t}\) and calculate its cross-entropy loss against the actual probability matrix \(C_{t}\) at that moment. The loss functions of the generator and discriminator are detailed in Sects. 3.2 and 3.3, respectively. In summary, the losses of the discriminator and generator are:

$$\begin{aligned} D_{loss}=\, & {} \frac{1}{m} \sum _{t=1}^m H(D(X_{real}),D(X_{fake})). \end{aligned}$$
(4)
$$\begin{aligned} G_{loss}=\, & {} \frac{1}{m} \sum _{t=1}^m H(C_t,\hat{C}_t). \end{aligned}$$
(5)

where \(D_{loss}\) denotes the training loss of the discriminator, \(G_{loss}\) denotes the training loss of the generator, and m denotes the length of the FTS.
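The cross-entropy of Eq. 3 and the averaged generator loss of Eq. 5 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `eps` guard against `log(0)`, and the two sample predictions are assumptions introduced here.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions."""
    # eps guards against log(0); it is an implementation choice, not from the paper.
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def mean_cross_entropy(targets, predictions):
    """Average H(C_t, C_hat_t) over m time steps, as in the generator loss."""
    m = len(targets)
    return sum(cross_entropy(c, c_hat) for c, c_hat in zip(targets, predictions)) / m

# Two time steps with one-hot labels ordered as [short, neutral, long].
labels = [[1, 0, 0], [0, 0, 1]]
preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
g_loss = mean_cross_entropy(labels, preds)
print(round(g_loss, 4))  # 0.4338
```

With one-hot labels, only the probability assigned to the true class contributes to each term, so the loss falls as that probability approaches one.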

In Fig. 1, the input of the model is an FTS, where \(G_{input}\) denotes the generator's input and \(D_{fake\_input}\) is the output of ConvLSTM after the softmax function in the last layer; this output is used as one input of the discriminator. The one-hot matrix of the real price trend, \(D_{real\_input}\), taken from the labels of the original data, is the other input, and the discriminator distinguishes between the real data and the generator's output. Moreover, we add the concept of ordinal regression to the discriminator: the penalty for predicting long as short or short as long is increased during discrimination, which optimises our classification results. As shown in the legend of Fig. 1, ordinal regression acts as an optimisation method inside the discriminator. The classifications shown by the blue arrows are allowed, and those shown by the red arrows are not. Therefore, the discriminator increases the penalty for the above two prediction errors. When the generator's result is correct, the generator keeps the result and continues training. If it is wrong, the result is returned to the discriminator, which increases the overall error of the GAN model. The generator and discriminator optimise the objective function in Eq. 2, and we finally obtain a generator that has learnt the trend distribution of the data. We can then use the trained generator for prediction, and we verified its performance on the test sets. The experiments are described in Sect. 4. The generator and discriminator are described in detail below.

3.2 The Generator

The generator in ORGAN-FVT is designed with ConvLSTM, which has stronger time-series data processing capabilities. The structure of the generator is shown in Fig. 2. It is composed of ConvLSTM, with 11 technical indicators as inputs. The goal is to let \(\hat{C}_{t+1}\) approach \(C_{t+1}\). The output of the generator G(X) is defined as follows:

$$\begin{aligned} h_t= \,& {} g(X). \end{aligned}$$
(6)
$$\begin{aligned} G(X)=\, & {} \hat{C}_{t+1}=\delta (W_h^T h_t+b_h). \end{aligned}$$
(7)

where \(g(\cdot )\) denotes the output of ConvLSTM, and \(h_t\) is the output of the ConvLSTM with X as the input. \(\delta \) denotes the softmax activation function. \(W_h\) and \(b_h\) denote the weight and bias in the fully connected layer, respectively. We also used dropout as a regularisation method to avoid overfitting. Additionally, we can use the concept of a sliding window to predict \(\hat{C}_{t+1}\) by \(\hat{C}_t\) and X.

Fig. 2 The generator designed with a ConvLSTM
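The output layer of Eq. 7 can be illustrated with a small, dependency-free sketch. It covers only the fully connected softmax head; the ConvLSTM that produces \(h_t\) is not reproduced. The two-unit hidden state, the weights, and the function names are hypothetical values chosen for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def generator_head(h_t, W_h, b_h):
    """C_hat_{t+1} = softmax(W_h^T h_t + b_h) for a hidden vector h_t.

    W_h has one row per hidden unit and one column per class
    (short, neutral, long); the shapes are illustrative only.
    """
    n_classes = len(b_h)
    logits = [
        sum(W_h[j][k] * h_t[j] for j in range(len(h_t))) + b_h[k]
        for k in range(n_classes)
    ]
    return softmax(logits)

# Hypothetical 2-unit hidden state and hand-picked weights.
h_t = [0.5, -1.0]
W_h = [[1.0, 0.0, -1.0],
       [0.5, 0.0, -0.5]]
b_h = [0.0, 0.1, 0.0]
c_hat = generator_head(h_t, W_h, b_h)  # [alpha, beta, gamma], sums to 1
```

The softmax guarantees that the three outputs are non-negative and sum to one, which is the constraint stated in Eq. 1.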

3.3 The Discriminator

The role of the discriminator is to construct a differentiable function D that classifies the input data. The discriminator judges the authenticity of the generator's output by discriminating between the real input data and the fake input data. We chose an MLP as the discriminator model, where \(h_{1}\), \(h_{2}\), \(h_{3}\), and \(h_{4}\) are fully connected layers. The ReLU activation function is used between the hidden layers, and the softmax function is used for the output layer. The output of the discriminator is defined as follows:

$$\begin{aligned} D(X_{fake})=\, & {} \rho (d(X_{fake})). \end{aligned}$$
(8)
$$\begin{aligned} D(X_{real})=\, & {} \rho (d(X_{real})). \end{aligned}$$
(9)

where \(d(\cdot )\) denotes the output of MLP and \(\rho \) denotes the softmax activation function, and \(X_{fake}\) and \(X_{real}\) are probability matrices with one row and three columns, representing the probability of the short, neutral, and long at that moment. In Fig. 3, we show the structure of the discriminator. We optimise the prediction results accordingly in the following two situations, which are described as follows:

  (a) The true label is short (represented by a one-hot matrix as [1,0,0]). We make \(\beta \) in the prediction result \(\hat{C}_{t+1} = [\alpha ,\beta ,\gamma ],(\alpha +\beta +\gamma =1)\) as large as possible, instead of \(\gamma \), to avoid forecasting short as long;

  (b) The true label is long ([0,0,1]). We make \(\beta \) in the prediction result \(\hat{C}_{t+1}\) as large as possible instead of \(\alpha \), so we can try to avoid forecasting for long as short.

This method provides constraints with practical trading significance for the model. The objective function of the discriminator is as follows:

$$\begin{aligned} D_{loss}=Loss_{real}+Loss_{fake}. \end{aligned}$$
(10)

where \(Loss_{real}\) denotes the cross-entropy loss between the discriminator's output for the real data and the actual label, and \(Loss_{fake}\) denotes the cross-entropy loss between the generator's output and the negative sample label. The concept of ordinal regression is embodied in \(Loss_{fake}\), which constrains the two situations above to optimise the classification results of our model. For convenience, let \(\kappa =[1,0,0]\), \(\nu =[0,1,0]\), and \({\textrm{o}}=[0,0,1]\). The definitions of \(Loss_{real}\) and \(Loss_{fake}\) are given as follows:

$$\begin{aligned} Loss_{real}=\, & {} H(D(X_{real}),C_{t+1}). \end{aligned}$$
(11)
$$\begin{aligned} Loss_{fake}= \begin{cases} H({\textrm{o}},\hat{C}_{t+1}), & \text{if } \hat{C}_{t+1}=\kappa \\ H(\nu ,\hat{C}_{t+1}), & \text{if } \hat{C}_{t+1}=\nu \\ H(\kappa ,\hat{C}_{t+1}), & \text{if } \hat{C}_{t+1}={\textrm{o}} \end{cases} \end{aligned}$$
(12)

The first loss \(Loss_{real}\) of the discriminator is obtained by calculating the cross-entropy loss between the discriminator's output and the actual data label \(C_{t+1}\). The second loss in Eq. 10, \(Loss_{fake}\), is obtained by calculating the cross-entropy loss between the predicted value of the generator and the negative sample label. The structure of the discriminator is illustrated in Fig. 3 below.

Fig. 3 Discriminator designed using an MLP with \(X_{real}\) and \(X_{fake}\) as the inputs
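The piecewise definition of \(Loss_{fake}\) in Eq. 12 can be read as a lookup of a negative-sample label followed by a cross-entropy. The sketch below is one possible reading, not the authors' code. It assumes that a condition such as \(\hat{C}_{t+1}=\kappa\) means the arg-max class of the predicted probability vector is short, since \(\hat{C}_{t+1}\) is a probability vector rather than a one-hot vector; the function names are introduced here for illustration.

```python
import math

SHORT, NEUTRAL, LONG = [1, 0, 0], [0, 1, 0], [0, 0, 1]  # kappa, nu, o

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps avoids log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def negative_label(c_hat):
    """Pick the negative-sample label of Eq. 12 for a prediction c_hat.

    Assumption: "C_hat = kappa" is interpreted as "the arg-max class of
    C_hat is short", and likewise for the other two cases.
    """
    k = max(range(3), key=lambda i: c_hat[i])
    # short -> long label, neutral -> neutral label, long -> short label
    return [LONG, NEUTRAL, SHORT][k]

def loss_fake(c_hat):
    """Cross-entropy between the negative-sample label and the prediction."""
    return cross_entropy(negative_label(c_hat), c_hat)

print(round(loss_fake([0.8, 0.15, 0.05]), 4))  # 2.9957
print(round(loss_fake([0.2, 0.7, 0.1]), 4))    # 0.3567
```

Under this reading, the short and long cases are paired with the label of the opposite extreme, while the neutral case is paired with its own label.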

4 Evaluation

4.1 Datasets

We selected actual stock trading data from Yahoo Finance to evaluate our model and selected several classic DL methods as baselines. The data comprise three datasets: Tesla Motors (TSLA), PAICC, and Microsoft Corporation (MSFT) from the National Association of Securities Dealers Automated Quotations (NASDAQ) exchange, which can be downloaded from the Yahoo website. A detailed description of the datasets is given in Table 1. Each stock contains several information indicators such as the opening price (Open), highest price (High), lowest price (Low), closing price (Close), and trading volume (Volume). We construct our label data using the closing price (Close) and define \(x_{i+1}-x_i>\mu \) as short, \(x_{i+1}-x_i<\theta \) as long, and \(x_{i+1}-x_i=\lambda \) as neutral\((0<i<n)\), where \(\mu ,\theta ,\lambda \ge 0 \) are parameters set according to the corresponding stock. Previous studies have also widely used technical indicators calculated from prices and transaction volumes as input variables [40]. In addition to the fundamental market indicators, the input variables in this study include the technical indicators listed in Sect. 3.1. With these indicators as input variables, we first normalise the data with z-scores to remove the effect of different scales between variables. Our goal is to predict the trend of the stock's closing price on the next day, that is, to obtain the trend of the closing price on day \(t+1\) from the input \(X_t\) of the past t days. Through repeated experiments in this study, we set t to 30. Our data are divided into training and testing. We select the first 85–90% of the data on each stock as the training set and the rest (10–15%) as the test set. We present the trend chart of the three datasets in Fig. 4.

Fig. 4 The trend images of the three datasets

Figure 4 shows the closing prices of the three stocks over time. The price trends in the three datasets are visibly different. The closing price in the MSFT dataset fluctuated from the beginning. Around 2000, it began an oscillating decline and then stayed in a long, turbulent but range-bound phase until it began to rise in 2012. The closing price in the PAICC dataset fluctuated up and down throughout. In contrast, the closing price in the TSLA dataset was stable from 2010 to 2020 without significant fluctuations, then rose rapidly before a sharp drop. The three datasets thus represent different trends, cover most real-world scenarios, and better reflect the robustness of different models. Table 1 shows the detailed data description and dataset division. The rows in the table indicate the date range of the dataset, its length, and the lengths of the training, validation, and test sets.

Table 1 Detailed description of our datasets
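The preprocessing described in this section (z-score normalisation, windows of t = 30 days paired with the next day's trend, and a chronological split) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the synthetic two-column data are hypothetical, the labels are taken as given rather than derived from thresholds, and the z-score statistics are computed over the whole series for brevity, whereas fitting them on the training portion only is a common alternative.

```python
import math

def zscore_columns(rows):
    """Standardise each column to zero mean and unit (population) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def sliding_windows(rows, labels, t=30):
    """Pair each window of t consecutive days with the label of the following day."""
    return [(rows[i:i + t], labels[i + t]) for i in range(len(rows) - t)]

def chronological_split(samples, train_fraction=0.85):
    """Keep the earliest samples for training and the latest for testing."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# Synthetic stand-in for 100 trading days with 2 features and 3 trend classes.
rows = [[float(i), float(2 * i)] for i in range(100)]
labels = [i % 3 for i in range(100)]
samples = sliding_windows(zscore_columns(rows), labels, t=30)
train, test = chronological_split(samples, train_fraction=0.85)
```

Splitting by position rather than at random keeps every test window later in time than every training window.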

4.2 Experimental Settings

In our experiments, the ConvLSTM module of the generator and the MLP module of the discriminator remained unchanged. The optimal number of ConvLSTM cells was selected from the range of 8–256 units through a hyperparameter search. The convolutional layers use 256 and 128 filters, a convolution kernel of size two, and the ELU activation function. A pooling layer of size two follows the convolutional layers, which are then connected to the LSTM layers, whose numbers of cells are 100 and 100. A fully connected layer with the softmax activation function produces the output. For fairness, we used the same parameter settings as the generator for the ConvLSTM benchmark method. The four layers of the discriminator have 256, 128, 100, and 3 cells, and the softmax activation function is used in the last fully connected layer to output the probability matrix of the three classes. Training runs for 1000 epochs with an initial batch size of 60. To prevent overfitting, we add a dropout layer with a rate of 0.2 after the CNN layer and the LSTM layer. Our method and the comparison methods are all trained with the Adam optimiser [41]. The initial learning rate of the generator is 1e-3, its final learning rate is 1e-4, and the learning rate of the discriminator is set to 1.2 times that of the generator. Every 50 epochs, if the recall on the validation set has not improved, the learning rate is decreased by 2e-5 until the final learning rate is reached. All models were trained using Keras 2.3.1 with the TensorFlow 2.0 backend. The experiments ran on Ubuntu 16.04 with an NVIDIA GeForce GTX 1080Ti GPU. Third-party libraries, such as TA-Lib, were used to calculate technical indicators.
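The learning-rate schedule described above can be sketched without any framework. The code is an illustration under assumptions, not the training script used in the experiments: the function names are hypothetical, "improved" is taken to mean a validation recall strictly higher than the best value seen so far, and the discriminator rate is assumed to track the decayed generator rate at a ratio of 1.2.

```python
def next_learning_rate(lr, improved, step=2e-5, final_lr=1e-4):
    """Lower lr by a fixed step when validation recall stalls, floored at final_lr."""
    if improved:
        return lr
    return max(final_lr, lr - step)

def simulate(recalls_every_50, initial_lr=1e-3, d_ratio=1.2):
    """Return (generator_lr, discriminator_lr) after each 50-epoch check."""
    lr, best, history = initial_lr, float("-inf"), []
    for recall in recalls_every_50:
        improved = recall > best
        best = max(best, recall)
        lr = next_learning_rate(lr, improved)
        history.append((lr, d_ratio * lr))
    return history

# Hypothetical validation recalls measured at epochs 50, 100, 150 and 200.
history = simulate([0.50, 0.52, 0.52, 0.51])
```

In this example the rate stays at 1e-3 for the first two checks and then drops by 2e-5 at each of the two stalled checks.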

4.3 Evaluation Metrics

In this section, we describe the multi-classification indicators used in this study. The evaluation indicators are built from true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN): accuracy, precision, recall, and the f1-score, which is the harmonic mean of precision and recall. Considering the complex characteristics of FTS, especially the uneven data distribution, we selected the weighted-average indicator. Meanwhile, because the rare samples in real trend-prediction scenarios (such as sudden surges and falls) also deserve attention, we selected the macro-average indicator to better reflect the robustness of the model. Our classes are short, neutral, and long, recorded as 1, 2, and 3, respectively. The indicators are described below.

Table 2 Confusion matrix
Table 3 The experiment result on TSLA

According to the confusion matrix, the following classification performance indicators can be obtained:

  1. Weighted average

This indicator assigns a weight to each category: the score of each category is multiplied by its weight, and the products are summed. This method considers the imbalance of categories, and its value is more likely to be affected by common categories. The weights are proportional to the number of samples in each category, \(W_1:W_2:W_3=N_1:N_2:N_3\) (normalised so that the weights sum to one), where \(W_i\) and \(N_i\) denote the weight and the actual number of samples of category i, respectively.

$$\begin{aligned} Weighted\text{- }precision= & {} \sum _{i=1}^3P_iW_i. \end{aligned}$$
(13)
$$\begin{aligned} Weighted\text{- }Recall= & {} \sum _{i=1}^3R_iW_i. \end{aligned}$$
(14)
$$\begin{aligned} Weighted\text{- }f1\text{- }score= & {} \sum _{i=1}^3FS_iW_i. \end{aligned}$$
(15)

where \(P=\frac{TP}{TP+FP}\) (P denotes Precision), \(R=\frac{TP}{TP+FN}\) (R denotes Recall), \(FS=\frac{2}{\frac{1}{P}+\frac{1}{R}}\) (FS denotes f1-score, the harmonic mean of P and R), and the subscript i corresponds to the category order in Table 2.
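As an illustration of Eqs. (13)–(15), the following minimal Python sketch computes the weighted-average indicators from a confusion matrix. The function names and the example counts are hypothetical and not taken from our experiments; we assume that rows are the true classes and columns the predicted classes, and we use the standard f1-score, i.e. the harmonic mean of precision and recall.

```python
def per_class_scores(cm):
    """Return (precision, recall, f1) for each class.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # true class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, true class differs
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
        scores.append((p, r, f))
    return scores


def weighted_average(cm):
    """Weighted precision, recall and f1 with W_i = N_i / sum(N)."""
    support = [sum(row) for row in cm]             # N_i: actual samples per class
    total = sum(support)
    weights = [s / total for s in support]
    scores = per_class_scores(cm)
    return tuple(sum(w * s[k] for w, s in zip(weights, scores)) for k in range(3))


# Hypothetical counts for the classes short, neutral, long.
example_cm = [[5, 3, 2],
              [2, 20, 8],
              [1, 4, 5]]
weighted_p, weighted_r, weighted_f1 = weighted_average(example_cm)
```

With these weights, the weighted recall coincides with the overall accuracy (30 of 50 samples in the example), which is a useful sanity check.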

  2. Macro-average

This indicator is the unweighted average of the evaluation indicators (Precision/Recall/f1-score) of the different categories. This method treats each category equally; consequently, its value is affected more strongly by categories with few samples than the weighted average is.

$$\begin{aligned} Macro\text{- }precision=(P_1+P_2+P_3)/3. \end{aligned}$$
(16)

where \(P=\frac{TP}{TP+FP}\) (P denotes Precision),

$$\begin{aligned} Macro\text{- }recall=(R_1+R_2+R_3)/3. \end{aligned}$$
(17)

where \(R=\frac{TP}{TP+FN}\) (R denotes Recall),

$$\begin{aligned} Macro\text{- }f1\text{- }score=(FS_1+FS_2+FS_3)/3. \end{aligned}$$
(18)

where \(FS=\frac{2}{\frac{1}{P}+\frac{1}{R}}\) (FS denotes f1-score).
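Eqs. (16)–(18) can be sketched in the same way. The snippet below is illustrative only: the function name `macro_scores` and the counts are hypothetical, rows are assumed to be true classes and columns predicted classes, and the f1-score is computed as the harmonic mean of precision and recall.

```python
from statistics import mean


def macro_scores(cm):
    """Return (macro-precision, macro-recall, macro-f1) for a confusion matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted_i = sum(cm[r][i] for r in range(n))  # TP + FP
        actual_i = sum(cm[i])                          # TP + FN
        p = tp / predicted_i if predicted_i else 0.0
        r = tp / actual_i if actual_i else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    # Every class contributes equally, regardless of its sample count.
    return mean(precisions), mean(recalls), mean(f1s)


# Hypothetical counts for the classes short, neutral, long.
macro_p, macro_r, macro_f1 = macro_scores([[5, 3, 2],
                                           [2, 20, 8],
                                           [1, 4, 5]])
```

Because the two small classes in this example have lower scores than the large neutral class, the macro values are lower than the weighted values would be for the same matrix.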

  3. Area under curve (AUC)

AUC is the area under the receiver operating characteristic (ROC) curve, computed for each category. We follow the definition in [42] to calculate the AUC indicator. To describe the AUC, we first define the true positive rate (TPR) and the false positive rate (FPR).

$$\begin{aligned} TPR=\, & {} TP/(TP+FN). \end{aligned}$$
(19)
$$\begin{aligned} FPR=\, & {} FP/(FP+TN). \end{aligned}$$
(20)

We can obtain the ROC for each category, and the AUC can be calculated from the ROC curve area under each category.
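The per-category AUC can be sketched as follows. This is a minimal one-vs-rest illustration and not necessarily the exact procedure of [42]: the function name `roc_auc_one_vs_rest` is hypothetical, the score is assumed to be the predicted probability of the category of interest, and the area is obtained with the trapezoidal rule over the (FPR, TPR) points produced by sweeping the decision threshold.

```python
def roc_auc_one_vs_rest(y_true, scores, positive_class):
    """AUC of one category against all others.

    y_true: class labels (e.g. 1 = short, 2 = neutral, 3 = long).
    scores: predicted probability of `positive_class` for each sample.
    """
    labels = [1 if y == positive_class else 0 for y in y_true]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when one side has no samples")

    # Lower the threshold step by step; tied scores enter together.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, Eqs. (19) and (20)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += labels[order[j]]
            fp += 1 - labels[order[j]]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoidal rule over consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc
```

Calling the function once per category with the corresponding column of the softmax output gives the three per-class AUC values.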

Table 4 The experiment result on PAICC
Table 5 The experiment result on MSFT

4.4 Experimental Results

We conducted a detailed experimental analysis on the three datasets of TSLA, PAICC, and MSFT based on several different comparison methods. We selected the macro, weighted, and AUC indicators from the multi-classification indicators given in Sect. 4.3. Among them, the AUC indicators describe the classification performance for each of the three categories: short, neutral, and long. The macro and weighted indicators include the corresponding precision, recall, and f1-score. The detailed results are listed in Tables 3, 4, and 5. Our method corresponds to GAN-FVT and ORGAN-FVT in the tables. The GAN-FVT model does not use ordinal regression, whereas ORGAN-FVT was optimised by adding ordinal regression. For ease of description, the bold value in our tables represents the best value in the comparison, and the underlined value indicates the second best. In addition, the Macro-f1-score and Weighted-f1-score indicators of the different methods on the three datasets are shown in Figs. 5, 6, and 7.

4.4.1 Results On Three Datasets

Table 3 shows that, on the TSLA dataset, the ORGAN-FVT model performed better than the contrasted DL methods on seven indicators. In particular, the Class1 AUC reached 0.7698, an increase of 0.1326 over the highest value of 0.6372 among the comparison methods. On the other evaluation indicators, the macro-average and weighted-average results are also better than those of the contrasted DL methods. As can be observed in Fig. 5, compared with GAN-FVT, ORGAN-FVT improved by 0.0363 and 0.042 in Macro-f1-score and Weighted-f1-score, respectively.

Fig. 5 The f1-scores of TSLA

Fig. 6 The f1-scores of PAICC

Table 4 shows that, on the PAICC dataset, the ORGAN-FVT model performed better than the contrasted DL methods on seven indicators. In particular, the Macro-f1-score reached 0.3586, an increase of 0.0272 over the highest value of 0.3314 among the comparison methods. As can be observed in Fig. 6, compared with GAN-FVT, the ORGAN-FVT model improved on both Macro-f1-score and Weighted-f1-score, by 0.1026 and 0.0956, respectively.

Table 5 shows that, on the MSFT dataset, the ORGAN-FVT model performed better than the contrasted DL methods on seven indicators. In particular, the Class1 AUC reached 0.5703, an improvement of 0.0263 over the highest value of 0.5440 among the comparison methods. As can be observed in Fig. 7, compared with the GAN-FVT model, the optimised ORGAN-FVT model improved slightly in Macro-f1-score and Weighted-f1-score.

Fig. 7 The f1-scores of MSFT

Table 6 The results of confusion matrix in the datasets

Tables 3, 4 and 5 show that the ORGAN-FVT model outperforms the existing DL methods on most indicators. Note that, for each indicator, we compared our method with the best performance among the comparison methods. Considered indicator by indicator on each dataset, the improvement of our method is more pronounced. The generator in ORGAN-FVT is the ConvLSTM, and Tables 3, 4 and 5 show that both GAN-FVT and ORGAN-FVT improved on several of the nine indicators.

4.4.2 Results Based on Ordinal Regression

Adding ConvLSTM as the generator of the GAN improves the classification performance compared with the end-to-end ConvLSTM, and the ORGAN-FVT model, optimised by adding ordinal regression, improves the classification results compared with the GAN-FVT model. We present the confusion matrix results for the experimental datasets in Table 6. In Table 6, the first type of error is predicting long as short or neutral, and we report the proportion of short among these short and neutral predictions. The smaller proportion is marked in bold in parentheses, and the larger one is underlined. We perform the same operation for the second type of error, predicting short as long or neutral, and report the proportion of long. On the TSLA and PAICC datasets, our ORGAN-FVT model outperforms GAN-FVT in both cases, with the short ratio decreasing in the first type of error and the long ratio decreasing in the second type of error. However, on the MSFT dataset, which has a relatively uniform data distribution in the test set, ORGAN-FVT performs slightly worse than GAN-FVT in avoiding predicting short as long. We assume that the data distribution of this dataset is more suitable for the GAN-FVT model; therefore, in this case, it outperforms ORGAN-FVT. Overall, after ordinal regression optimisation, the ORGAN-FVT model performs better than GAN-FVT in the above two cases on the TSLA and PAICC datasets, and the overall performance is improved.
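The two error ratios discussed above can be illustrated with a short Python sketch. It reflects our reading of the description rather than the exact layout of Table 6: the function name `severe_error_ratios` and the counts are hypothetical, and we assume that rows of the confusion matrix are true classes and columns are predicted classes, in the order short, neutral, long.

```python
SHORT, NEUTRAL, LONG = 0, 1, 2  # assumed row and column order


def severe_error_ratios(cm):
    """Return (short ratio of type-1 errors, long ratio of type-2 errors).

    Type 1: true long predicted as short or neutral.
    Type 2: true short predicted as long or neutral.
    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    long_as_short = cm[LONG][SHORT]
    long_as_neutral = cm[LONG][NEUTRAL]
    short_as_long = cm[SHORT][LONG]
    short_as_neutral = cm[SHORT][NEUTRAL]

    type1_total = long_as_short + long_as_neutral
    type2_total = short_as_long + short_as_neutral
    # A smaller ratio means fewer errors in the opposite trading direction.
    type1_ratio = long_as_short / type1_total if type1_total else 0.0
    type2_ratio = short_as_long / type2_total if type2_total else 0.0
    return type1_ratio, type2_ratio


# Hypothetical counts, not taken from Table 6.
ratios = severe_error_ratios([[5, 3, 2],
                              [2, 20, 8],
                              [1, 4, 5]])
```

In this example, one of the five misclassified long samples is predicted as short, and two of the five misclassified short samples are predicted as long.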

5 Discussion and Conclusion

Our improved GAN model significantly improves on the existing DL methods in classifying the movement trend of financial time-series prices. We add ConvLSTM to our model as the generator, which has excellent time-series processing capabilities. The experimental results show that it is better than CNN and ConvLSTM alone in end-to-end classification. Furthermore, adding ordinal regression improves the classification results of our model, particularly by reducing the cases in which short is predicted as long and long is predicted as short, which are harmful in actual trading. Compared with GAN-FVT, ORGAN-FVT has been further optimised under these circumstances, so that our model improves the overall classification performance and can guide actual transactions. The experimental results also show that our ORGAN-FVT model has improved classification performance compared with the benchmark methods on datasets with different distribution characteristics. However, the proposed ORGAN-FVT model still has the following limitations: (1) the 11 technical indicators selected in this experiment may not be the best, and further research is required to optimise the indicator combination, because different combinations may have different effects on model performance; and (2) in this study, we investigated FTS only; whether the model can be applied to other time series is worth studying. A comparative analysis of the above factors is left for future work.