1 Introduction

Predicting stock market prices is a complex endeavour due to the intrinsic and tangled nature of the multitude of factors that influence the market, including macroeconomic conditions, natural events, and investor sentiment (Engle et al. 2013). Professional traders and researchers usually forecast price movements by understanding key market properties, such as volatility or liquidity, and by recognizing patterns that anticipate future market trends (Bouchaud et al. 2018). Effective mathematical models are essential for capturing such complex market dependencies. The recent surge in artificial intelligence has led to significant work on using machine learning algorithms to predict future market trends (Cao 2022; Jiang 2021; Sezer et al. 2020). Recent Deep Learning (DL) models have achieved over 88% F1-Score in predicting market trends in simulated settings using historical data (Tran et al. 2021). However, replicating these performances in real markets is challenging, suggesting a possible simulation-to-reality gap (Liu et al. 2022; Zaznov et al. 2022).

Due to the broad interest in this field, in recent years many researchers have proposed survey papers that study the vast literature on stock market prediction. Most of these surveys propose a meticulous classification and analysis of the existing literature and review the implemented models based on the results reported in the original papers. Nevertheless, most of these works consist of desk-based research, and only a few of them propose new experiments to validate the collected results. For the first time, in this paper, we benchmark the most recent and promising DL approaches to Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data, one of the most valuable information sources available to traders on the stock markets.

The LOB aggregates orders for shares of a given stock over time, characterizing each order by its associated price and volume of shares. SPTP is the problem of forecasting the stock price trend based on LOB data, classifying it as either upward, downward, or stable.

We compare novel data-driven approaches from Machine Learning (ML) and DL that analyze the market at its finest resolution, using high-frequency LOB data. Our benchmark evaluates their robustness and generalizability (Pineau et al. 2021; Gundersen and Kjensmo 2018; Baker 2016). In particular, we assess the models' robustness by comparing the stated performance with our reproduced results on the same dataset, FI-2010 (Ntakaris et al. 2018). We also assess their generalizability by testing their performance on unseen market scenarios using LOBSTER data [13].

Furthermore, we enrich our experiments by including classical ML tree-based models and ensemble methods. In addition, we provide a profit analysis by conducting a trading simulation. Our code is organized in a modular framework, called LOBCAST, which is openly accessible. Our findings reveal that while the best models exhibit robustness, achieving solid F1-Scores on FI-2010, they show poor generalizability, as their performance drops significantly when applied to unseen LOB market data. Our experiments show that most of the attention-based DL models outperform the other approaches. Our results provide insightful evidence of possible weaknesses of the current state of the art in SPTP, which allows us to add a critical discussion about how to improve models' generalizability, data labelling, and representation.

The main contributions of our work are the following:

  • We release a highly modular open-source framework called LOBCAST to pre-process data, train, and test stock market models. Our framework employs the latest DL libraries to provide researchers with an easy-to-use, performant, and maintainable solution. Furthermore, to support future studies, we release two meta-learning models and a backtesting environment for profit analysis.

  • We evaluate existing LOB-based stock market trend predictors, showing that most of them overfit the FI-2010 dataset with remarkably lower performance on unseen stock data.

  • In order to guide model selection in real-world applications, we evaluate the sensitivity of the models to the data labeling parameters, compare the performance of both DL and non-DL models, and evaluate and discuss the financial performance of existing models under different market scenarios.

  • We discuss the strengths and limitations of existing methodology and identify areas for future research toward more reliable, robust, and reproducible approaches to stock market prediction.

The remainder of this paper is organized as follows: in Sect. 2, we review existing work; in Sect. 3, we introduce the stock price trend prediction problem; in Sect. 4, we present all the machine learning models scrutinised in this work; in Sect. 5, we describe the datasets used to train the models for the task; in Sect. 6, we benchmark the analyzed approaches. In Sect. 7, we discuss future directions and the conclusions reached by our research. Finally, we devote Appendix 9 to a detailed description of the selected models, whose robustness and generalizability study is further enriched by the additional experimental results reported in Appendix 10.

2 Related work

The increasing interest in DL for price trend prediction motivated several researchers to collect and analyze State-Of-the-Art (SOTA) solutions in benchmark surveys. The study by Jiang (2021) analyzes papers published between 2017 and 2019 that focused on stock price and market index prediction. In their literature review, the authors studied the Neural Network (NN) structures and evaluation metrics used in the selected papers, as well as their implementation and reproducibility. This work was extended by Kumbure et al. (2022) with an in-depth analysis of the data (i.e., market indices and input variables used for stock market predictions). Ozbayoglu et al. (2020) provide a comprehensive overview of SOTA DL and ML algorithms that are commonly used for finance applications. The authors then survey numerous papers tackling some of these applications with DL and ML models, e.g., portfolio management, fraud detection, and risk assessment. With a similar approach, Sezer et al. (2020) summarize the most used DL models for several finance applications, including stock price forecasting. The work by Hu et al. (2021) surveys 86 papers on stock and foreign exchange price prediction. The authors review the datasets, variables, models, and performance metrics used in each surveyed article. In Nti et al. (2020), the authors conduct a systematic and critical review of 122 papers on stock prediction. To evaluate the results of the surveyed papers, the authors also implement three baseline DL and ML algorithms commonly exploited in the reviewed literature: Decision Trees (DTs), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). The authors show that the ANN achieves the best performance in terms of different error metrics, followed by DTs and SVM.

In contrast to the aforementioned works, which primarily survey and review the literature on the SPTP task in general, our focus is specifically on papers addressing this task using LOB data, as discussed in Sect. 3. Moreover, our work is not limited to evaluating the results reported in the surveyed papers or to validating them through classical baselines: we also evaluate the generalizability of the models by running tests on different datasets.

Several studies include sentiment analysis data for price trend prediction. The works by Shah et al. (2022) and Al-Alawi and Alaali (2023) analyze solutions based on sentiment analysis through Natural Language Processing (NLP) to investigate the impact of social media on the stock market, showing that this combination improves the accuracy of stock prediction models. Similar conclusions are reported in Nguyen et al. (2015), Li et al. (2014).

A research topic different from the one addressed in this paper is the design of recommender systems to select the most profitable stocks. This field of research relies on the observation that some investors might be interested in predicting the top-k most profitable stocks instead of price trends. Saha et al. (2021) propose a new measure for stock ranking prediction to maximize investors' profit. The work in Alsulmi (2022) explores ranking-based ML approaches and identifies a feature set, consisting of various statistics on the performance of stock market companies, that can be used to train several ranking models. Song et al. (2017) use two DL models to design learning-to-rank algorithms that construct equity portfolios based on news sentiment.

Rundo et al. (2019) presented a comprehensive overview of traditional and ML-based approaches for stock market prediction and highlighted some limitations of traditional approaches, showing that DL models outperform them in terms of accuracy. Similar findings are reported by Mintarya et al. (2023). Lim and Zohren (2021) discussed recent developments in hybrid DL models, which combine statistical and learning components for both one-step-ahead and multi-horizon time-series forecasting. Similarly, Shah et al. (2019) discussed hybrid approaches in their work on the SOTA algorithms commonly applied to stock market prediction. Additionally, they provided a taxonomy of computational approaches for stock market analysis and prediction. Olorunnimbe and Viktor (2023) explore applications of DL in finance and stock markets, with a particular emphasis on works whose backtesting meets the requirements for real-world use. They reviewed various scenarios of DL models in finance, with a focus on trade strategy, price prediction, portfolio management, and others. The authors also indicate whether the surveyed papers are reproducible. Nevertheless, reproducibility is not studied by means of experiments but only by checking whether the authors of the collected papers provided open-source code. Finally, Lucchese et al. (2022) study a similar problem, focusing on order book-driven predictability using deep learning techniques. They focus on the order book representation, including a novel representation of volumes, and they evaluate how price predictability varies with how far ahead the prediction is made. They meticulously address questions related to data representation, yet they focus their study on only three models. Differently from that work, our study considers a broader range of existing works, providing their implementation and a comparative analysis.

Several works exploit technical indicators such as moving average convergence/divergence, momentum analysis, on-balance volume, and the relative strength index to predict market trends with ML (see e.g., Fei and Zhou 2023; Lai et al. 2019; Ratto et al. 2018). In this paper, we benchmark works that propose models for stock price trend prediction relying on LOB data. While technical indicators offer explainability and can be effective for market prediction, our work studies models that only use market data at its finest granularity and investigates whether DL models can extract useful information from raw data, going beyond technical indicators.

Our work is the first to provide a benchmark of recent DL approaches applied to the SPTP task, utilizing LOB data. In contrast to previous work, we re-implement the surveyed papers to evaluate their robustness and generalizability. Furthermore, we have released an open-source framework designed for data preprocessing, model training, and testing. Our framework also incorporates profit analysis capabilities that users can exploit to test their own price trend prediction model.

3 The stock price trend prediction problem

The common ground that unifies the models studied in this paper is the goal of solving the SPTP problem via Deep Neural Networks (DNNs) trained on LOB data. LOB data are particularly enlightening as they provide raw and granular information on stock trades. By observing the LOB over a fixed period of time, SPTP models return a distribution over the possible future market movements.

Fig. 1: An example of LOB

3.1 Limit order book (LOB)

A stock exchange employs a matching engine for storing and matching the orders issued by the trading agents. This is achieved by updating the so-called Limit Order Book (LOB) data structure. Each security (tradable asset) has a LOB, recording all the outstanding bid and ask orders currently available on an exchange or a trading platform. The shape of the order book gives traders a simultaneous view of the market demand and supply.

There are three major types of orders. Market orders are executed immediately at the best available price. Limit orders, instead, include the specification of a desired target price: a limit sell [buy] order will be executed only when it is matched to a buy [sell] order whose price is greater [lower] than or equal to the target price. Finally, a cancel order removes a previously submitted limit order.

Figure 1 depicts an example of a LOB snapshot, characterized by buy orders (bid) and sell orders (ask) at different prices. A level, shown on the horizontal axis, groups the shares with the same price on either the bid or the ask side. In the example of Fig. 1, there are three bid and three ask levels. The best bid is the highest price on the buy side; analogously, the best ask is the lowest price on the sell side. When the former exceeds or equals the latter, the corresponding limit ask and bid orders are executed. The LOB is updated with each event (order insertion/modification/cancellation) and can be sampled at regular time intervals.

In Huang and Stoll (1994), Tran et al. (2022), Pascual and Veredas (2003), Cao et al. (2008), Cao et al. (2009), Duong and Kalev (2014), it has been empirically demonstrated, using both linear and non-linear models, that the orders behind the best bid and ask prices have a significant impact on price discovery and contain information about short-term future price movements, supporting the hypothesis that leveraging deeper levels of the limit order book is essential for improving the performance on SPTP tasks. This is the main reason for not restricting the input to the best levels. Additionally, deep learning models are well suited to handle high-dimensional input.

We represent the evolution of a LOB as a time series \({\mathbb {L}}\), where each \({\mathbb {L}}(t) \in {\mathbb {R}}^{4L}\) is called a LOB record, for \(t=1, \ldots , N\), where N is the number of LOB observations and L the number of levels. In particular,

$$\begin{aligned} {\mathbb {L}}(t) = \{P^s(t), V^{s}(t)\}_{s\in \{\texttt {ask}, \texttt {bid}\}} \end{aligned}$$

where \(P^{\texttt {ask}}(t), P^\texttt {bid}(t) \in {\mathbb {R}}^{L}\) represent the prices of levels 1 to L of the LOB, on the ask (\(s=\texttt {ask}\)) and bid (\(s=\texttt {bid}\)) side, respectively, at time t. Analogously, \(V^\texttt {ask}(t), V^\texttt {bid}(t) \in {\mathbb {R}}^{L}\) represent the volumes. This means that, for each t and every \(j\in \{1,\ldots ,L\}\), on the ask side \(V^\texttt {ask}_{j}(t)\) shares are offered for sale at price \(P^\texttt {ask}_{j}(t)\). The mid-price m(t) of the stock at time t is defined as the average of the best bid and the best ask,

$$\begin{aligned} m(t) = \frac{P^\texttt {ask}_{1}(t) + P^\texttt {bid}_{1}(t)}{2} \end{aligned}$$

On average, if most of the executed orders are on the ask [bid] side, the mid-price increases [decreases] accordingly.

3.2 Trend definition 

We use a ternary classification for trends: U (“upward”) if the price trend is increasing; D (“downward”) for decreasing prices; and S (“stable”) for prices with negligible variations. Among all the possible single values, mid-prices provide the most reliable indication of the actual stock price in equity markets. Nevertheless, because of the market's inherent fluctuations and shocks, they can exhibit highly volatile trends. For this reason, directly comparing consecutive mid-prices, i.e., m(t) and \(m(t+1)\), for stock price labelling would result in a noisily labelled dataset. As a result, labelling strategies typically employ smoother mid-price functions instead of raw mid-prices. Such functions consider mid-prices over arbitrarily long time intervals, called horizons. Our experiments adopt the labelling proposed in Ntakaris et al. (2018) and reused in several of the SOTA solutions we selected for benchmarking. The adopted labelling strategy compares the current mid-price to the average mid-price \(a^+(k, t)\) over a future horizon of k time units, formally:

$$\begin{aligned} a^+(k,t)= \frac{1}{k} \sum _{i=1}^{k} m(t+i). \end{aligned}$$
(1)

A static threshold \(\theta \in (0,1)\) identifies an interval around the current mid-price; the class of the trend at time t is then defined by comparing the average mid-price \(a^+(k,t)\) with this interval:

$$\begin{aligned} \texttt {U}:\,&a^+(k,t) > m(t)(1+\theta ), \\ \texttt {D}:\,&a^+(k,t) < m(t)(1-\theta ), \\ \texttt {S}:\,&a^+(k,t) \in [m(t)(1-\theta ), m(t)(1+\theta )]. \end{aligned}$$
(2)

This labelling mitigates the effect of mid-price fluctuations by averaging them over the desired horizon k and by considering a trend stable when the average mid-price does not change significantly, thus limiting overfitting to noisy labels. We highlight that the timestamp t can come either from a homogeneous or an event-based process. In our experiments, we consider the latter approach; hence, the horizon k is expressed as a number of future events.
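To make the labelling procedure concrete, the following sketch applies Eqs. 1 and 2 to an event-sampled mid-price series (illustrative code; function and variable names are ours and do not reflect LOBCAST's API):

```python
import numpy as np

def label_trends(mid_prices: np.ndarray, k: int, theta: float) -> np.ndarray:
    """Assign U (+1), S (0), D (-1) labels following Eqs. 1 and 2.

    mid_prices: 1D array of mid-prices m(1), ..., m(N), event-sampled.
    k: prediction horizon (number of future samples).
    theta: labelling threshold.
    Returns an array of length N - k (the last k points cannot be labelled).
    """
    n = len(mid_prices)
    labels = np.zeros(n - k, dtype=int)
    for t in range(n - k):
        a_plus = mid_prices[t + 1 : t + k + 1].mean()  # a^+(k, t), Eq. 1
        m_t = mid_prices[t]
        if a_plus > m_t * (1 + theta):      # upward trend (Eq. 2)
            labels[t] = 1
        elif a_plus < m_t * (1 - theta):    # downward trend
            labels[t] = -1
        else:                               # stable
            labels[t] = 0
    return labels
```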

3.3 Models I/O

Given the time series of a LOB \({\mathbb {L}}\) and a temporal window \(T = [t-h, t]\), \(h\in {\mathbb {N}}\), we can extract market observations on T, \({\mathbb {M}}(T)\), by considering the sub-sequence of LOB observations starting from time \(t-h\) up to t. Fig. 2 gives a representation of a market observation \({\mathbb {M}}(T) \in {\mathbb {R}}^{h\times 4L}\). The market observation over the window \([t-h,t]\) is associated with the label computed through Eqs. 1 and 2 at time t. An SPTP predictor takes as an input a market observation and outputs a probability distribution over the trend classes U, D, and S.
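For illustration, below is a minimal sketch of how market observations and labels can be paired with a sliding window, assuming the per-time-step labels of Eq. 2 have already been computed (names are ours, not the framework's):

```python
import numpy as np

def build_observations(lob: np.ndarray, labels: np.ndarray, h: int):
    """Pair sliding windows of LOB records with trend labels.

    lob: array of shape (N, 4L), one LOB record per sampled event.
    labels: per-time-step labels aligned with `lob` (see Eqs. 1-2).
    h: window length (the temporal shape of the market observation).
    Returns X of shape (num_windows, h, 4L) and y of shape (num_windows,).
    """
    X, y = [], []
    for t in range(h - 1, len(labels)):
        X.append(lob[t - h + 1 : t + 1])  # h records ending at time t
        y.append(labels[t])               # label computed at time t
    return np.stack(X), np.array(y)
```

An SPTP predictor then maps each such observation to a probability distribution over {D, S, U}.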

Fig. 2: An example of market observation

4 Models

We selected and surveyed 13 SOTA models based on DL for the SPTP task using LOB data. Model selection was based on prominence in the literature and widespread usage. These models represent a diverse set, covering various DL architectures and methods; they were published between 2017 and 2022 and are described in detail in Appendix 9. We also include in our analysis two additional baselines, namely a Multilayer Perceptron (MLP) and a Long Short-Term Memory (LSTM) network, which were used as benchmarks in Tsantekidis et al. (2017a) and Tsantekidis et al. (2020), respectively. All selected models are based on DNNs and were originally trained and tested on the FI-2010 dataset. We also study two ensemble methods, described in Sect. 4.2. Table 1 summarizes the most distinctive characteristics of the studied models, which we comment on in Sect. 4.1.

4.1 Summary of models

Table 1 summarizes the most distinctive characteristics of the selected models. The temporal shape represents the length of the input market observation for the model, while the features shape refers to the number of features used by the model to infer the trend in the original paper. In the table, we also indicate whether the authors released the code and, if so, whether they used PyTorch Paszke et al. (2019) or TensorFlow Abadi et al. (2016). This is relevant because, to ensure consistency and compatibility within our proposed framework, which is based on PyTorch Lightning, we had to re-implement the models whose code was not available or was only available in TensorFlow. We made every effort to validate and verify the correctness of our re-implementations, including making our code publicly available, which facilitates collaboration and scrutiny from the research community. We remark that, to improve the reproducibility of the results, it would be advisable for the research community to publish the code developed to carry out the experiments.

In High-Frequency Trading (HFT) and algorithmic trading in general, minimizing the latency between model querying and order placement is of utmost importance Gomber and Haferkorn (2015). To explore this aspect, we analyzed the inference time in milliseconds of all models, based on the experiments reported in Sect. 6.3. As shown in Table 1, DEEPLOB, DEEPLOBAT, AXIALLOB, TRANSLOB, and ATNBoF have inference times in the order of milliseconds, making them potentially unsuitable for HFT applications compared to the other models with shorter inference times. Finally, we report the number of trainable parameters for each model. A noteworthy observation is that the average number of parameters is very low compared to other classical deep learning fields, such as computer vision He et al. (2016) and natural language processing (Devlin et al. 2018; Brown et al. 2020). This leads us to conjecture that current systems are inadequate in effectively handling the complexity of LOB data, as we will verify in the rest of this paper.

Table 1 Relevant characteristics of the selected models

4.2 Ensemble methods

To explore the possibility of achieving new SOTA performance by combining the predictions of all 15 models, we have implemented two ensemble methods: MAJORITY and METALOB.

The MAJORITY ensemble assigns the class label that appears most frequently among the predictions of the classifiers. To account for variations in the performance of individual classifiers, we incorporate a weighting scheme based on their F1-Scores. This ensures that predictions from higher-performing models carry more influence in the final decision.
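A minimal sketch of the F1-weighted voting described above (function and variable names are ours):

```python
import numpy as np

def weighted_majority(predictions: np.ndarray, f1_scores: np.ndarray) -> np.ndarray:
    """Combine per-model class predictions with F1-weighted voting.

    predictions: array of shape (num_models, num_samples) with class ids in {0, 1, 2}.
    f1_scores: array of shape (num_models,) with the F1-Score of each model.
    Returns the ensemble prediction for each sample.
    """
    num_models, num_samples = predictions.shape
    votes = np.zeros((num_samples, 3))
    for m in range(num_models):
        # each model adds a vote proportional to its F1-Score
        votes[np.arange(num_samples), predictions[m]] += f1_scores[m]
    return votes.argmax(axis=1)
```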

The METALOB meta-classifier is implemented as a multilayer perceptron (MLP) with two fully connected layers. It is designed to learn how to effectively combine the outputs of the 15 DL models, which serve as the base classifiers to produce the final output. The input to the meta-classifier is a 1D tensor with a probability distribution over the trends (up, stationary, down) for each of the models, resulting in a tensor of \(3\cdot 15\) elements.
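A possible implementation of the meta-classifier consistent with this description is sketched below; the hidden size is our own choice, as it is not specified here:

```python
import torch
import torch.nn as nn

class MetaLOB(nn.Module):
    """MLP meta-classifier combining the class probabilities of the base models."""

    def __init__(self, num_models: int = 15, num_classes: int = 3, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_models * num_classes, hidden_dim),  # first fully connected layer
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),               # second fully connected layer
        )

    def forward(self, base_probs: torch.Tensor) -> torch.Tensor:
        # base_probs: (batch, num_models * num_classes), i.e. the stacked
        # probability distributions over the trends produced by the base models
        return self.net(base_probs)  # unnormalized class scores (logits)
```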

5 Datasets

LOB data are rarely publicly available and are very expensive: stock exchanges (e.g., NASDAQ) provide fine-grained data only for high fees. The high cost and low availability restrict the application and development of DL algorithms in the research community. In the following sections, we introduce the two datasets used to analyze the models' performance from the robustness and generalizability points of view: FI-2010 and LOB-2021/2022, respectively.

5.1 FI-2010 to test robustness

The most widespread public LOB dataset is FI-2010, which is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0) and was proposed in 2017 by Ntakaris et al. (2018) with the objective of evaluating the performance of machine learning models on the SPTP task. The dataset consists of LOB data from five Finnish companies of the NASDAQ Nordic stock market: Kesko Oyj, Outokumpu Oyj, Sampo, Rautaruukki, and Wärtsilä Oyj. Data spans the period from June 1st to June 14th, 2010, corresponding to 10 trading days (trading happens only on business days). About 4 million limit order messages are stored for ten levels of the LOB. The dataset has an event-based granularity, meaning that the time series records are not uniformly spaced in time. LOB observations are sampled at intervals of 10 events, resulting in a total of 394,337 samples. This dataset has the intrinsic limitation of being already pre-processed (filtered, normalized, and labelled), so that the original LOB cannot be backtracked, thus hampering thorough experimentation. Additionally, the labelling method employed is prone to instability, as demonstrated in Zhang et al. (2019).

Table 2 Class balancing on FI-2010

The dataset provides the time series and the classes relative to five horizons \(k \in {\mathcal {K}} = \{1, 2, 3, 5, 10\}\), leveraging the trend definitions described in Eq. 2. The class balance between “upward”, “downward”, and “stable” trends resulting from such a labelling scheme is very sensitive to the threshold \(\theta\). Table 2 shows the class balancing for different horizons \(k\in {\mathcal {K}}\). The authors of the dataset employed a single threshold \(\theta = 0.002\) for all horizons, which balances the classes only for \(k = 5\). As k increases, the stationary class S becomes progressively less predominant in favour of the upward and downward classes. In our experimental campaign, the class imbalance is not addressed, to guarantee a fair robustness evaluation, since the considered works do not claim to have done so.

Fig. 3: Stock returns from day 0 for LOB-2021/2022

Table 3 LOB-2021 stocks main features
Table 4 Class balancing on LOB-2021 and LOB-2022

5.2 LOB-2021/2022 to test generalizability

To test the generalizability of the models in a more realistic scenario, we used data extracted from LOBSTER [13], an online provider of order book data, available to the research community for an annual fee. LOBSTER limit order books are reconstructed directly from NASDAQ-traded stocks. To compare the performance of the algorithms in a wide range of scenarios, we created a large LOB dataset including several stocks and time periods. The chosen pool includes stocks from the top 50% most liquid NASDAQ stocks. To construct a diversified evaluation scenario, we selected six stocks, namely: SoFi Technologies (SOFI), Netflix (NFLX), Cisco Systems (CSCO), Wingstop (WING), Shoals Technologies Group (SHLS), and Landstar System (LSTR). The periods under consideration are July 2021 (2021-07-01 to 2021-07-15, 10 trading days), making up LOB-2021, and February 2022 (2022-02-01 to 2022-02-15, 10 trading days), making up LOB-2022. These two periods were selected to capture data with different levels of market volatility: February 2022 exhibited higher volatility than July 2021, largely influenced by the Ukrainian crisis. This allows for an assessment of the models across varying market conditions. Further details on stock selection and processing are provided in the following two paragraphs.

5.2.1 Stocks selection

To build LOB-2021 and LOB-2022 and consider varied evaluation scenarios, we curated a pool of 630 stocks from the NASDAQ exchange with market capitalizations ranging from \(\sim 2\) billion to \(\sim 3\) trillion dollars. Data was gathered from the NASDAQ Stock Screener (Nasdaq, https://www.nasdaq.com/market-activity/stocks/screener). From this pool, we generated 6 clusters with t-distributed Stochastic Neighbor Embedding (t-SNE) to capture stock differences in the years 2021-2023. We used the following features: daily return, hourly return, volatility, outstanding shares, P/E ratio, and market capitalization. The P/E ratio is the ratio between the price of a stock (P) and the company's annual earnings per share (E). The analysis led to the identification of the 6 stocks nearest to the cluster centroids of the generated 3-dimensional latent space. These stocks make up the set \({\mathcal {S}}=\{\)SOFI, NFLX, CSCO, WING, SHLS, LSTR\(\}\). Table 3 reports the main features of these stocks for the period of July 2021. The selected stocks have highly variable average daily returns, with SHLS having the lowest and NFLX the highest. Daily and hourly returns highlight that some stocks are more volatile than others. The market capitalization represents the total value of the outstanding common shares owned by stockholders. The stocks show different class balancing in the training set: CSCO is the stock with the largest imbalance toward the stable class, whereas NFLX and LSTR are more imbalanced towards the up and down classes, respectively. In Sect. 6, we analyze the reasons behind the class imbalance specific to individual stocks and discuss its impact. The mid-price movements of the selected stocks for these two periods are depicted in Fig. 3.

5.2.2 Stock processing

We build LOB-2021 and LOB-2022 resembling the structure of the FI-2010 dataset, described in the previous section and proposed in Ntakaris et al. (2018). In particular, to generate the LOB-2021/2022 datasets, we utilize data from the LOBSTER data provider, which consists of LOB records (i.e., \({\mathbb {L}}(t)\) vectors) resulting from events caused by traders at the exchange. LOBSTER associates these records with the specific events that caused changes in the LOB. We isolated the following types of events: order submissions, deletions, and executions, which account for almost all the events in the markets.

For each stock s in the set \({\mathcal {S}}\), we construct a stock time series of LOB records \({\mathbb {L}}_s(t) \in {\mathbb {R}}^{4L}\), with \(L=10\) and \(t \in [1, N_s]\), where \(N_s\) is the number of records of stock s in the considered temporal interval (e.g., 2021-07-01 to 2021-07-15 for LOB-2021). We recall that the \(4\cdot 10\) features represent the prices and volumes on the buy and sell sides for the ten levels of the LOB. We highlight that the time series \({\mathbb {L}}_s\) are non-uniform in time, since LOB events can occur at irregular intervals driven by traders' actions. We do not impose temporal uniformization; instead, we sample the market observation every ten events, similarly to the processing performed for the FI-2010 dataset. Furthermore, we do not account for liquidity beyond the 10th level of the LOB. This approximation is necessary to ensure computational tractability while retaining the most influential levels, and it is commonly employed in stock market prediction models, including FI-2010.

Each stock time series \({\mathbb {L}}_s\) is split into training, validation, and testing sets using a 6-2-2 days split. Normalization is performed on the stock time series using a z-score approach, normalizing prices and volumes separately. The mean and standard deviation are calculated from the union of the training and validation splits of all stock time series. These statistics are then used to normalize the entire dataset, including the test splits. The final dataset is constructed by vertically stacking (i.e., concatenating along the rows) the six training splits (one for each stock), the six validation splits, and the six test splits, in this order.
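A sketch of this normalization step is given below, assuming a single mean and standard deviation per feature group (prices, volumes), computed on the union of the training and validation splits; names are ours:

```python
import numpy as np

def zscore_normalize(train_val: np.ndarray, full: np.ndarray,
                     price_cols: np.ndarray, volume_cols: np.ndarray) -> np.ndarray:
    """Z-score normalization of LOB records, separately for prices and volumes.

    train_val: (n, 4L) records from the training + validation splits (statistics source).
    full: (N, 4L) records of the whole dataset (train + validation + test) to normalize.
    price_cols / volume_cols: column indices of the price and volume features.
    """
    out = full.astype(float).copy()
    for cols in (price_cols, volume_cols):
        mu = train_val[:, cols].mean()
        sigma = train_val[:, cols].std()
        out[:, cols] = (full[:, cols] - mu) / sigma
    return out
```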

The dataset is used to extract market observations with a sliding window approach, as explained in Sect. 3. The training set is randomly permuted according to the standard procedure adopted by the SOTA papers in this field.

Labelling market observations is accomplished by leveraging the trend definitions described in Eq. 2, mapping each market observation to the corresponding trend based on a predefined prediction horizon \(k\in {\mathcal {K}}\). It is important to note that a new dataset is generated for each prediction horizon; consequently, LOB-2021 and LOB-2022 consist of five (i.e., \(|{\mathcal {K}}|\)) distinct datasets, each corresponding to one of the five prediction horizons. As for the FI-2010 dataset, we chose the labelling threshold \(\theta\) of Eq. 2 such that the resulting dataset is balanced for \(k=5\). Table 4 reports the class balancing of the LOB-2021/2022 datasets for each horizon and each split. For a fair comparison, the resulting balancing is similar to that of the FI-2010 dataset, shown in Table 2.
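As an illustration of how such a balancing threshold can be chosen, the following heuristic (ours, not necessarily the exact procedure used to build the datasets) picks \(\theta\) so that roughly one third of the samples fall in the stable class, assuming approximately symmetric upward and downward moves:

```python
import numpy as np

def balancing_threshold(mid_prices: np.ndarray, k: int) -> float:
    """Return a theta making the stable class hold roughly one third of the samples."""
    n = len(mid_prices)
    rel_change = np.empty(n - k)
    for t in range(n - k):
        a_plus = mid_prices[t + 1 : t + k + 1].mean()    # a^+(k, t), Eq. 1
        rel_change[t] = abs(a_plus / mid_prices[t] - 1)  # relative future move
    # theta at the 1/3 quantile: ~1/3 of samples fall inside [m(1-theta), m(1+theta)]
    return float(np.quantile(rel_change, 1 / 3))
```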

5.3 Data distribution shift

LOB time series are susceptible to distribution shift due to the dynamic nature of the stock markets. This implies that, when used in production, the input data seen by the model may deviate from the data it was trained on Bennett and Clarkson (2022). Such shifts can lead to serious consequences that directly translate into misclassification and significant economic losses if the model is not effectively monitored. It is important to note that the primary focus of this work, and of the considered studies, is on the trend classification task. Deploying these models into production demands additional effort, particularly in addressing challenges like distribution shift. Being able to observe the market over time allows for the collection of new data, facilitating continual model retraining, data weighting to forget the past, and fine-tuning. These methods should be adopted in conjunction with the usual techniques to prevent overfitting, such as early stopping, data augmentation, and regularization.

6 Experiments

We conducted an extensive evaluation to assess the robustness and generalizability of 15 DL models on the SPTP task, as presented in Sect. 3. Among these, 13 are SOTA models and 2 are DL baseline models commonly used in the literature. More details on the models are given in Sect. 4 and in Appendix 9.

In line with many other studies, we adopt the definitions of robustness and generalizability introduced by Pineau et al. (2021). Robustness is the ability of a model to replicate its performance when tested on the same data but under different analyses, such as re-implementation of the code, testing on a different computer architecture, and other contextual changes. Generalizability is the ability of a model to replicate its performance when tested on different data and with different analytical tools. Robustness is evaluated by testing the proposed models on FI-2010, the benchmark dataset employed in all surveyed papers. To evaluate generalizability, we use LOB-2021 and LOB-2022, retrieved from the LOBSTER data provider [13]. In some cases, the authors of the considered works did not provide crucial information, such as the code or the hyperparameters of their models, making re-implementation and hyperparameter search necessary. Our experiments were carried out using LOBCAST [60], the open-source framework we developed and made available online. The framework allows the definition of new price trend predictors based on LOB data.

In Sect. 6.1, we describe our framework and its potential applications. In Sect. 6.2, we introduce a complete description of the hyperparameter search. In Sect. 6.3, we discuss the results of the robustness (Sect. 6.3.1) and generalizability (Sect. 6.3.2) studies, with a focus on the performance of ensemble methods (Sect. 6.3.3). In Sect. 6.4, we expand on the performance achieved on the SPTP task when adopting different labelling strategies (Sect. 6.4.1) and when using non-deep models (Sect. 6.4.2). We conclude with a profitability study in Sect. 6.4.3.

6.1 LOBCAST framework for SPTP

We present LOBCAST [60], a Python-based framework developed for stock market trend forecasting using LOB data. LOBCAST is an open-source framework that enables users to test DL models for the SPTP task. LOBCAST contains the implementation of the 15 DL models that were used in the experiments. We believe that LOBCAST, along with the advancements in DL models and the utilization of LOB data, has the potential to improve the state of the art on price trend forecasting in the financial domain.

6.1.1 Applications and features

The core application of LOBCAST is to provide a standardized benchmarking of DL-based models for the SPTP task. There are several choices to make while implementing a DL model for SPTP. We collected these choices in LOBCAST to ease the development and evaluation of new models. This not only offers a methodological and standardised approach to addressing the problem but also simplifies the comparison between these choices.

LOBCAST features include: (1) LOB data pre-processing utilities dealing with normalization, splitting, and labelling. (2) A training environment for DL models implemented in PyTorch Lightning Paszke et al. (2019). (3) Integrated interfaces with the popular hyperparameter tuning framework WANDB Biewald (2020), which allows users to tune and optimize model performance efficiently. (4) Generation of detailed reports for the trained models, including performance metrics for the learning task (F1-Score, Accuracy, Recall, etc.). (5) Generation of reports measuring the complexity of the models in terms of number of parameters, inference time, and training time. (6) Support for backtesting for profit analysis, utilizing the Backtesting.py library. (7) The PyTorch implementation of the SOTA models used in the experiments.

All these features are implemented through a modular and scalable framework that can be easily expanded with new models and components.

6.2 Hyperparameters search

For evaluating the robustness of the surveyed models, we used the hyperparameters reported in the original papers whenever they were available. However, we encountered cases where hyperparameters were not declared at all, such as in LSTM Tsantekidis et al. (2017a) and CNN1 Tsantekidis et al. (2017b), while in other cases, including CNNLSTM Tsantekidis et al. (2020), AXIALLOB Kisiel and Gorse (2022), ATNBOF Tran et al. (2022) and DAIN Passalis et al. (2019) only partial information was provided. To address these gaps, we performed a grid search exploring different values for the batch size, including \(\{16, 32, 64, 128, 256\}\) and the learning rate, including \(\{0.01, 0.001, 0.0001, 0.00001\}\).

Regarding the generalizability experiment, we found that the majority of models using the hyperparameters from the robustness experiment performed poorly on the LOB-2021/2022 datasets. So, we conducted a comprehensive hyperparameter search on horizon \(k=5\) (which is the most balanced) using a grid search approach for all 16 models. For this search, we maintained the same number of epochs and optimizer used in the robustness analysis while searching for batch size and learning rate using the same domains mentioned above. F1-Score maximization over the validation set was the chosen criterion for optimizing the hyperparameters. For a complete overview of the hyperparameters utilized in our experiments, refer to Table 5.
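Schematically, the search can be expressed as follows, where `train_and_validate` is a placeholder for a full training run returning the validation F1-Score (a simplified sketch, not the framework's exact interface):

```python
from itertools import product

BATCH_SIZES = [16, 32, 64, 128, 256]
LEARNING_RATES = [0.01, 0.001, 0.0001, 0.00001]

def grid_search(train_and_validate) -> dict:
    """Return the (batch size, learning rate) pair maximizing validation F1-Score."""
    best = {"f1": -1.0, "batch_size": None, "learning_rate": None}
    for batch_size, lr in product(BATCH_SIZES, LEARNING_RATES):
        f1 = train_and_validate(batch_size=batch_size, learning_rate=lr)
        if f1 > best["f1"]:
            best = {"f1": f1, "batch_size": batch_size, "learning_rate": lr}
    return best
```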

Table 5 Hyperparameters adopted in our experiments

6.3 Performance, robustness and generalizability

To test robustness and generalizability, we conducted our experiments for each model using five different seeds to mitigate the impact of random initialization of network weights and training dataset shuffling. The necessity to produce reliable results within reasonable time constraints led us to opt for a set of five seeds. The training process involved training the 15 models for each seed on each of the considered prediction horizons (\({\mathcal {K}} = \{1, 2, 3, 5, 10\}\)). On average, the training process for all the models took approximately 155 h for FI-2010 and 258 h for each LOB dataset, utilizing a cluster comprised of 8 GPUs (1 NVIDIA GeForce RTX 2060, 2 NVIDIA GeForce RTX 3070, and 5 NVIDIA Quadro RTX 6000).

In Table 6, we summarize the results of our experiments. Our choice of the F1-Score as the evaluation metric was motivated by the following reasons: (1) it captures both precision and recall in a single value; (2) the datasets are not well balanced, and the F1-Score is robust to the class imbalance problem that affects the accuracy measure; (3) it is the only metric reported in every SOTA paper. The table compares the claimed performance of each system (column F1 Claim) with those measured in the robustness (FI-2010) and generalizability (LOB-2021 and 2022) experiments. For each dataset, we show the average performance and the standard deviation achieved by each model over all the horizons, along with its rank.

Table 6 Robustness, generalizability, and performance scores of the models. Arrows indicate whether the measured F1-Score of a system is higher or lower than stated in the original paper. Colour saturation highlights systems with best (green) and worst (red) robustness and generalizability scores

To evaluate the robustness and the generalizability of the models, we compute the robustness and the generalizability scores, each a value \(\le 100\) computed as \(100 - (|A| + S)\), where A and S are defined as follows. A is the average difference between the F1-Score reported in the original paper and the one we observed in our experiments, on FI-2010 for robustness and on LOB-2021 and LOB-2022 for generalizability. S is the standard deviation of these differences. By subtracting the standard deviation, the score penalizes models that exhibit higher variability in their performance. The average and standard deviation were computed over the horizons declared for each model and over all five seeds.
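In code, the score can be computed as a direct transcription of the formula above (F1-Scores expressed in percentage points; names are ours):

```python
import numpy as np

def reproducibility_score(claimed_f1: np.ndarray, measured_f1: np.ndarray) -> float:
    """Robustness/generalizability score: 100 - (|A| + S).

    claimed_f1: F1-Scores (in %) reported in the original paper, one entry per
                declared horizon and seed.
    measured_f1: F1-Scores (in %) measured in our experiments, aligned with claimed_f1.
    """
    diff = measured_f1 - claimed_f1
    A = diff.mean()  # average difference
    S = diff.std()   # standard deviation of the differences
    return 100.0 - (abs(A) + S)
```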

Table 6 clearly highlights the following:

  1. Except for a few systems, there is a considerable difference between the claimed performances and those measured in both the robustness and generalizability experiments. Note that while the performance gap is negative on average, and considerably negative in the scenario of LOB-2021 and 2022, a few systems outperform the claimed results, as highlighted by the arrows in Table 6.

  2. All models are very sensitive to hyperparameters; in fact, for about half of the runs, they diverged (F1-Score \(\leqslant 33\%\)) during the hyperparameter search.

  3. The ranking of the systems changes considerably if we compare the declared performances with those measured in our experiments. On the other hand, the best six systems on FI-2010 remain the same on LOB-2021 and 2022.

  4. The best-ranked systems do not consistently hold the lead in terms of robustness and generalizability, except for BINCTABL. On the contrary, some of them obtained poor generalizability scores, suggesting that they overfitted the FI-2010 dataset.

  5. Five of the best six models incorporate attention mechanisms. In particular, the best-performing model is BINCTABL, which enhances the original CTABL model by adding an Adaptive Bilinear Normalization layer, enabling joint normalization of the input time series along both the temporal and feature dimensions. On average, BINCTABL improves the F1-Score by up to \(9.2\%\) compared to DLA, i.e., the second-best model, and by up to \(13\%\) compared to CTABL.

  6. Regrettably, the ensemble models (the last two rows in Table 6) do not exceed the performance of the top-performing models, probably due to the relatively high agreement rate among the systems, as shown in Fig. 13 and Fig. 14 in Appendix 10 of the supplementary material.

Fig. 4: Confusion matrices for BINCTABL \((k=1,5,10)\) on FI-2010 and LOB-2021 datasets

6.3.1 Robustness on FI-2010

As far as the robustness experiments are concerned, it is important to note that some models discussed in the literature incorporate additional market observation features for their predictions. This is the case for the models DAIN, CNNLSTM, TLONBoF, and DLA. To ensure a fair comparison among the models, we included them in our study but reduced their feature set to only the 40 raw LOB features. Due to the presence of these additional features, a strict robustness study could not be conducted for these models. However, the reduction of features did not necessarily cause a deterioration in performance: of particular interest is the case of CNNLSTM, for which the authors used stationary features derived from the LOB, stating that they were better than the raw ones. Impressively, CNNLSTM achieves the greatest average improvement of \(20.9\%\) among all the models, proving that, for this model, the raw LOB features are better suited to forecasting the mid-price movement than the features proposed by the original authors.

Based on these experiments (summarized in Table 6), the BINCTABL model demonstrates the highest F1-Score when averaged over the seeds and prediction horizons, achieving an average of \(82.6\% \pm 7.0\). Notably, the BINCTABL model also exhibits the strongest robustness score of 99.7. For a more comprehensive analysis, Fig. 4 provides the confusion matrices of the BINCTABL model’s predictions for three horizons (\(k=1\), \(k=5\), and \(k=10\)). The confusion matrices demonstrate that the model is slightly biased toward the stationary class. This pattern is consistent across all the models, especially for the first three horizons, reflecting the imbalance of the dataset towards the stationary class, as specified in Sect. 5.

Remarkably, a significant number of models in our study failed to achieve the claimed performance levels. Two possible reasons are the lack of the original code and the missing hyperparameter declarations. Among the models, TRANSLOB and ATNBoF exhibit the largest discrepancies, ranking as the second-worst and worst performers, respectively. Notably, ATNBoF performs the poorest among all models, both in terms of robustness score and F1-Score.

We observed that CNN1, CNN2, CNNLSTM, TLONBoF, and DLA are the most sensitive models with respect to network weight initialization and dataset shuffling; in fact, these models exhibit a standard deviation over the runs that exceeds 5 basis points, indicating a high degree of variability in their performance. Finally, we highlight that none of the top three models in our study uses the temporal shape \(h=100\) for the input market observations, despite it being a common practice in the literature (Zhang et al. 2019; Tsantekidis et al. 2017a, b; Wallbridge 2020; Tran et al. 2022), meaning that they achieve good results without relying on a large historical context. This suggests that the most influential and relevant dynamics impacting their predictions tend to occur within a short time frame.

Fig. 5: Evaluation metrics on different horizons \({\mathcal {K}}\) on FI-2010 dataset

Figure 5 depicts the F1-Score, Accuracy, Precision, Recall, and MCC of the surveyed models, obtained through LOBCAST, for the time horizons \({\mathcal {K}}= \{1,2,3,5,10\}\). Most of the models show similar behaviour with respect to the prediction horizon. Specifically, concerning the F1-Score, the lowest performance is observed at \(k=2\). However, an improvement is evident as the value of k increases, indicating a longer prediction horizon. This may appear counterintuitive, as higher values of k imply forecasting the price trend further into the future. However, for very short horizons, the adopted labelling system may be susceptible to noise, affecting the model's capability to extract relevant patterns. This hypothesis is supported by an experiment in Zhang et al. (2019), in which the authors tried a smoother labelling method and reported a significant decline in the performance of their deep learning model as the prediction horizon increased.

Fig. 6: PR-curve on FI-2010 for time horizon \(k=10\)

Fig. 6 shows the PR-curves of all models for the time horizon \(k=10\), where the classes {U, S, D} are distributed as \(\{37\%,25\%,38\%\}\). Since the models perform a ternary classification, each curve represents the micro-averaged precision-recall values and is generated by setting different thresholds for the classification. Thresholds determine the number of false negatives and false positives, affecting the resulting values of Precision and Recall. The best models are the ones with the largest area under the curve, as they make the most accurate predictions (high Precision) while minimizing the false negative rate (high Recall). The figure also shows the iso-F1 curves on the PR plane. The best-performing model is BINCTABL, with an area under the curve of 8680.

To further compare the performance of the models, we also conducted a T-test for each pair of models, reporting the p-values in Table 10 (Appendix 10). The sample of scores for each model consists of the F1-Scores obtained by varying the random seed. We state the null hypothesis \(h_0\) as: there is no statistical difference between the average performance of the two models. We highlight in bold the values exceeding the threshold \(\alpha = 0.05\), i.e., the pairs of models for which \(h_0\) cannot be rejected, that is, for which the difference in performance is not statistically significant.
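The pairwise comparison can be reproduced with a standard two-sample T-test over the per-seed F1-Scores, e.g. with scipy (a sketch; names are ours):

```python
from scipy.stats import ttest_ind

def compare_models(f1_scores_a, f1_scores_b, alpha: float = 0.05):
    """Two-sample T-test on the per-seed F1-Scores of two models.

    Returns the p-value and whether the null hypothesis h0 (equal average
    performance) cannot be rejected at level alpha.
    """
    _, p_value = ttest_ind(f1_scores_a, f1_scores_b)
    h0_not_rejected = p_value > alpha  # difference not statistically significant
    return p_value, h0_not_rejected
```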

In Appendix 10.1, we show a different representation of the models' performances at varying prediction horizons and the agreement matrix of their predictions.

6.3.2 Generalizability on LOB-2021/2022

When comparing the performance of the models on the FI-2010 and LOB-2021/2022 datasets, we observe that models showing high performance on FI-2010 exhibit a deterioration in performance, whereas some of the models that performed poorly on FI-2010 show an improvement on LOB-2021/2022. However, the overall performance of all models on the LOB-2021/2022 datasets is still significantly lower than on FI-2010, ranging between 48% and 61% in F1-Score. Furthermore, we conjecture that the overall performance is worse on LOB-2022 than on LOB-2021 due to the higher volatility of the stocks. We mention two potential factors contributing to this observed phenomenon. Firstly, the LOB-2021/2022 datasets present a higher level of complexity than the FI-2010 dataset, despite having been generated with a similar approach. Indeed, NASDAQ is a more efficient and liquid market than the Finnish one, as evidenced by the fact that the LOB-2021/2022 datasets are approximately three times the size of FI-2010 in terms of events for the same period length. Secondly, the best-performing models may overfit the FI-2010 dataset, leading to a decrease in their performance when applied to the LOB-2021/2022 datasets. In particular, BINCTABL experiences an average decrease of approximately \(19.6\%\) in F1-Score across all horizons, resulting in a generalizability score of \(73.5\%\).

Fig. 7: F1-Score per stock, time horizon \(k=5\), on LOB-2021

In Fig. 7, we present the results of our tests for the time horizon \(k=5\) on each individual stock of the LOB-2021 dataset. Among the tested stocks, CSCO stands out as yielding the highest performance. This may be attributed to the high stationarity of CSCO (balance 18-65-17% in the train set, see Table 3), indicating a more stable and predictable behaviour.

Fig. 8: Evaluation metrics on different horizons \({\mathcal {K}}\) on LOB-2021

Fig. 9: PR-curve on LOB-2021 for time horizon \(k=10\)

This hypothesis is supported by the confusion matrices in Fig. 4, which consistently show the best performance in the stationary class across all models; we reported only those of BINCTABL since all other models show similar patterns.

We highlight that extracting per-stock information from the FI-2010 dataset was impossible because it was released already assembled, and the authors did not provide information on that procedure. We also show the performance of the models on LOB-2021, including F1-Score, Accuracy, Precision, Recall, and MCC, in Fig. 8.

Fig. 9 shows the PR-curves for the time horizon \(k=10\) on LOB-2021. Compared to the same plot for FI-2010 in Fig. 6, the performances of all methods are closer to one another, in line with the findings reported in Fig. 8. The best-performing method is CTABL, with an area under the curve of 5379.5, which only slightly differs from the other top-performing models. On average, the integral of the curves is 5279.4. This result highlights that the models are less reliable on the LOB-2021 dataset and are more prone to misclassifying price trends.

Table 11 shows the p-values of the T-test on LOB-2021 for the horizon \(k=5\). We performed this test following the same approach used for the results shown in Table 10. Bold p-values correspond to high accordance among the models. Notice that, while on the FI-2010 dataset there are only nine pairs of models with a p-value exceeding the threshold \(\alpha =0.05\), on LOB-2021 there are as many as 41 pairs of models whose performances are not statistically different. These results confirm the findings depicted in Figs. 8 and 9.

The plots for LOB-2022 are omitted since they show similar properties; indeed, we observe that most models exhibit a similar trend in both LOB-2021 and LOB-2022 datasets. However, the performance curves in these generalizability tests differ from the results obtained on the FI-2010 dataset, shown in Fig. 5. Specifically, for the LOB-2021/2022 datasets, the F1-Score of most models shows an increasing trend as the prediction horizon increases up to \(k = 3\), after which it starts to decrease.

Table 7 F1-Score on LOB-2021 and LOB-2022. Columns FI-2010, \(\hbox {FI}^r\)-2010, LOB-2021, and LOB-2022 represent, respectively, the performance claimed in the original papers and the performance reproduced with LOBCAST on FI-2010, LOB-2021, and LOB-2022

To ease readability, in Table 7 we report the F1-Score of all the models, horizons, and periods.

The performance reported by the authors of the selected papers changes when the models are evaluated on the LOB-2021 and LOB-2022 datasets, with varying degrees of generalizability among the models.

Notably, the ATNBoF model demonstrates the most substantial improvement with respect to the declared performances, showing an average increase of 12.2% across all prediction horizons. A similar improvement is exhibited by MLP and TLONBoF.

Despite this improvement, ATNBoF still exhibits the lowest overall performance, with an average score of 53.1%. It is worth mentioning that ATNBoF is the most sensitive to random initialization.

In contrast, the other models experience a significant decline in performance when evaluated on the LOB-2021 and LOB-2022 datasets. For example, the previously best-performing model on the FI-2010 dataset, BINCTABL, shows an average decrease in F1-Score of approximately 19.6% across all prediction horizons. This decline results in a generalizability score of 73.5% (as reported in Table 6). Despite this decline, BINCTABL remains the top-performing model on the LOB-2021 dataset for almost all prediction horizons. On these datasets, it exhibits performance similar to the DEEPLOB and DEEPLOBATT models.

In Appendix 10.2, we show the agreement matrix of the models' predictions.

6.3.3 Ensemble method discussion

To train METALOB without falling into overfitting, we divided the test set of LOB-2021/2022 into three distinct subsets: 70% of the data for training, 15% for validating, and the remaining 15% for testing the meta-classifier. By implementing these ensemble methods, our objective was to leverage the collective intelligence of the base models and potentially achieve performance surpassing that of the individual models. Unfortunately, the ensemble models did not achieve the expected level of performance, as they failed to surpass the best individual models. A plausible explanation for this phenomenon is the relatively high degree of consensus among the systems, as evidenced by Fig. 13 and Fig. 14 in Appendix 10 of the supplementary material. Moreover, it is likely that the base models agree on cases that are easy to classify and disagree on cases that are difficult to classify.

6.4 Additional experiments: labeling, non-DL models & profit

In this section, we delve into additional experiments. In Sect. 6.4.1, we measure the impact of labeling parameters on the quality of the SPTP task. In Sect. 6.4.2, we go beyond deep learning models and explore how tree-based methods perform on the SPTP task using the same experimental setting presented in the previous section. In Sect. 6.4.3, we incorporate profit considerations through backtesting.

Fig. 10: Different labelling strategies on NFLX and SOFI stocks for \(k=5\)

6.4.1 Labelling

The experiments analyzed in Sect. 6 highlight that the models' performance does not exhibit a clear trend with respect to the prediction horizon. The labelling method is probably the cause of this phenomenon; in fact, classifying trends based on the mid-price tends to incorporate noise at the shortest horizons. This hypothesis is supported by the work of Zhang et al. (2019); specifically, they generated a dataset using an alternative labelling method that relies on the mean of the previous and next k mid-prices to identify trends. Interestingly, they show an inverse trend in performance with respect to the horizon: the best performance is achieved with the shortest horizon and deteriorates as the horizon increases. While exploring various labelling techniques is beyond the scope of this benchmark, we provide an initial investigation in this direction. Specifically, focusing on \(k = 5\) in LOB-2021, we select two stocks, NFLX and SOFI.

Based on Eqs. 1 and 2, we can define \(\theta _N\) and \(\theta _S\) as the thresholds that balance the occurrences of the classes for the stocks NFLX and SOFI, respectively. Similarly, we can define \(\theta _0\) as the threshold that balances the occurrences of the classes for the ensemble of six stocks within the dataset.
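
A simple way to obtain such balancing thresholds, sketched below under the assumption that Eq. 2 thresholds the percentage mid-price change symmetrically, is to pick \(\theta\) as the quantile of the absolute changes that leaves one third of the observations in the stable class; the helper below is illustrative and not the exact procedure used for FI-2010.

```python
import numpy as np

def balancing_threshold(pct_changes: np.ndarray) -> float:
    """Pick theta so that roughly one third of the observations satisfy
    |change| <= theta (stable); if the changes are roughly symmetric around zero,
    the remaining two thirds split evenly between up and down."""
    return float(np.quantile(np.abs(pct_changes), 1 / 3))

# theta_N and theta_S from the per-stock changes; theta_0 from all six stocks pooled
# (hypothetical variable names):
# theta_N = balancing_threshold(changes_nflx)
# theta_S = balancing_threshold(changes_sofi)
# theta_0 = balancing_threshold(np.concatenate(changes_all_stocks))
```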

Figure 10 shows the results of three different training settings: (i.) ALL (\(\theta _{\textbf{0}}\)) represents the training of the models over the ensemble of all six stocks using the threshold \(\theta _0\); (ii.) NFLX (\(\theta _{\textbf{0}}\)) (SOFI (\(\theta _{\textbf{0}}\))) represents the training of the models over the NFLX (SOFI) stock using the threshold \(\theta _0\); (iii.) NFLX (\({\theta }_N\)) (SOFI (\({\theta }_S\))) represents the training of the models over the NFLX (SOFI) stock using the threshold \(\theta _N\) (\(\theta _S\)).

In the case of SOFI, all methods, except for BINCTABL, achieve the highest performance in the ALL (\({\theta _0}\)) setting. This indicates that these models are able to extract useful signals from other stocks, reducing overfitting and improving overall performance. On the other hand, comparing the SOFI (\({\theta }_0\)) and SOFI (\({\theta }_S\)) settings does not provide significant insights. This suggests that the balancing of the three classes is not crucial for achieving higher performance. This is even more the case for NFLX in Fig. 10a, considering that the imbalance due to \({\theta }_0\) is much higher (see Table 3).

These results indicate that the labelling mechanism should be revised and made agnostic to the class balancing involved. In particular, the trend definition should not solely depend on the magnitude of the future price shift relative to the current price; other factors, such as persistence over time and volume considerations, should also be taken into account. A more comprehensive discussion of the limitations and challenges associated with the labelling mechanisms can be found in the final discussion and conclusions section.

6.4.2 Random Forest & XGBoost for SPTP

To compare the quality of the predictions of DL against non-DL models, we conducted an empirical investigation focusing on the predictive capabilities of two of the most popular non-DL models: Random Forest and XGBoost. These experiments are motivated by the results presented in some previous works (e.g., Grinsztajn et al. 2022; Shwartz-Ziv and Armon 2022), which show that some tree-based models outperform recently proposed DL models on tabular data. Employing the standard experimental setup detailed in Sect. 6 and carefully tuning hyperparameters (refer to Tables 8 and 9), our analysis revealed an F1-Score of 51% for the Random Forest model and 65% for XGBoost on the FI-2010 dataset with horizon \(k=5\). We obtained these results using class weights and the hyperparameter values in bold in the tables. We acknowledge that, for tree-based algorithms, normalization might lead to worse performance; however, as noted above, the FI-2010 dataset is released already normalized, and to guarantee a fair comparison, we decided to apply standardization to the LOB-2021/2022 dataset.
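
As a rough sketch of this non-DL setup (the actual hyperparameter grids are reported in Tables 8 and 9), the snippet below trains an XGBoost classifier with per-sample class weights; feature construction, variable names, and parameter values are illustrative assumptions rather than the tuned configuration.

```python
import xgboost as xgb
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_sample_weight

# X_train / X_test: flattened LOB observations (e.g., prices and volumes of the
# first 10 levels over a short window); y_train / y_test: down/stable/up as 0/1/2.
sample_weight = compute_sample_weight("balanced", y_train)  # class weights

model = xgb.XGBClassifier(
    objective="multi:softprob",   # three-class probability output
    n_estimators=300,
    max_depth=8,
    learning_rate=0.1,
)
model.fit(X_train, y_train, sample_weight=sample_weight)
print("F1-Score:", f1_score(y_test, model.predict(X_test), average="macro"))
```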

As illustrated in Fig. 5a and Fig. 12, our results indicate a competitive performance of these non-DL models when compared to several DL models, including ATNBoF, MLP, TLONBoF, TRANSLOB, and DAIN. However, the hypothesis that non-DL models outperform DL counterparts in this specific task does not seem to hold. While non-DL models exhibit notable performance, DL methods show a substantial advantage in predicting price trends. We recall that the F1-Score of the best-performing model on the FI-2010 with \(k=5\) was 87.7%, obtained by BINCTABL. This is in line with the results of the experiments in Nti et al. (2020). The tabular nature of LOB data, with its geometric properties, such as local dependence Sirignano (2019), and visual indicators embedded in column positions of the LOB, seems to align better with the strengths of DL models, especially those based on convolution. Furthermore, we remark that LOB data are multivariate time series, and SOTA forecasting papers in this domain are primarily dominated by deep learning models (Mahmoud and Mohammed 2021; Torres et al. 2021; Lim and Zohren 2021).

Table 8 Random Forest parameters
Table 9 XGBoost parameters
Fig. 11 Distribution of returns on five seeds

6.4.3 Profit analysis

As a final benchmark test, we conducted a trading simulation using our framework, relying on the Backtesting.py Python library.Footnote 3 As highlighted by Olorunnimbe and Viktor (2023), most of the existing literature in the SPTP field neglects backtesting, even though it is essential for evaluating the performance of algorithmic trading strategies and for potential real-world use.

We performed backtesting using the same period as the test set of the LOB-2021 dataset, i.e., from 2021-07-13 to 2021-07-15. To perform backtesting, we generated an Open High Low Close (OHLC) time series with a period of 10 events. OHLC is an aggregation technique that summarizes periods of a time series, e.g., minutes, hours, days, or a number of events (10 in this case). Each data point of the series represents four aggregates of the considered period: the Open is the first price of the period, the High is the highest price, the Low is the lowest price, and the Close is the last price.
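
A minimal sketch of this aggregation, assuming a pandas Series mid_price holding the per-event prices (an illustrative name, not part of LOBCAST):

```python
import numpy as np
import pandas as pd

# mid_price: pandas Series of per-event prices, in event order (illustrative name).
events_per_bar = 10
bar_id = np.arange(len(mid_price)) // events_per_bar   # 10 consecutive events per bar

ohlc = mid_price.groupby(bar_id).agg(
    Open="first",   # first price of the period
    High="max",     # highest price of the period
    Low="min",      # lowest price of the period
    Close="last",   # last price of the period
)
```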

We underline that the use of LOB data is most often associated with High-Frequency Trading (HFT), i.e., strategies that analyze this data in real time to make split-second decisions about trade executions. We remark that a trading action (buy/sell/hold) is taken every ten events, so at the end of the backtesting simulation, for each stock, hundreds of thousands of orders are placed and filled.

We base our trading simulation on the methodology of the seminal paper by Zhang et al. (2019) in this field, in which the authors conducted a similar experiment. We established certain parameters for our simulation. Firstly, we set the number of shares per trade to a fixed value of 1, simplifying our analysis and assuming a negligible market impact. Furthermore, our simulated trader begins with an initial capital of $10,000, and we assume no transaction fees.

The trading strategy relies on the models and operates by generating signals every 10 events to predict subsequent price movements. These signals, categorized as up, stationary, or down, determine the trading action. When the signal is up, the simulated trader places a buy order. Conversely, if the signal is down and the trader currently holds a long position, the trader places a sell order. In cases where the signal is stationary, the trader takes no action. The orders are filled at the next open price.
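
The following is a minimal sketch of such a signal-driven strategy in Backtesting.py, not the exact LOBCAST backtesting code; ohlc_with_signal is assumed to be the OHLC series sketched above extended with a hypothetical Signal column holding the model prediction (0 = down, 1 = stationary, 2 = up) for each 10-event bar.

```python
from backtesting import Backtest, Strategy

class SignalStrategy(Strategy):
    """Up -> buy one share, down -> close a long position, stationary -> hold.
    Orders placed in next() are filled at the next bar's open price."""

    def init(self):
        pass  # model predictions are pre-computed and stored in the data

    def next(self):
        signal = self.data.Signal[-1]
        if signal == 2:                      # up: place a buy order for one share
            self.buy(size=1)
        elif signal == 0 and self.position:  # down while long: close the position
            self.position.close()
        # stationary: do nothing

bt = Backtest(ohlc_with_signal, SignalStrategy,
              cash=10_000, commission=0.0)   # $10,000 initial capital, no fees
stats = bt.run()
```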

The results of the trading simulation for each stock are presented in Fig. 11. The strongest correlation observed is between the daily returns of the stocks, as shown in Table 3, and the returns of the strategy described above. In fact, the two stocks with the highest positive daily returns (namely LSTR and NFLX) are the only ones for which the strategy is profitable. On the other hand, the two stocks with the highest negative daily returns (SOFI and SHLS) are the ones for which most models show a negative return. Another correlation, albeit less strong, is between the volatility of the stocks and the return of the models. Specifically, lower volatility is associated with higher model returns.

We recognize the limitations of this simulation. For instance, we do not perform portfolio optimization or position sizing, we assume that trades are executed at the mid-price, and we ignore transaction costs; however, a realistic and sophisticated algorithmic trading simulation is beyond the scope of this study and remains an interesting aspect for future research.

7 Discussion and conclusions

Our findings highlight that price trend predictors based on DNNs using LOB data are not consistently reliable as they often exhibit non-robust and non-generalizable performance. Our experiments demonstrate that the existing models are very susceptible to hyperparameter selection, randomization, and experimental context (stocks, volatility, market, historical period). In addition, the experimental setup fails to capture the intricacies of the real-world scenario. This lack of robustness and generalizability makes them inadequate for practical applications in real-world settings.

7.1 Models

Our results lead to a crucial observation: on the LOBSTER dataset, SOTA DL models for LOB data exhibit low generalizability. We suggest that this phenomenon is due to two factors: the higher complexity of the LOBSTER dataset compared to the FI-2010 dataset and the overfitting of the best-performing models to the FI-2010 dataset, which lowers their performance on the LOBSTER dataset. In fact, in the original papers, all the considered models (except for DEEPLOB, DEEPLOBATT and DLA) were trained, validated, and tested only on FI-2010, which is a smaller dataset with less frequent and less voluminous orders than those contained in LOBSTER-derived datasets. Our conjecture is supported by insightful findings reported in the existing literature: several works (e.g., Orimoloye et al. 2020; Ruff et al. 2021; Najafabadi et al. 2015) have shown that while some datasets exhibit simple patterns that can be effectively captured by a shallow model, others may require deeper architectures to model complex relationships.

Another key finding of this study is that the top models with the highest performance on both datasets employ attention mechanisms. This suggests that the attention technique enhances the extraction of informative features and the discovery of patterns in LOB data. However, in general, it appears that current models cannot cope with the complexity of financial forecasting with LOB data.

7.2 Dataset

Financial trends can be influenced by both local and international political events; indeed, political actions and decisions can significantly impact economic conditions, market sentiment, and investor confidence Engle et al. (2013). These factors are not captured by LOB data alone. For this reason, we believe that price predictors may benefit from integrating LOB data with additional information, for example, sentiment analysis relying on social media and press data, which represent an easily accessible source of exogenous factors impacting the market Ren et al. (2018). This is particularly true for mid- and long-term price trend prediction, whereas it might not hold for HFT strategies Bouchaud et al. (2018). We remark that micro- and macroscopic market trends are fundamentally different, and the microscopic behaviour of the market is largely driven by HFT algorithms, making it almost exclusively dependent on financial movements rather than external factors. In this scenario, granular and raw LOBs may suffice to provide data for price trend prediction. Several works have used sentiment analysis for price trend prediction. The work in Jin et al. (2020) combines comments made by investors on the online platform StockTwits with the Apple stock price time series and uses an LSTM for stock closing price prediction. By creating a dataset composed of tweets and historical stock prices, the work in Xu and Cohen (2018) proposes a new architecture for stock price prediction. Similar approaches are used in Li et al. (2014), Nguyen et al. (2015), where the authors show that the performance of their predictors improves when their dataset is enriched with sentiment data. Despite the wide research in this field, to the best of our knowledge, no one has ever used LOB data together with sentiment analysis for price trend prediction. The existing literature suggests that this could be a valuable research direction.

Another weakness in dataset generation is the potential for training, validation, and test splits to have dissimilar distributions, owing to the distinct characteristics of the historical periods covered by the stock time series. This can negatively affect the model’s ability to generalize effectively and make reliable predictions on unseen data. A final limitation regarding the dataset is the representation of the limit order book, which has been shown to be sensitive to permutations by Wu et al. (2021). In Wu et al. (2022), the same authors proposed robust alternative representations.

7.3 Labelling

As we discussed in Sections 2, 5 and 6.3, the choice of the threshold for class definition in Eq. 2 plays a crucial role in determining the trend associated with a market observation. We believe that current solutions present room for improvement. As discussed in Sect. 5, in FI-2010, the parameter \(\theta\) was chosen to obtain a dataset with balanced classes for the horizon \(k=5\) (which is the mean value of the considered interval in the set \({\mathcal {K}}\)). Thus, \(\theta\) is not chosen in accordance with its financial implication but rather serves the purpose of improving model performance. We recall that the dataset is made of different stocks. With such a labelling system, with a fixed \(\theta\), stocks with low volatility become associated with stable trends, as their behaviour is overshadowed by stocks exhibiting higher volatility. Good practices that could be investigated are to use a weighted look-behind moving average, instead of the mid-prices of Eq. 2, to absorb data noise, or to define a dynamically adapting \(\theta\) that accounts for the changing trends of a stock’s mid-price. Moreover, the labelling approach of Eq. 2, used by all surveyed models, fails to leverage important aspects available in LOB data, so another possible improvement is the definition and use of other insightful features that can be extrapolated from the LOB in addition to the mid-price. Such values could encapsulate other peculiar and informative features, such as a stock’s spread and volumes, which directly influence volatility and thus returns.
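
As a hedged illustration of the two practices mentioned above (neither is implemented in this benchmark), the sketch below smooths the mid-price with an exponentially weighted look-behind average and lets \(\theta\) track the recent volatility of the percentage changes; the window sizes and the volatility multiplier are arbitrary assumptions.

```python
import pandas as pd

def smoothed_mid(mid_price: pd.Series, span: int = 10) -> pd.Series:
    """Weighted look-behind moving average that absorbs noise in the raw mid-price."""
    return mid_price.ewm(span=span, adjust=False).mean()

def adaptive_theta(pct_changes: pd.Series, window: int = 1000, mult: float = 0.5) -> pd.Series:
    """Per-observation threshold following the recent volatility of the changes, so
    low- and high-volatility stocks are not labelled with the same fixed theta."""
    return mult * pct_changes.rolling(window, min_periods=window).std()
```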

7.4 Profit

In the context of stock prediction tasks, it is of utmost importance to go beyond standard statistical performance metrics such as accuracy and F1-Score and incorporate trading simulations to assess the practical value of algorithms. SPTP predictors’ ultimate measure of success lies in their ability to generate profits under real market conditions. It is essential to conduct trading simulations using real simulators that go beyond testing on historical data. Recent progress has been made in the context of reactive simulators (Coletta et al. 2022a, b; Mizuta 2016; Shi and Cartlidge 2023).

7.5 Limitations and risks

We acknowledge that our study is subject to some limitations, which should be considered when interpreting our findings. First, we conducted a grid hyperparameter search for the models whose original papers did not specify the hyperparameters. Since this search is not exhaustive, the best hyperparameters we selected could potentially undermine the quality of the original systems. Secondly, due to computational resource limitations, we could not train the benchmarked models on LOB datasets spanning longer periods, e.g., years rather than weeks. We recognize that doing so could have led to different results.

We also highlight that using DL or, more generally, AI models for solving SPTP and exploiting them for trading carries a number of risks. Some of them are inherently technical. This is the case for data biases, that is, incomplete or unrepresentative data, which can cause predictive algorithms to favor groups that are better represented in the training data Boukherouaa et al. (2021). Lack of explainability is another risk that could expose organizations to vulnerabilities (such as biased data, unsuitable modeling techniques, or incorrect decision-making) and potentially undermine trust in their robustness Silberg and Manyika (2019). AI models for trading are also vulnerable to cyber-attacks: malicious users can exploit AI model vulnerabilities to evade detection and prompt the models to make wrong decisions, or to extract information by manipulating data at some stage of the model lifecycle Comiter (2019). Other risks have an ethical nature and can impact financial stability. One of them is inequality among investors. As training and predicting are expensive in terms of hardware equipment and energy, and because of the challenges in model interpretation and prediction, AI trading can lead to a concentration of information among those who can afford the required technology, exacerbating income inequality and asymmetry in the market, with an uncertain impact on financial stability (Robledo Costales 2023; Boukherouaa et al. 2021). Because of the limited regulation of AI trading systems, there is a lack of transparency, making it difficult to detect possible unfair strategies Robledo Costales (2023). Regulators in charge of the governing legal framework should welcome the advancements of AI in finance and undertake the necessary preparations to harness its potential advantages and address its associated risks.

8 Disclaimer

This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JPMorgan Chase & Co. and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.