1 Introduction

From the start of the 20th century, the financial sector has made consistent investments in researching price prediction and market dynamics models [1]. For the purpose of forecasting stock price trends, conventional quantitative approaches rely on historical time series price data [2]. In recent years, leveraging models to analyze financial time series has become essential for managing market risks and making informed investment decisions [3].

An escalating quantity of progressively advanced models is being introduced in research works to address the inherent intricacies of time series data within specific domains. Notably, stock market data is quite complex, as it is characterized by a multi-dimensional, volatile, and dynamically evolving nature. Furthermore, stock market data displays interconnections with various external factors, such as macroeconomic events and news disseminated by media sources. Consequently, appropriately integrating these factors is of paramount importance when developing predictive models that yield satisfactory levels of accuracy.

Autoregressive models [4,5,6,7,8,9,10] are proficient in addressing prediction tasks that take into account temporal auto-correlation, but fall short in adequately accounting for the multivariate nature of the data and the intricacies of nonlinear feature interactions. On the other hand, machine learning and deep learning models tailored for temporal data, exemplified by long short-term memory models and their various adaptations [11,12,13,14,15,16,17,18,19,20,21], are capable of mitigating these limitations. However, their utility is typically confined to the analysis of a single data source, thereby diminishing their effectiveness in highly volatile and challenging-to-predict domains. Therefore, it is advisable to explore ensemble-based and hybrid combination methods [22,23,24,25,26,27,28,29,30,31,32] as they hold greater promise compared to other approaches. These methods can encompass multiple data sources and harness the diversity of multiple predictors, offering a more comprehensive solution to address the challenges of data analysis in such complex and dynamic domains.

A significant drawback of many works in the literature is that they are limited to the analysis of a single source of data (or modality). Some studies highlighted that the analysis of financial news in addition to stock prices may play a key role in stock market prediction [33, 34]. For this reason, researchers are focusing on devising new and more sophisticated ways to integrate different relevant sources of data that may impact stock prices, resulting in more accurate models.

However, several methods that consider multiple sources of data address them in isolation, relying on simple combination strategies to perform a joint analysis. Moreover, the specialized terminology and scarcity of labeled data in the financial industry exacerbate the difficulty of accurately performing sentiment analysis, making general-purpose text-based models insufficient. To this end, language models fine-tuned on financial textual data provide new and exciting opportunities for the integration of accurate textual analysis in stock market prediction models [35].

In this paper, we propose a multimodal deep learning model for stock market trend prediction that consists of two branches: a FinBERT branch, which specializes in the analysis of the textual content of financial news and accurately models market sentiment, and an LSTM branch, which captures temporal market dynamics in complex multivariate data, including stock prices and technical indicators. Our deep fusion approach allows us to effectively leverage multiple modalities, leading to improved generalizability, reduced bias, and increased efficiency compared to single-modality approaches.

The main contributions of our paper can be summarized as follows:

  • We propose a deep fusion model architecture for stock market trend prediction that seamlessly considers and integrates multiple modalities (stock prices, technical indicators, news headlines) in a joint feature space with multiple specialized branches, empowering the model with a more comprehensive understanding of patterns and complex nonlinear relationships in stock market dynamics that leads to the extraction of more robust and trustworthy next-day trend predictions;

  • We devise an end-to-end optimization and hyperparameter tuning workflow which allows us to identify and select highly effective configurations for each branch of the fusion model, resulting in a competitive stock prediction performance tailored to the characteristics of a specific stock under analysis;

  • We perform an extensive evaluation of 12 real-world stocks from different sectors in two different evaluation periods and with different market conditions (uptrend, downtrend). This evaluation encompasses two analytical perspectives: model accuracy and portfolio performance in a realistic simulation, where model predictions are leveraged for practical automated trading decisions. Our experimental results show that our approach can outperform state-of-the-art methods for stock market prediction.

The paper is structured as follows. Section 2 summarizes relevant works for stock market prediction. Section 3 describes our proposed method in detail. Section 4 discusses our experimental settings and the results obtained in our experiments. Section 5 wraps up the paper and discusses relevant directions for future work.

2 Related work

In this section, we review relevant works pertaining to time series prediction and forecasting, with particular focus on stock market analysis.

2.1 Autoregressive models

Autoregressive models are recognized for their ability to characterize associations across multiple time steps for a target feature through the learning of coefficients. One prominent autoregressive technique is the autoregressive integrated moving average (ARIMA) model [4]. Renowned for its efficacy in short-term prediction tasks, ARIMA forecasts the future value of a variable by linearly combining past values and errors, following the application of differencing operations to render the time series stationary. The methodology outlined in [5], for instance, employs an ARIMA model to predict coronavirus cases using Johns Hopkins epidemiological data. Prophet [6] stands out as another prevalent autoregressive forecasting method, grounded in an additive model that accommodates nonlinear trends through yearly, weekly, and daily variations. This approach also addresses seasonality and holiday effects, demonstrating robustness to outliers and missing data. Prophet aims at enhanced configurability and user-friendliness in comparison with ARIMA. Vector autoregression (VAR) [7] is another noteworthy autoregressive approach that extends beyond predictive tasks for single variables. This method concurrently learns coefficients for multiple variables, considering their temporal correlations. A noteworthy investigation in [8] highlights its effectiveness in forecasting tasks embedded within a spatiotemporal context. In the realm of stock market applications, recent research [9] introduces a moving average heterogeneous autoregressive (MAT-HAR) model, treating thresholds as a moving average-generated, time-varying parameter. This model is employed to forecast the monthly realized volatility of the US stock market. Another study [10] applies univariate ARIMA models to the Amman Stock Exchange. Despite their effectiveness in numerous applications, autoregressive models exhibit certain limitations. Besides being limited to the analysis of single variables or modalities, their simplicity makes them incapable of capturing nonlinear relationships between multiple variables, which are frequently encountered in real-world multivariate data.

2.2 Machine learning and deep learning models

Machine learning and deep learning models tailored for temporal data, such as long short-term memory (LSTM) [36] models and their variations, constitute an advancement over autoregressive models due to their capacity to effectively analyze multivariate data and handle nonlinear feature interactions. In the research by [12], recurrent neural network (RNN) models featuring long short-term memory units are proposed to predict pollutant particle levels at multiple time horizons. In the domain of stock market analysis, [13] employs an LSTM model to predict the next-day closing price of the S&P 500 index, utilizing nine predictors selected from fundamental market data, macroeconomic data, and technical indicators. The authors in [14] introduce an LSTM-based model architecture for forecasting air leaks, assessing its potential within the healthcare sector. The work in [15] devises a tensor decomposition approach for feature extraction, where predictive clustering trees are used for forecasting, and their performance is compared to LSTM models. A comparative investigation of deep neural networks with LSTM networks for stock market analysis is presented by [16], focusing on daily and weekly movements of the Indian BSE Sensex index. Another work by [11] empirically analyzed LSTM networks leveraging a diverse set of real-world datasets, and identified that such models are quite effective in predicting stock market prices. The study in [17] conducts a comparative analysis involving LSTM, gated recurrent unit (GRU), and drop-GRU models in the context of power consumption forecasting, demonstrating the satisfactory performance of the devised models in this application. Combinations of GRU and convolutional neural networks (CNN) have also been explored. For instance, the GRU-CNN model proposed in [37] has been shown to be effective for stock market prediction. A decision support system reinforced with LSTM for swing trading is proposed in [18], where predictions and reports that incorporate forecasted values of company stock for the next 30 days are extracted, alongside technical indicators. In the research by [19], bidirectional and stacked LSTM predictive models are benchmarked against shallow neural networks and simplified forms of LSTM networks, with analyses conducted on publicly available stock market data.

The work in [20] demonstrates that LSTM networks combined with bidirectional gated recurrent units (BiGRU) can accurately predict the closing price of the stock market, offering a more competitive performance than simpler models. In [21], a bidirectional LSTM model (Bi-LSTM), proposed for the first time for speech recognition tasks [38], is adopted and optimized by particle swarm optimization (PSO), giving rise to a PSO-Bi-LSTM approach to predict useful long-, mid-, and short-term investment strategies. A CNN-LSTM model complemented by an attention mechanism was proposed in [39]. Dilated convolutions have been explored in [40] and have shown great success in extracting multi-scale patterns at different time granularities. A common limitation of these approaches is their confinement to the analysis of a single data type or modality, which constrains their effectiveness in the presence of highly volatile phenomena that depend on multiple factors.

2.3 Ensemble-based and hybrid models

Ensemble-based and combination methodologies involving hybrid models offer a robust approach to address these intricacies, encompassing the utilization of multiple data sources and the amalgamation of various predictors. An AI platform, as proposed by [22], leverages four machine learning ensemble methods, namely neural network regression ensemble, support vector regression ensemble, boosted regression trees, and random forest. The best ensemble method for a given stock is selected through a cross-validation evaluation. In [23], a fusion network is proposed to extract text and numerical information for stock price prediction, with the addition of an attention mechanism to improve the overall model performance.

A stacking ensemble approach for predicting stock closing prices is proposed in [24], where a competitive performance is obtained when contrasted with conventional machine learning ensemble models such as random forest, AdaBoost, and gradient boosting machines. A stacking approach was also explored in [41] with joint consideration of news headlines, multivariate time series data, and multiple base models as predictors. Authors in the work by [25] propose a hybrid forecasting model for stock prices that integrates various deep learning models, specifically, CNN-LSTM [42], GRU-CNN [37], and ensemble models. The work in [26] introduces a hybrid model denoted as PCA-EMD-LSTM, which combines principal component analysis, empirical mode decomposition, and LSTM for predicting stock market trends in Thailand. The hybrid model proposed in [27] utilizes decomposition techniques, multi-factor analysis, and attention-based LSTM to forecast stock market price trends in four major Asian countries. In [28], a hybrid method for analyzing stock markets is introduced, which combines an autoencoder-based feature extraction network with a temporal convolutional model architecture and a temporal clustering optimization algorithm utilizing the KL (Kullback–Leibler) divergence. The approach in [43] employs a CNN model to perform sentiment classification and integrates it with an LSTM analyzing technical indicators, showing that the joint consideration of both aspects leads to improved predictions. A deep learning approach is proposed in [29], where future stock prices are predicted by a blending ensemble learning model that combines two recurrent neural networks followed by a fully connected neural network. The authors in [30] conduct an analysis of the collective sentiment’s significance on popular S&P 500 stocks and assess its efficacy in investment decision-making. A study in [31] presents a framework based on LSTM and convolutional neural networks to predict the closing prices of Tesla and Apple, utilizing historical data collected over the past two years. Two stock trading decision methods have been applied in [32]: nested reinforcement learning (Nested RL) using three deep reinforcement learning models, and a weighted random selection with confidence (WRSC) strategy. The results show that their approach outperforms baselines, enhancing portfolio management for higher profits at the same risk level.

3 Method

In this section, we describe our method in detail, focusing on its workflow. The proposed method involves a multimodal deep learning approach that combines information from historical stock prices and statistical indicators with news headlines. The model employs long short-term memory (LSTM) networks and bidirectional encoder representations from transformers (BERT) models to capture both quantitative and qualitative information. For the text component, we leverage FinBERT to conduct sentiment analysis. A fusion layer combines the two modalities and yields the final next-day stock trend prediction. A graphical representation of the method is shown in Fig. 1.

3.1 Data gathering, preprocessing, and fusion

Our method aims to comprehensively analyze stock prices, statistical trading indicators, and news headlines. To achieve this goal, we obtain data from two Application Programming Interfaces (APIs). The initial API, referred to as Yahooquery,Footnote 1 is employed for retrieving historical stock prices, serving as an unofficial substitute for the obsolete Yahoo Finance API. We input stock tickers in string format, and the API provides all historical data available for a specified stock within a given date range. The data obtained from the API comprises the opening price, high and low values, adjusted closing prices, and the daily observed volume. Subsequently, we employ the TA-lib Python libraryFootnote 2 to incorporate computed statistical indicators into the extracted stock data. These statistical indicators, widely used in technical analysis, encompass the exponential moving average (12-day, 26-day), moving average convergence/divergence (MACD), parabolic SAR, Bollinger bands (upper band, middle band, and lower band), and stochastic (Slow k, Slow d). Before model training, all numerical data undergoes min-max normalization. In our study, we do not perform feature selection to identify a subset of suitable features for each stock. Although some features may be more relevant than others for a given stock under analysis, the adopted deep learning architecture should, in principle, automatically learn feature influence via gradient descent optimization. Specifically, weak features will be characterized by small weights with a vanishing effect in the deeper layers of the network and a tendency to be discarded for prediction. In contrast, relevant features will lead to strong/high activation values that influence the prediction significantly.
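
To make this step concrete, the following sketch (assuming the yahooquery, TA-Lib, and scikit-learn Python packages) retrieves daily price data for one ticker, computes the technical indicators listed above, and applies min-max normalization. The ticker and date range are illustrative and are not the exact retrieval settings used in our experiments.

```python
# Hedged sketch of the data gathering and preprocessing step (illustrative settings).
import talib
from sklearn.preprocessing import MinMaxScaler
from yahooquery import Ticker

# Daily OHLCV history for an example ticker and date range.
df = Ticker("AAPL").history(start="2019-01-01", end="2022-09-20").reset_index()

close, high, low = df["close"].values, df["high"].values, df["low"].values

# Technical indicators used as additional features (TA-Lib default periods unless noted).
df["ema_12"] = talib.EMA(close, timeperiod=12)
df["ema_26"] = talib.EMA(close, timeperiod=26)
df["macd"], _, _ = talib.MACD(close)                            # MACD line
df["sar"] = talib.SAR(high, low)                                # parabolic SAR
df["bb_up"], df["bb_mid"], df["bb_low"] = talib.BBANDS(close)   # Bollinger bands
df["slow_k"], df["slow_d"] = talib.STOCH(high, low, close)      # stochastic oscillator

# Drop warm-up rows with undefined indicators and min-max normalize numerical features.
# Column names may differ slightly across yahooquery versions.
df = df.dropna()
feature_cols = [c for c in df.columns if c not in ("symbol", "date")]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
```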

As a secondary API, we utilize the end-of-day historical financial data (EODHD) API to fetch news headlines based on a given stock ticker. For instance, a query with the “aapl” ticker as input yields data on news articles, including posting time, titles, article content, URL links, as well as tagged symbols and tickers. To maintain focus on reliable news sources, we exclusively retain articles sourced from Yahoo Finance, discarding those from other origins.
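
A hedged sketch of the news retrieval and filtering step is given below; the endpoint constant, query parameter names, and the reliance on the article link to identify Yahoo Finance as the source are assumptions, since the exact EODHD request format is specified by the provider's documentation rather than in this paper.

```python
# Hypothetical sketch of fetching news headlines for a ticker from the EODHD API
# and retaining only Yahoo Finance articles, as described above.
import requests

NEWS_ENDPOINT = "https://eodhd.com/api/news"  # placeholder; check the EODHD documentation


def fetch_yahoo_finance_headlines(ticker, api_token, date_from, date_to):
    params = {
        "s": ticker,              # e.g., "AAPL.US" (format assumed)
        "from": date_from,        # e.g., "2021-07-01"
        "to": date_to,            # e.g., "2021-12-31"
        "api_token": api_token,
        "fmt": "json",
    }
    articles = requests.get(NEWS_ENDPOINT, params=params, timeout=30).json()
    # Keep only articles whose link points to Yahoo Finance (source filter assumption).
    return [a for a in articles if "finance.yahoo.com" in a.get("link", "")]
```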

3.2 Long short-term memory (LSTM)

Long short-term memory (LSTM) neural networks represent a category of recurrent neural networks (RNN) extensively applied in the analysis of time series data, owing to their ability to capture long-term dependencies within sequential data [44]. The utility of LSTM models lies in their capacity to discern and forecast patterns in time series data, which makes them valuable for predictive tasks. LSTMs address the challenge of vanishing and exploding gradients encountered in traditional RNNs [45, 46] by introducing memory cells to replace recurrent nodes. A distinguishing feature of a memory cell is its internal state, facilitating the flow of gradients across multiple time steps without vanishing or exploding [44].

Each memory cell comprises multiple nodes referred to as gates. The data from the current time step, together with the hidden state from the preceding time step, is fed into these LSTM gates. Subsequently, three fully connected layers compute the values associated with the input, forget, and output gates. A sigmoid activation function is applied to these values to yield the final output, constrained within a (0, 1) range.

An input node undergoes computation through a tanh activation function. In essence, the gates modulate the significance of the information passed to the model at distinct time steps. The input gate gauges the proportion of the input node’s value to be added to the current internal state of the cell. The forget gate determines whether the prevailing value of the cell should be retained or discarded. Finally, the output gate decides if the memory cell should contribute to the output of the ongoing time step.

Assuming the presence of d inputs, h hidden units, and a batch size of n, the input is defined as \(\textbf{X}_t \in \mathbb {R}^{n \times d} \), and the hidden state of the previous time step is defined as \(\textbf{H}_{t-1} \in \mathbb {R}^{n \times h} \). The gates at time step t are defined as follows: the input gate is \(\textbf{I}_t \in \mathbb {R}^{n \times h}\), the forget gate is \(\textbf{F}_t \in \mathbb {R}^{n \times h}\), and the output gate is \(\textbf{O}_t \in \mathbb {R}^{n \times h}\). Formally, they are calculated as:

$$\begin{aligned} \textbf{I}_t&= \sigma (\textbf{X}_t \textbf{W}_{xi} + \textbf{H}_{t-1} \textbf{W}_{hi} + \textbf{b}_i), \end{aligned}$$
(1)
$$\begin{aligned} \textbf{F}_t&= \sigma (\textbf{X}_t \textbf{W}_{xf} + \textbf{H}_{t-1} \textbf{W}_{hf} + \textbf{b}_f),\end{aligned}$$
(2)
$$\begin{aligned} \textbf{O}_t&= \sigma (\textbf{X}_t \textbf{W}_{xo} + \textbf{H}_{t-1} \textbf{W}_{ho} + \textbf{b}_o), \end{aligned}$$
(3)

where \(\textbf{W}_{xi}, \textbf{W}_{xf}, \textbf{W}_{xo} \in \mathbb {R}^{d \times h}\) and \(\textbf{W}_{hi}, \textbf{W}_{hf}, \textbf{W}_{ho} \in \mathbb {R}^{h \times h}\) are weight parameters, and \(\textbf{b}_i, \textbf{b}_f, \textbf{b}_o \in \mathbb {R}^{1 \times h}\) are bias parameters.
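
The gate computations of Eqs. (1)–(3) can be transcribed directly into code; the minimal NumPy sketch below assumes \(\textbf{X}_t\) of shape (n, d), \(\textbf{H}_{t-1}\) of shape (n, h), and weight and bias shapes as defined above.

```python
# Direct NumPy transcription of Eqs. (1)-(3): the input, forget, and output gates.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_gates(X_t, H_prev, W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o):
    I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)  # input gate, Eq. (1)
    F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)  # forget gate, Eq. (2)
    O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)  # output gate, Eq. (3)
    return I_t, F_t, O_t
```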

The incorporation of LSTM cells equips the model with the ability to address intricate temporal patterns in multivariate data, enabling the capture of nonlinear and enduring relationships among various features and timestamps. This capability is leveraged to allow the model to discern resilient patterns within historical data, encompassing statistical indicators, and facilitate the extraction of relationships between stock prices and other descriptive features.

3.3 FinBERT

BERT (bidirectional encoder representations from transformers) is a complex deep neural network model for natural language processing (NLP). BERT achieved state-of-the-art results in various NLP tasks such as text classification, question answering, and named entity recognition. BERT uses a transformer architecture that allows capturing long-range dependencies and context in text data, making it highly effective for tasks involving understanding and processing human language. The high accuracy documented in several research works supports the adoption of BERT as a versatile model for many different NLP tasks [47]. Among them, BERT is often used to extract contextual embedding vectors from text, which can be adopted for subsequent downstream tasks. However, the performance of the model is strictly related to the pertinence of the dataset used to train the model. While using pre-trained general-purpose language models may be a practical solution to avoid expensive training costs, it may result in a poor representation of topic-specific textual content [48]. To overcome this limitation, we leverage FinBERT [35], a language model specialized for financial data analysis, which obtained the highest scores on FiQA sentiment scoring and Financial PhraseBank benchmarks, outperforming other popular large language models including GPT-4 [49].

The model architecture consists of multiple stacked transformer layers, which allow the model to capture complex contextual representations. Each layer features a self-attention mechanism, which computes the weighted sum of values (V) based on queries (Q) and keys (K):

$$\begin{aligned} \text {Attention}(Q, K, V) = \text {softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) V \end{aligned}$$
(4)

The model adopts multiple attention heads, which can be formalized as:

$$\begin{aligned} \text {MultiHead}(Q, K, V) = \text {Concat}(\text {head}_1, \text {head}_2, \ldots , \text {head}_h) W^O, \end{aligned}$$
(5)

where \(\text {head}_i = \text {Attention}(QW_i^Q, KW_i^K, VW_i^V)\).

The output of each transformer can be computed as:

$$\begin{aligned} \text {LayerOutput} = \text {LayerNorm}(x + \text {MultiHead}(x) + \text {FFN}(x)), \end{aligned}$$
(6)

where FFN is a simple feed-forward neural network, and \({LayerNorm}(x) = \frac{x - \mu }{\sigma }\) is the layer normalization, with \(\mu \) and \(\sigma \) being the mean and standard deviation, respectively.
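
As a minimal illustration of Eq. (4), the scaled dot-product attention computed inside each head can be written in a few lines of NumPy (single example, no masking or batching).

```python
# Scaled dot-product attention, Eq. (4): softmax(Q K^T / sqrt(d_k)) V.
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len_q, len_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values
```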

To prevent catastrophic forgetting, FinBERT applies three state-of-the-art techniques: slanted triangular learning rates, discriminative fine-tuning, and gradual unfreezing.

FinBERT takes an initial BERT model trained on BookCorpus and Wikipedia, and fine-tunes it on the TRC2-financial corpus, a subset of Reuters’ TRC2, which consists of 1.8M news articles published by Reuters between 2008 and 2010. Subsequently, FinBERT is fine-tuned on the Financial PhraseBank corpus, which consists of 4845 English sentences from financial news found on the LexisNexis database, annotated by 16 people with a background in finance and business [50].

FinBERT extracts sentiment scores for all news headlines gathered for a specific stock on a given day. It returns a positive, neutral, and negative score for each news headline. For textual data, we remove stopwords, punctuation marks, and square brackets, and convert text to lowercase, in order to reduce noise and focus on meaningful words. Initially, we obtain a summary of the day consisting of two values: the sum of positive scores and the sum of negative scores. Subsequently, the largest of the two scores determines if the day is overall positive or negative. Based on this information, we select the most representative news headline, i.e., the one with the largest positive or negative score, and extract its embedding vector representation.Footnote 3
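
A possible implementation of this daily selection step is sketched below with the Hugging Face transformers library; the ProsusAI/finbert checkpoint identifier and the lowercase label names are assumptions, since the paper does not prescribe a specific FinBERT distribution.

```python
# Hedged sketch: score each headline with a pre-trained FinBERT checkpoint and
# select the most representative headline of the day, as described above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "ProsusAI/finbert"  # assumed publicly available FinBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)


def score_headlines(headlines):
    inputs = tokenizer(headlines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    labels = [model.config.id2label[i] for i in range(probs.shape[1])]
    return [dict(zip(labels, p.tolist())) for p in probs]


def select_representative_headline(headlines):
    scores = score_headlines(headlines)
    pos_sum = sum(s.get("positive", 0.0) for s in scores)
    neg_sum = sum(s.get("negative", 0.0) for s in scores)
    dominant = "positive" if pos_sum >= neg_sum else "negative"
    best = max(range(len(headlines)), key=lambda i: scores[i].get(dominant, 0.0))
    return headlines[best]
```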

We note that sentiment scores extracted by FinBERT are used exclusively for news selection. Separately, we fine-tune FinBERT with the financial news in our dataset, and we replace the output layer with a single-unit dense layer (to predict uptrend/downtrend directly) and optimize it considering different hyperparameter configurations (see Table 2). The best-performing configuration is identified based on accuracy using a validation set.

Afterward, in the proposed model architecture, we remove the classification layer used during optimization and exploit the embedding vector representation of the news for subsequent fusion.

3.4 Multimodal fusion

Our novel multimodal fusion approach is tailored for next-day stock market trend prediction. This branch of the model is responsible for fusing the two data modalities: time series and text. In more detail, the model incorporates time series data processed through LSTM (long short-term memory), a type of recurrent neural network renowned for its effectiveness in handling sequential data, and text embeddings processed through FinBERT, a model specifically tailored for the analysis of financial text data. The primary objective of this approach is to enhance prediction accuracy and robustness by fusing information from different data sources or modalities.

The structural layout of the model is visually depicted in Fig. 1, illustrating how the two data modalities are seamlessly integrated. This model leverages multimodal learning, which empowers the model with a more comprehensive understanding of the underlying patterns and relationships within the data. This, in turn, can result in an improved ability to withstand unexpected market fluctuations and enhance prediction resilience [51]. This observation is substantiated by prior research conducted across various applications.

Fig. 1 Overview of the proposed multimodal fusion model for stock market prediction

The fusion of these data modalities is achieved through a specific process involving a concatenation layer and a series of dense layers.

The temporal granularity of data processed by the different model branches is aligned. For each day, the LSTM model processes a single multivariate data instance containing stock prices and technical indicators. Likewise, the FinBERT model leverages the most representative news headline of the day (as explained in the previous subsection). Since the downstream task of interest is next-day stock trend prediction, a daily time granularity is appropriate, and it allows us to train models efficiently considering a large time frame. Both model branches generate a vector embedding which is subsequently provided to the concatenation layer and results in the vertical concatenation of the two vector embeddings.

In more detail, for FinBERT, the preprocessing phase involves tokenizing the input text for BERT, adding special tokens like [CLS] and [SEP]. The embeddings generated by BERT offer contextual representations for each token, capturing nuanced contextual relationships. These BERT embeddings are then merged with the LSTM embeddings by concatenating them with the input sequences, allowing the multimodal model to leverage the rich contextual information from both BERT and LSTM. Notably, the configuration of these layers is customized to suit the dataset’s characteristics and the particular prediction task at hand. This strategy for model architecture optimization was proven to be beneficial in [52]. Details of the architecture’s optimization and tuning are shown in Table 3, which contains information on the various hyperparameters and configurations considered in the optimization process. We conduct a tuning process leveraging AAPL, TSLA, and MSFT stocks to identify an effective model architecture configuration (layers, number of neurons, etc.).
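
A minimal Keras sketch of the resulting fusion architecture is given below; the input dimensions and layer sizes are illustrative placeholders (loosely inspired by the configurations discussed in Section 4.1) rather than the exact tuned values reported in Tables 1, 2, and 3.

```python
# Hedged sketch of the multimodal fusion architecture: an LSTM branch over the
# numerical window, a dense branch over the FinBERT headline embedding, a
# concatenation layer, dense fusion layers, and a sigmoid next-day trend output.
from tensorflow.keras import Model, layers

TIME_STEPS, N_FEATURES, BERT_DIM = 30, 14, 768  # illustrative dimensions

# LSTM branch: stock prices and technical indicators.
ts_in = layers.Input(shape=(TIME_STEPS, N_FEATURES), name="time_series")
x = layers.LSTM(500, return_sequences=True)(ts_in)
x = layers.LSTM(100)(x)
x = layers.BatchNormalization()(x)

# FinBERT branch: embedding of the day's most representative headline.
txt_in = layers.Input(shape=(BERT_DIM,), name="finbert_embedding")
y = layers.Dense(80, activation="relu")(txt_in)
y = layers.Dense(36, activation="relu")(y)

# Fusion: concatenation followed by dense layers and a single sigmoid unit.
z = layers.Concatenate()([x, y])
z = layers.Dense(40, activation="relu")(z)
z = layers.Dense(60, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid", name="next_day_trend")(z)

model = Model(inputs=[ts_in, txt_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The model can then be trained by passing the two aligned inputs, e.g., model.fit([X_time_series, X_embeddings], y_trends, ...).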

Table 1 Details on hyperparameter tuning: LSTM branch
Table 2 Details on hyperparameter tuning: FinBERT branch
Table 3 Details on hyperparameter tuning: Multimodal Fusion branch

To this end, LSTM and FinBERT base models are optimized independently, and are then combined in the multimodal architecture, which is further optimized. The hyperparameters used in this process are reported in Tables 1, 2, and 3, respectively. The values considered in the hyperparameter optimization stage are motivated by works in the literature providing effective heuristics. Specifically, for the learning rate, the authors in [53] suggest starting from a default value of 0.01 and experimenting with a decreasing factor (negative power of 10), where \(10^{-6}\) is considered an extremely small value. Dropout and batch normalization have been widely recognized as beneficial regularization techniques to reduce overfitting in neural networks. As for the dropout rate, works in [54] and [55] have shown that values below 0.5 should be preferred, in order to avoid the removal of too many neurons that would cause an under-learning phenomenon. The number of LSTM and dense layers, as well as the number of neurons in each layer, also plays an important role, where multiple layers generally allow the model to learn higher-level representations. Although their ideal values can be domain specific, a restricted set of multiples of 2 (layers) and 20 (neurons) usually provides good candidates [56]. Models are optimized on the training data (first \(75\%\) of days), and the best-performing architecture for the base models and for the multimodal branch is then selected to conduct actual experiments with all stocks.

Our study’s priority is a diverse and comprehensive evaluation of model performance. Considering that all data modalities are required for evaluating the proposed multimodal model, and that financial news is unavailable before 2021, we focus on two representative evaluation periods: uptrend (June to December 2021) and downtrend (January to September 2022). This choice allows us to showcase and discuss model performance in different market conditions. For each day in the evaluation period, models are trained up to the previous day and predict the stock trend for the current day, leading to a sliding window fine-tuning and evaluation approach. For the LSTM branch, the complete set of hyperparameters optimized in our experiments is shown in Table 1. The complete set of hyperparameters optimized for the FinBERT branch is shown in Table 2. To this end, instead of using grid search, we apply the Keras Tuner to help us select the best hyperparameter values. As for the multimodal fusion branch, the complete set of hyperparameters is shown in Table 3.
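
As an illustration of this tuning step, the sketch below uses Keras Tuner on the LSTM branch; the searched ranges, input shape, and tuner settings are placeholders and do not correspond exactly to the values listed in Tables 1, 2, and 3.

```python
# Hedged sketch of hyperparameter selection with Keras Tuner (placeholder ranges).
import keras_tuner as kt
import tensorflow as tf


def build_lstm_branch(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(hp.Int("units", min_value=100, max_value=500, step=100),
                             input_shape=(30, 14)),          # illustrative window/features
        tf.keras.layers.Dropout(hp.Choice("dropout", [0.1, 0.2, 0.3, 0.4])),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model


tuner = kt.RandomSearch(build_lstm_branch, objective="val_accuracy",
                        max_trials=20, project_name="lstm_branch_tuning")
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=10)
# best_model = tuner.get_best_models(num_models=1)[0]
```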

Once the model combines the information from these two data modalities and undergoes optimization, it is ready to make predictions regarding the next-day trends in the stock market. The model’s prediction is represented by a single neuron with a sigmoid activation function, a common choice for binary classification tasks. This neuron outputs a value between 0 and 1, serving as an indicator of the likelihood or probability of a specific stock’s next-day trend. A value closer to 1 suggests a higher probability of a positive trend, while a value closer to 0 indicates a higher likelihood of a negative trend in the stock’s performance.

The dataset used in our work and the implementation of the proposed approach are publicly available at the following online repository: https://github.com/rcorizzo/finbert-lstm.

4 Experiments

In this section, we provide a description of the experimental setup employed in our study. It encompasses the configuration of hyperparameters used for the various models, and the evaluation metrics used in our evaluation. Subsequently, we discuss the outcomes of our experiments in relation to the accuracy achieved by the different models. Lastly, we focus our attention on portfolio analysis, where we examine the consequences of utilizing model predictions as triggers for buying and selling decisions in real market scenarios.

4.1 Setup

In our experiments, we compare our results with popular baselines. For baselines, suitable values for hyperparameter tuning were chosen based on configurations reported as effective in the original papers or following the above-discussed rationale used for our proposed model, performing grid search using solely historical training data. For conciseness, in the following, we report the final best-performing configurations adopted in the experimental analysis:

  • ARIMA [57]: ARIMA, an abbreviation for autoregressive integrated moving average, is a forecasting approach characterized by three constituent components, each of which is controlled by a specific parameter. These components include the count of autoregressive terms denoted as (p), the quantity of nonseasonal differencing steps required to achieve stationarity marked as (d), and the number of lagged forecast errors integrated into the prediction equation represented by (q). The comprehensive ARIMA model can be formally expressed as follows:

    $$\begin{aligned} y'_{t} = c + \phi _{1}y'_{t-1} + \cdots + \phi _{p}y'_{t-p} + \theta _{1}\varepsilon _{t-1} + \cdots + \theta _{q}\varepsilon _{t-q} + \varepsilon _{t}, \end{aligned}$$
    (7)

    where \(y'_t\) represents a differenced time series, subject to differencing operations multiple times. The predicted values on the right-hand side encompass both past \(y'_t\) values and previous prediction errors. We use this widely known autoregressive method to forecast the closing price of each stock for the following day, using historical price data. The key aspect of this approach is to identify price trends by relying solely on the target variable of interest. The prediction is then transformed into a binary format, i.e., either an uptrend or a downtrend, by comparing the predicted value (next day) with the most recently observed closing price of the stock (current day). To automatically determine the best configuration for the parameters (p, d, q) based on historical training data, we use the Auto-ARIMA implementation provided in “pmdarima.”Footnote 4

  • GBTs [58]: Gradient boosted trees (GBTs) are a competitive ensemble method within machine learning algorithms. In essence, GBTs blend “weak” machine learning models, such as decision trees, into a more resilient and precise ensemble machine learning model. The fundamental principle of gradient boosting involves iteratively enhancing the model’s predictions by training it to minimize the residual error of previous base models. Concretely, an incremental training strategy is employed, where one tree is acquired in each iteration. For every individual example, its prediction is generated by each model, and the final prediction is computed as the summation of the scores provided by all models. To be more precise, the objective function governing the optimization of tree boosting can be formally expressed as follows:

    $$\begin{aligned} {\hbox {obj}}^{(t)}= & {} \sum _{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum _{i=1}^{t}\Omega (f_{i}) \end{aligned}$$
    (8)
    $$\begin{aligned}= & {} \sum _{i=1}^n l(y_{i}, {\hat{y}}_{i}^{(t-1)} + f_{t}(x_{i})) + \Omega (f_{t}) + {\textrm{c}}, \end{aligned}$$
    (9)

    where l represents a differentiable convex loss function quantifying the disparity between the prediction, denoted as \(\hat{y_i}\), and the target value, \(y_i\). Additionally, \(\Omega \) is used to impose a penalty on the model’s complexity through a regularization term, which serves the purpose of mitigating overfitting. Within our methodology, this model is harnessed for the purpose of identifying underlying patterns within the multi-dimensional feature space and uncovering nonlinear associations among price data, statistical indicators, and future price trends. Specifically, we adopt the implementation of Gradient Boosted Trees provided by scikit-learn, accessible at the following link.Footnote 5 We proceed to train the model using the following hyperparameter configuration: \(\{n\_estimators=50, learning\_rate=1.0, max\_depth=10\}\). For all other parameters, we adhere to the default configuration recommended by the method’s documentation in scikit-learn.

  • LSTM [36]: The model is optimized via gradient descent using the Adam optimizer and the binary cross-entropy loss. The base LSTM model considered consists of two LSTM layers (500 and 100 units, respectively), a dropout layer, and a dense layer with a single neuron and sigmoid activation function to predict next-day trends.

  • Polarity [41]: In this work, we utilize sentiment analysis predictions acquired through the EODHD Financial Data API.Footnote 6 The sentiment analysis encompasses four distinct categories: polarity, negative, neutral, and positive, with each score ranging from -1 to 1. Upon obtaining sentiment scores for all news articles gathered about a specific stock on a given day, we proceed to consolidate them and derive a binary label serving as the global indicator for that stock on that particular day. Specifically, if the cumulative polarity score is above a 0.4 threshold, we assign the label “1”; otherwise, it is designated as “0.” The underlying rationale for this binary classification is rooted in the idea that positive media coverage could potentially signify an imminent uptrend. In contrast, negative media coverage implies uncertainty and skepticism, which may lead to a downtrend in the stock’s performance.

  • Bi-LSTM [38]: The model is optimized via gradient descent using the Adam optimizer and the binary cross-entropy loss. The base model consists of two LSTM layers (500 units). Subsequently, a bidirectional LSTM layer (128 units) with a dropout rate of 0.8 is employed to enhance the model’s ability to learn complex patterns bidirectionally, while dropout regularization helps mitigate overfitting. Finally, a dense layer with a single unit and sigmoid activation is added to produce output predictions.

  • CNN Seq2Seq [42]: The model is a sequence-to-sequence architecture for time series data. The encoder starts with a 1D convolutional layer (128 filters, ReLU activation), followed by max pooling. An LSTM layer (128 units) captures temporal dependencies with 0.8 dropout regularization. A RepeatVector layer prepares the encoded representation. In the decoder, dilated convolutional layers replace standard ones: a dilated convolutional layer (128 filters, ReLU activation, dilation rate 2) is followed by max pooling. An LSTM layer (128 units) decodes temporal information with a dropout rate of 0.8. Finally, a time-distributed dense layer (single unit, sigmoid activation) generates output predictions for each time step.

  • Attention-CNN-LSTM [39]: The model architecture includes an encoder–decoder structure for time series data. The encoder starts with a 1D convolutional layer (128 filters, ReLU activation) for feature extraction, followed by max pooling. An LSTM layer (128 units) captures temporal dependencies with 0.8 dropout regularization. An attention mechanism enhances performance by focusing on relevant information, concatenated with encoder output for enriched representation. The decoder includes a similar convolutional layer, max pooling, and LSTM for decoding. Dropout regularization is applied again. Finally, a time-distributed dense layer (sigmoid activation) generates output predictions for each time step. This architecture integrates convolutional and LSTM layers, dropout regularization, and attention for effective time series processing and prediction.

  • Dilated CNN Seq2Seq [40]: The adopted model is a sequence-to-sequence architecture for time series data. The encoder consists of a 1D convolutional layer with 128 filters and ReLU activation, followed by max pooling. An LSTM layer with 128 units and dropout rate of 0.8 captures temporal dependencies, and a RepeatVector layer prepares the encoded representation for each time step. In the decoder, dilated convolutional layers replace standard convolutional layers. The first layer has 128 filters, ReLU activation, and a dilation rate of 2 for broader contextual information. Max pooling retains relevant features. An LSTM layer with 128 units decodes temporal information, with dropout rate of 0.8. Finally, a time-distributed dense layer with a single unit and sigmoid activation generates output predictions for each time step.

  • CNN-LSTM [42]: CNNs are effective for learning from time series data, with 1D convolutional layers filtering noise and extracting features. Causal convolution ensures influence only from previous time steps. RNNs excel in sequential learning tasks. We compare our model with CNN-LSTM, which combines a 1D CNN with an LSTM and features a convolutional layer, an LSTM layer, batch normalization, dropout, and a dense layer. Various model variants were explored to identify optimal parameters: hidden layers (1 and 2), neurons (64 and 128), batch sizes (32 and 64), and dropout rates (0.2 and 0.5). The best-performing CNN-LSTM has a convolutional layer with 32 filters of size 3, causal padding, and ReLU activation, an LSTM with 128 units and tanh activation, followed by batch normalization, dropout (rate 0.2), and a dense layer with ReLU activation.

  • GRU-CNN [37]: The GRU-CNN model combines GRU and 1D CNN, offering simpler training and improved performance. Parameters are similar to CNN-LSTM. A key difference lies in the arrangement of RNN and CNN layers. GRU-CNN is composed of a GRU layer (128 units, tanh activation), a 1D convolutional layer (32 filters, size 3, stride 1, causal padding, ReLU activation), global max pooling, batch normalization, a dense layer (10 units, ReLU activation), dropout (rate 0.2), and a dense layer (prediction window size, ReLU activation). The GRU layer returns a sequence, whereas global max pooling retains significant features and reduces dimensionality.

  • FinBERT-LSTM (Proposed): The best-performing LSTM branch was obtained with two LSTM layers with 500 and 450 hidden units, interleaved by a normalization layer, followed by a batch normalization layer, and a dense representation layer of size 72. The best-performing FinBERT branch, following its standard architecture, was obtained with a normalization layer followed by two dense layers of size 80 and 36, each interleaved by a normalization layer. The best-performing architecture following the fusion of the two modalities was obtained with two dense layers of size 40 and 60, interleaved by normalization layers.

For a quantitative evaluation of our results, we adopt conventional classification metrics such as Precision, Recall, and F1-Score, defined as:

$$ {\text {Precision}} = \frac{T_p}{T_p+F_p}; \ \ \ {\text {Recall}} = \frac{T_p}{T_p + F_n}; \ \ F1-{\text {Score}} = 2 \times \frac{{\text {Precision}} \times {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}}, $$

where \(T_p\) is the number of true positives, \(F_p\) is the number of false positives, and \(F_n\) is the number of false negatives. Specifically, we adopt micro-averaged variants of Precision, Recall, and F1-Score, i.e., metrics are computed globally by considering each element of the label indicator matrix as a label.
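
For reference, these micro-averaged metrics can be computed with scikit-learn as follows (the prediction vectors are illustrative):

```python
# Micro-averaged Precision, Recall, and F1-Score on illustrative trend labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # observed next-day trends (1 = uptrend, 0 = downtrend)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")
```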

4.2 Model accuracy

In our analysis of different models’ performance on stock trend prediction, we aim to provide a broad evaluation and showcase model performance on a general set of real-world stocks with great diversity. To this end, we employ data from 12 real-world stocks in different sectors: Communication Services (ATVI, NFLX), Consumer Discretionary (SBUX, TSLA), Information Technology (NVDA, AAPL), Real Estate (AMT, PLD), Financials (NDAQ, SCHW), Healthcare (JNJ, BIO). One relevant perspective for results analysis is the model’s classification accuracy on one-day-ahead stock trend prediction. In the following, we describe summarized results by financial sector. A more fine-grained discussion is provided in the Appendix.

4.2.1 Communication services

For ATVI (Activision Blizzard), the best-performing model in uptrend market conditions is Polarity with an F1-Score of 0.63. In downtrend conditions, the proposed model performs best with an F1-Score of 0.53. The worst-performing models are GRU-CNN in downtrend (0.28) and Dilated CNN Seq2Seq in uptrend conditions (0.38). For NFLX (Netflix) in uptrend conditions, the best model is GBTs with an F1-Score of 0.54. In downtrend predictions, GBTs and ARIMA perform best with an F1-Score of 0.50. The worst models are Bi-LSTM (0.38) for uptrends and CNN-LSTM, CNN Seq2Seq (0.39) for downtrends.

4.2.2 Consumer discretionary

For SBUX (Starbucks), the best models for uptrends are ARIMA and GBTs with an F1-Score of 0.52. For downtrends, the best is Dilated CNN Seq2Seq (0.57). The worst models are GRU-CNN (0.37) for uptrends and CNN-LSTM (0.37) for downtrends. For TSLA (Tesla), the best uptrend models are Dilated CNN Seq2Seq, CNN Seq2Seq, and LSTM with an F1-Score of 0.56. For downtrends, Polarity performs best (0.55). The worst models are GRU-CNN (0.27) in uptrend and CNN-LSTM (0.31) in downtrend conditions.

4.2.3 Information technology

For NVDA (Nvidia), the best-performing uptrend model is Bi-LSTM with an F1-Score of 0.50. For downtrends, LSTM and Polarity lead with an F1-Score of 0.50. The worst-performing model in both conditions is CNN-LSTM (0.35 for uptrend and downtrend). For AAPL (Apple), Polarity is the best for uptrend with an F1-Score of 0.55. ARIMA leads in downtrend predictions with an F1-Score of 0.55. The worst models are ARIMA (0.40) for uptrends and GRU-CNN (0.41) for downtrends.

4.2.4 Real estate

For AMT (American Tower), the best uptrend model is GBTs with an F1-Score of 0.51. For downtrends, GBTs also lead with 0.54. Several models tie for the worst performance in uptrends (F1-Score of 0.40), while Bi-LSTM (0.40) is the worst for downtrends. For PLD (Prologis), ARIMA is the best model for uptrends with an F1-Score of 0.59. For downtrends, GRU-CNN performs best (0.60). The worst models are Attention-CNN-LSTM (0.48) for uptrends and Polarity (0.30) for downtrends.

4.2.5 Financials

For NDAQ (Nasdaq), ARIMA is the best for uptrends with an F1-Score of 0.58. GBTs perform best for downtrends (0.52). The worst model is Polarity (0.39 for uptrends and 0.36 for downtrends). For SCHW (Charles Schwab), GBTs is the best model for uptrends with an F1-Score of 0.55. For downtrends, GBTs lead with 0.53. The worst models are CNN-LSTM (0.36) for uptrends and Attention-CNN-LSTM, Bi-LSTM (0.37) for downtrends.

4.2.6 Healthcare

For JNJ (Johnson & Johnson), the best uptrend model is GBTs with an F1-Score of 0.56. ARIMA leads in downtrend predictions with an F1-Score of 0.54. The worst models are Polarity (0.49) for uptrends and GRU-CNN (0.38) for downtrends. For BIO (Bio-Rad Laboratories), the best-performing model for uptrends is GRU-CNN with an F1-Score of 0.56. For downtrends, GRU-CNN again leads with 0.56. The worst models are Attention-CNN-LSTM, Polarity (0.40) for uptrends and Polarity (0.26) for downtrends.

Analyzing the performance of different models for each stock provides insights into their effectiveness in predicting stock trends. Considering base models performance with different stocks, GBTs emerge as the most robust approach across all stocks, followed by LSTM, Polarity, and ARIMA. We also observe that Polarity performs particularly well in predicting TSLA stock trends, while LSTM shows competitive performance in AAPL predictions. However, experimental results in Tables 4, 5 and 6 show that different models exhibit varying performance in capturing uptrends and downtrends in stock prices.

In uptrend market conditions, GBTs consistently achieve the highest F1-Scores across multiple stocks, indicating their effectiveness in capturing upward price movements. ARIMA and the proposed model also demonstrate a competitive performance in predicting uptrends for certain stocks. However, convolutional neural network-based models such as CNN-LSTM and GRU-CNN consistently underperform other approaches across various stocks and market conditions.

In downtrend market conditions, however, the proposed model outperforms other models, as it consistently achieves the highest F1-Scores across various stocks, showcasing strength in capturing downward price movements. Overall, considering both uptrend and downtrend predictions, the proposed model emerges as the top-performing approach to capture the overall stock market trend. It excels in predicting both upward and downward price movements, representing a robust choice for stock trend analysis. GBTs, ARIMA, Polarity, LSTM, Bi-LSTM, CNN Seq2Seq, Attention-CNN-LSTM, Dilated CNN Seq2Seq, CNN-LSTM, and GRU-CNN can, in some cases, also provide a competitive performance, although their level of reliability significantly varies across different stocks.

One possible explanation for the generally low performance of CNN-based models is the noisy/highly fluctuating nature of financial data, as well as the lack of spatial dependencies in the data, which makes CNN-based models ineffective. On the other hand, models that focus on temporal dependencies (LSTM, ARIMA), models that are more robust to noise (GBTs), and models that consider a holistic combination of data modalities (Proposed) appear more effective in predicting stock trends. Another relevant observation from a bias-variance perspective is that ARIMA models are typically characterized by low variance, while GBTs are flexible enough to capture complex patterns without overfitting excessively, leading to more accurate predictions than CNN-based models.

It is important to note that the performance of these models may be influenced by changing macroeconomic conditions and concept drift in the considered time frames. The upward trajectory of stock prices in June 2021 can be attributed to a confluence of economic factors. Firstly, the global economy was in a recovery phase after the severe economic downturn triggered by the COVID-19 pandemic. Governments and central banks across the world had implemented a series of monetary and fiscal policies to stimulate economic growth, which boosted investor confidence. Secondly, companies began releasing favorable earnings reports during this period. Positive financial performance exceeding market expectations can act as a catalyst for stock price appreciation as it instills investor optimism. Thirdly, the persistently low interest rates in 2021 made stocks an attractive investment option due to the relatively higher potential returns when compared to low-yield fixed-income securities. Furthermore, sectors such as technology and growth stocks exhibited considerable appeal to investors, with their higher growth potential contributing to increased stock prices. Lastly, speculative trading activities, epitomized by events like the GameStop and AMC short squeezes, generated significant retail investor participation and volatility, influencing stock price movements.

The decline in stock prices during 2022 can be attributed to a combination of economic and market factors. First and foremost, economic conditions, particularly the concerns over rising inflation and interest rates, were central to the reduced attractiveness of stocks. The expectation of increasing inflation and interest rates led investors to consider alternative investments that offered better protection against eroding purchasing power and higher fixed returns. Secondly, geopolitical events, such as trade disputes and international conflicts, introduced uncertainty into the market, undermining investor sentiment. Geopolitical tensions can result in market volatility, which has a detrimental impact on stock prices. Additionally, corporate earnings played a pivotal role in driving down stock prices. Companies reporting weaker-than-expected earnings, often compounded by supply chain disruptions and increased production costs, faced downward pressure on their stock prices. Moreover, central bank policies, including potential interest rate hikes, can lead to stock market declines, as higher borrowing costs and reduced liquidity negatively affect equity valuations. Finally, market sentiment, determined by factors like fear, uncertainty, and pessimism, significantly influenced stock market performance in 2022, contributing to the overall decline in stock prices.

Table 4 One-day-ahead stock prediction performance in Uptrend (July 1, 2021 to Dec 31, 2021—left) and Downtrend (Jan 1, 2022 to Sep 20, 2022—right) market conditions in terms of Precision, Recall, and F1-Score (micro average) with all analyzed methods and stocks (ATVI, NFLX, SBUX, TSLA)
Table 5 One-day-ahead stock prediction performance in Uptrend (July 1, 2021 to Dec 31, 2021—left) and Downtrend (Jan 1, 2022 to Sep 20, 2022—right) market conditions in terms of Precision, Recall, and F1-Score (micro average) with all analyzed methods and stocks (NVDA, AAPL, AMT, PLD)
Table 6 One-day-ahead stock prediction performance in Uptrend (July 1, 2021 to Dec 31, 2021—left) and Downtrend (Jan 1, 2022 to Sep 20, 2022—right) market conditions in terms of Precision, Recall, and F1-Score (micro average) with all analyzed methods and stocks (NDAQ, SCHW, BIO, JNJ)

To validate the statistical significance of our results, we adopt Wilcoxon signed rank tests on all pairwise combinations of methods across multiple executions, obtained considering the average F1-Score with different stocks. Based on the results reported in Table 7, we can infer that in downtrend market conditions, the proposed method outperforms 9 out of 10 baselines in terms of F1-Score, and 6 of 10 comparisons are statistically significant. In the single case where one baseline (GBTs) outperforms the proposed method, this comparison is not statistically significant (p-value of 0.3556). In uptrend market conditions, the proposed method outperforms 6 out of 10 baselines in terms of F1-Score, and 3 of these 6 comparisons are statistically significant. In the 4 other cases where baselines (ARIMA, LSTM, GBTs, and Polarity) outperform the proposed method, none of the comparisons are statistically significant (p-values: 0.7508, 0.0606, 0.0606, and 0.2366).
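
For completeness, each pairwise comparison can be reproduced with SciPy as sketched below; the paired per-stock F1-Scores are placeholders rather than values from Table 7.

```python
# Wilcoxon signed rank test on paired per-stock F1-Scores of two methods (placeholder data).
from scipy.stats import wilcoxon

f1_proposed = [0.53, 0.50, 0.55, 0.52, 0.54, 0.51]
f1_baseline = [0.48, 0.47, 0.50, 0.49, 0.46, 0.50]

stat, p_value = wilcoxon(f1_proposed, f1_baseline)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")
```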

Table 7 Average performance of different methods across multiple stocks (F1-Score) and statistical analysis with Wilcoxon signed rank tests (p-value) comparing all pairwise combinations of methods
Table 8 Simulated portfolio gains (ATVI stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 9 Simulated portfolio gains (NFLX stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 10 Simulated portfolio gains (SBUX stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 11 Simulated portfolio gains (TSLA stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 12 Simulated portfolio gains (NVDA stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 13 Simulated portfolio gains (AAPL stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 14 Simulated portfolio gains (AMT stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 15 Simulated portfolio gains (PLD stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 16 Simulated portfolio gains (NDAQ stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)

4.3 Portfolio analysis

In our experiments, we begin with a budget of 10,000 USD and allocate the entire amount to purchase as many shares as possible at the prevailing market price on the first day. The “Max Shares” parameter, with values of 1, 5, or 10, determines the number of shares to be bought or sold each day based on the trend predictions generated by the different models. When a model predicts a downtrend for the following day, we sell the desired number of shares at the current market price to mitigate potential losses. This action increases the available USD balance in the portfolio, which can later be reinvested in purchasing additional shares. Conversely, if a model predicts an uptrend and the available USD balance is positive, we use that balance to buy a number of shares that is equal to or less than the desired number.

In cases where a model predicts an uptrend, but there is no available USD balance, we choose to hold onto all the previously purchased shares. At the conclusion of the simulation, all shares held in the portfolio are sold at their respective market prices. The absolute value of the portfolio in USD at this point is used to calculate the gains or losses relative to the initial budget of 10,000 USD.
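
For clarity, the following minimal Python sketch illustrates the simulation logic described above. The function name, the optional per-transaction fee parameter, and all variable names are illustrative and do not correspond to the identifiers used in our public code implementation.

def simulate_portfolio(prices, predictions, max_shares, budget=10_000.0, fee=0.0):
    """prices[t]: closing price on day t; predictions[t]: trend predicted
    for day t+1 ('up' or 'down'); fee: optional per-transaction cost (assumed)."""
    # Initial allocation: spend the whole budget at the first day's price.
    shares = int(budget // prices[0])
    cash = budget - shares * prices[0]
    for price, trend in zip(prices, predictions):
        if trend == "down" and shares > 0:
            # Sell up to max_shares shares at the current price to mitigate losses.
            sold = min(max_shares, shares)
            shares -= sold
            cash += sold * price - fee
        elif trend == "up" and cash > 0:
            # Reinvest the available balance in at most max_shares shares.
            bought = min(max_shares, int(cash // price))
            if bought > 0:
                shares += bought
                cash -= bought * price + fee
        # If an uptrend is predicted but no balance is available, hold all shares.
    # At the end of the simulation, the remaining shares are sold at the last price.
    final_value = cash + shares * prices[-1]
    return final_value - budget, 100.0 * (final_value - budget) / budget

Running such a simulation with max_shares set to 1, 5, and 10 would mirror the three “Max Shares” configurations reported in Tables 8 to 19.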

In our simulation, we do not consider trading fees, as some brokers, such as Charles Schwab, do not apply them to online transactions and profit from different revenue streams. A custom trading fee can easily be incorporated using our public code implementation. In our experiments, we perform two separate trading simulations over the two time frames covered in our performance analysis: uptrend (July 1, 2021 to Dec 31, 2021) and downtrend (Jan 1, 2022 to Sep 20, 2022). The findings of our portfolio analysis are outlined in Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19, while Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13 offer time series visualizations depicting the closing prices of each stock over time, alongside the buy and sell signals generated by all models. To maintain conciseness in our discussion, we present the plot for the most effective model, as determined by the portfolio results.

Examining the time series plots provides valuable insights into the market conditions of the various stocks over the overall time frame of our analysis (both uptrend and downtrend). For ATVI, Fig. 2 reveals that the stock started at \(\$95\), declined until the middle of the evaluation time frame (with a bottom at \(\$55\)), then rose to \(\$82.5\), after which the price retraced to about \(\$75\), lower than on the first day of the evaluation. Moving to NFLX, Fig. 3 reveals that this stock observed a significant downtrend throughout the entire evaluation time frame, starting at over \(\$525\) per share and ending at \(\$210\). SBUX exhibited a pattern of uncertainty similar to ATVI (Fig. 4), with a starting price of \(\$112.5\) and a fast uptrend phase up to \(\$126\), followed by a strong downtrend that led to a bottom price of \(\$70\) and a subsequent sideways phase. The stock recouped its value in the final uptrend phase, ending at \(\$92\), higher than on the first evaluation day. For TSLA, Fig. 5 reveals that the stock started at \(\$225\) and observed a sideways phase followed by an uptrend phase, reaching a maximum price above \(\$400\). Subsequently, the stock retraced within the \(\$300-\$375\) channel and then trended down with lows at \(\$250\) and \(\$200\), recouping to \(\$300\) toward the end of the evaluation time frame. In the case of NVDA (Fig. 6), the stock exhibited an initial value of \(\$200\) and experienced an uptrend, reaching a peak of \(\$340\) around the midpoint of the evaluation period. Afterward, it encountered fluctuations and declined to \(\$280\) before recovering to \(\$300\), higher than its value on the first evaluation day. AAPL (Fig. 7) displayed a substantial initial uptrend, starting at approximately \(\$135\) per share and reaching \(\$180\). Following this, fluctuations led to a dip to \(\$130\), with a subsequent recovery to \(\$150\), slightly higher than its value on the initial evaluation day. AMT showed a similar pattern of uncertainty (Fig. 8), with an initial price of \(\$270\) followed by a rapid uptrend phase up to \(\$305\). A strong downtrend then resulted in a bottom price of \(\$230\), leading to a subsequent sideways phase. Toward the end of the evaluation period, the stock recovered its value during the final uptrend phase, ending at \(\$245\), higher than its value on the first evaluation day. In the case of PLD (Fig. 9), the stock displayed an initial value of \(\$120\) and experienced an uptrend, reaching a peak of \(\$172\) around the midpoint of the evaluation period. Subsequently, it encountered fluctuations and declined to \(\$110\) before recovering to \(\$112\), lower than its value on the first evaluation day. NDAQ (Fig. 10) showed a significant initial uptrend, commencing at around \(\$58\) per share and reaching \(\$73\). Fluctuations then caused a dip to \(\$47.5\), followed by a recovery to \(\$61\), a value similar to that of the first evaluation day. Examining SCHW (Fig. 11), the stock started at \(\$74\) and underwent an uptrend, peaking at \(\$95\) around the midpoint of the evaluation period. Following this, fluctuations led to a decline to \(\$60\) before a recovery to \(\$74\), a value similar to that on the first evaluation day. JNJ exhibited a comparable pattern of uncertainty (Fig. 12), with an initial price of \(\$165\) followed by a rapid uptrend phase up to \(\$180\). A significant downtrend then ensued, resulting in a bottom price of \(\$155\), followed by a subsequent sideways phase. Toward the end of the evaluation period, the stock recovered its value during the final uptrend phase, ending at \(\$169\), higher than its value on the first evaluation day. Lastly, BIO (Fig. 13) displayed a significant initial uptrend, starting at approximately \(\$650\) per share and reaching \(\$825\). Fluctuations then caused a dip to \(\$710\), followed by a retracement to \(\$450\), significantly lower than its value on the first evaluation day.

In analyzing model profitability, the findings presented in Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19 indicate that the proposed multimodal approach outperforms the individual base models. This observation is particularly significant because it shows that the accuracy rankings of the different models, discussed in the previous section, do not necessarily align with their profitability. A notable example of this phenomenon is AAPL (see Table 13), where the proposed model achieves a lower F1-Score than the best-performing ARIMA base model (see Table 4), yet obtains the largest gains, averaging \(13.49\%\) in the downtrend time frame, notably higher than the \(-9.75\%\) obtained with ARIMA.

This phenomenon can be explained by the fact that binary performance metrics, such as the F1-Score, assign equal weight to correctly predicting the trend on every day, regardless of the magnitude of the price change occurring within that 24-hour time frame. As a result, a model that correctly predicts the trend on days characterized by significant price volatility can identify advantageous buying and selling opportunities, whereas a model that correctly forecasts the trend on a larger number of days on which prices are relatively stable may not generate a comparable level of profitability. This underscores the importance of incorporating additional considerations, such as price variability, when evaluating the performance and profitability of stock trend prediction models.
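
To make this concrete, the following toy example, built on entirely assumed returns and predictions, shows two models with identical day-level accuracy but very different outcomes under a simple long-only strategy.

import numpy as np

returns = np.array([0.001, -0.002, 0.08, 0.001, -0.09])  # assumed daily price changes
true_trend = (returns > 0).astype(int)                    # 1 = uptrend, 0 = downtrend

model_a = np.array([1, 1, 1, 0, 0])  # correct on the two volatile days (+8% and -9%)
model_b = np.array([1, 0, 0, 1, 1])  # correct only on the quiet days

for name, preds in [("A", model_a), ("B", model_b)]:
    accuracy = (preds == true_trend).mean()
    # Toy strategy: hold the stock only on days predicted as an uptrend, otherwise stay in cash.
    growth = np.prod(np.where(preds == 1, 1.0 + returns, 1.0))
    print(f"Model {name}: accuracy = {accuracy:.0%}, return = {growth - 1:+.1%}")

Both models are 60% accurate on these assumed data, yet model A gains roughly +7.9% while model B loses roughly 8.8%, because only model A predicts the correct sign on the high-volatility days.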

The superiority of the proposed multimodal approach is also evident in its performance on uptrends in NFLX and downtrends in NFLX, SBUX, TSLA, AAPL, NDAQ, and BIO. It is worth noting that our model is particularly conservative in a downtrend phase, resulting in better capital preservation when compared with a buy-and-hold strategy. In the following, we focus on significant examples where this phenomenon is observed.

For NFLX (see Table 9), the proposed multimodal approach achieves an average gain of \(19.42\%\), surpassing the \(14.22\%\) of the second-ranked Polarity model in uptrend market conditions. In the case of SBUX (see Table 10), the proposed multimodal approach achieves an average gain of \(7.11\%\) in downtrend market conditions, surpassing the \(6.75\%\) of the second-ranked Dilated CNN Seq2Seq model. In the case of TSLA (see Table 11), the proposed model stands out in downtrend market conditions as the only model with a positive average gain, exceeding CNN-LSTM, the second-best model for TSLA, by an impressive margin of \(13.75\%\). In the case of AAPL (see Table 13), the proposed model is again the best-performing model, with a favorable average gain that exceeds the second-ranked Dilated CNN Seq2Seq model by a notable margin of \(12.38\%\). Similarly, in the case of NDAQ (see Table 16), the proposed multimodal approach achieves an average gain of \(3.92\%\), outperforming the second-ranked Polarity model, which obtains an average gain of \(1.36\%\). For BIO (see Table 19), the proposed multimodal approach achieves an average gain of \(-3.75\%\), outperforming the second-ranked Attention-CNN-LSTM, which obtains an average gain of \(-4.44\%\).

In our analysis, it is noteworthy that in the uptrend time frame our proposed model outperforms the buy-and-hold strategy for two stocks (ATVI, NFLX). Furthermore, in the downtrend time frame, our model outperforms the buy-and-hold strategy for nine stocks (NFLX, SBUX, NVDA, AAPL, PLD, NDAQ, SCHW, JNJ, and BIO).
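
For reference, the buy-and-hold benchmark in this comparison can be summarized as follows, under the simplifying assumption that the whole budget is invested on the first day and liquidated on the last day, ignoring the small cash remainder left after buying an integer number of shares: \(n = \lfloor B / p_1 \rfloor\), \(G_{\text{BH}} = n\,(p_T - p_1)\), and \(G_{\text{BH}}^{\%} = 100 \cdot G_{\text{BH}} / B\), where \(B = 10{,}000\) USD is the initial budget and \(p_1\) and \(p_T\) are the closing prices on the first and last day of the simulation.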

From a theoretical viewpoint, the observed results reveal that analyzing multiple perspectives (such as stock prices, technical indicators, and sentiment in news headlines) through multimodal data fusion yields a model that requires stronger evidence before predicting a change in stock market trend. In contrast, baseline models that focus mostly on the temporal aspects of the data appear more sensitive. This explains why such baseline models may have an advantage in an uptrend phase, whereas our proposed model behaves more conservatively. As a consequence, our multimodal fusion model has more limited exposure in an uptrend phase but provides better capital preservation than the others in a downtrend phase.

In general, the superior performance observed for the proposed model can be attributed to its ability to integrate diverse data sources, extract complex and high-level features, capture nonlinear relationships, and learn in an end-to-end manner, leading to reduced bias and improved generalizability. These advantages enable the proposed multimodal model to provide more accurate and robust predictions compared to simpler models that might only utilize a single type of data or simpler feature extraction techniques.

Fig. 2 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (ATVI stock)

Fig. 3 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (NFLX stock)

Fig. 4 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (SBUX stock)

Fig. 5 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (TSLA stock)

Fig. 6 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (NVDA stock)

Fig. 7 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (AAPL stock)

Another interesting perspective is provided by analyzing the impact of the “Max Shares” parameter in our experiments. In the analysis of the various stocks during both uptrends and downtrends, several trends emerge, although these patterns are not consistent across all cases. During uptrends, multiple stocks, such as ATVI, NFLX, and SBUX, tend to perform better with a “Max Shares” value of 1, while AAPL, AMT, PLD, NDAQ, JNJ, and BIO achieve their best results with a “Max Shares” value of 10.

However, during downtrends, the best choice of the “Max Shares” value varies widely among the different stocks, with some favoring a value of 10 (e.g., ATVI, AAPL, SCHW) and others performing better with a value of 1 (e.g., NFLX, TSLA, NVDA, AMT, PLD, NDAQ, SCHW, JNJ, BIO). The performance of specific models, including CNN-LSTM, GRU-LSTM, LSTM, Polarity, ARIMA, and GBTs, also varies depending on the specific stock and market conditions. These patterns suggest that the relationship between “Max Shares” and model performance is stock-specific and influenced by unique price behaviors, highlighting the need for individualized analysis and tailored strategies for each stock.

Table 17 Simulated portfolio gains (SCHW stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 18 Simulated portfolio gains (JNJ stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Table 19 Simulated portfolio gains (BIO stock) in uptrend (left) and downtrend (right): absolute (USD) and relative (percentage) with respect to the initial investment, with all models and different Max Shares configurations (1, 5, 10)
Fig. 8 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (AMT stock)

Fig. 9 Stock prices with buy and sell signals extracted by all models during the evaluation period from July 1, 2021, to Sep 20, 2022 (PLD stock)

Fig. 10 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (NDAQ stock)

Fig. 11 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (SCHW stock)

Fig. 12 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (JNJ stock)

Fig. 13 Stock prices with buy and sell signals during the evaluation period from July 1, 2021, to Sep 20, 2022 (BIO stock)

5 Conclusion

In this article, we propose a novel multimodal deep learning method for financial time series forecasting. Our method addresses the primary challenge of next-day trend prediction in the financial sector through the joint exploitation of text and time series data. To this end, our model consists of a BERT-based branch fine-tuned on financial news and an LSTM branch that captures useful temporal patterns. Our extensive experiments on real-world stock market datasets showed that our proposed method is competitive with popular baselines in both uptrend and downtrend market conditions.

Our portfolio analysis showed that our method can be fruitfully adopted in a trading scenario, yielding positive gains in an uptrend phase and preserving capital in a downtrend phase, outperforming the other baselines as well as a buy-and-hold strategy. Possible limitations of our work include the reliance on a single source of news headlines and the fact that correlations between stocks are not exploited when predicting a single stock. In future research, we will study the exploitation of additional data modalities for next-day stock market trend prediction. Moreover, we will investigate the adoption of deep learning methods such as attention and graph convolution for this analytical task.