1 Introduction

The selection of financial assets marks the beginning of most decision-making processes in directional investment and trading, where portfolio selection centers on striking the best balance between risk and reward. The behavioral finance literature has extensively documented numerous financial uncertainties and market anomalies, such as calendar effects and momentum anomalies, which may lead to excess returns. Financial markets harbor risks and volatility that are unpredictable or difficult to assess. This uncertainty can stem from various factors, including political and economic conditions, monetary policy, and market demand and supply dynamics. Excessive financial uncertainty leads to market sentiment swings, unreasonable asset price fluctuations, and investor overreactions, ultimately impairing the efficient functioning of markets and capital allocation. To account for such aberrations in behavior, scholars have explored the use of modern artificial intelligence techniques, particularly in quantitative trading. This study proposes a momentum portfolio selection algorithm that incorporates machine learning-based ranking methods. The system models numerous factors pertaining to the state of the financial market and that of the wider economy using heterogeneous knowledge graph (KG) approaches, effectively integrating information on firm operations and the financial market.

Statistical analysis has traditionally been used for financial forecasting in investment. However, traditional linear regression models rest on an assumption of error term independence, which often fails to hold in financial time series data. To address this limitation, several approaches have been proposed [1,2,3,4]. Linear regression has the advantages of being easily interpretable, being able to mitigate overfitting through regularization, and allowing for efficient updating using stochastic gradient descent. Nonetheless, it cannot handle nonlinear relationships and lacks flexibility in complex model identification, which compromises its robustness.

Deep learning has become more accessible and powerful with advances in computer hardware. Deep learning (DL), a type of machine learning (ML), employs multiple layers of linear or nonlinear transformations, which allows features to be extracted from data automatically. Conversely, conventional ML approaches depend on human-designed features that require expert input; this process is termed feature engineering. DL’s ability to automatically extract data features and model complex nonlinear systems substantially enhances a model’s capabilities. DL’s applications in finance have expanded considerably in scope [5,6,7].

Traditional linear regression models, which base their forecasts on time series stock prices, aim to predict whether future stock returns are positive or negative. However, these models primarily focus on the price fluctuations of individual stocks, overlooking the relationships between them. Historically, most stock-related studies have treated each stock as an independent entity, neglecting the interdependence between stocks. However, the stock prices of companies in the same industry or supply chain often move together. Feng et al. [8] demonstrated the intricate relationships and informational interconnections between stocks. They employed the temporal graph convolutional (TGC) method to rank stocks, effectively capturing time-sensitive stock relationships. Their approach, which incorporated industry attributes and supply chain relationships, yielded promising results.

Beyond predicting future stock price returns or trends, studies have also focused on the behavior of ongoing trends. For instance, Jegadeesh and Titman [9] proposed the concept of the momentum effect, where a stock’s future returns are similar to its historical returns. Thus, investors can capitalize on this momentum effect by buying stocks with high historical returns and selling those with low returns, a tactic referred to as momentum investment. In this strategy, a portfolio contains a basket of strongly performing assets and is thus expected to offer lower risk and higher yields.

Most studies on stock price forecasting have focused solely on individual stock prices, often neglecting a heterogeneous variety of information. With regard to corporate governance, each company has major institutional shareholders responsible for directing and managing the company’s affairs. These shareholders, tasked with achieving the company’s mission and maximizing profits, play a crucial role in corporate governance. Their influence extends to the relationships between different companies, especially when they hold appointments in multiple companies simultaneously. The interconnections among these individuals can also influence stock price trends.

When major institutional shareholders hold positions across multiple companies, they often jointly influence the stock price trends of those companies. Despite this, research on utilizing information about major institutional shareholders for prediction purposes remains limited. In this study, a heterogeneous KG is constructed to represent the relationship chain among the major institutional shareholders of each company. This graph is combined with time series stock price data, using DL technology to extract relationship information between individual stocks, thereby enhancing the predictive accuracy of the model. Finally, stocks are ranked based on the model’s predictions, and a portfolio is constructed from the top and bottom percentiles of stocks. The contributions of this study are as follows:

  • The proposal of a novel multitask supervised learning approach based on the learning-to-rank algorithm for identifying investment opportunities.

  • The implementation of a KG embedding approach that incorporates company governance factors and financial domain knowledge to model heterogeneous relationships, aiming to enhance the accuracy of financial decision models.

  • The development of a momentum-based trading strategy that combines heterogeneous information and ranking results to create an optimal investment portfolio.

  • An empirical test of the framework using real-world trading market data, focusing on achieving the best Sharpe ratio and return, thereby enabling more effective portfolio allocation for investors.

The remainder of this study is structured as follows: Section 2 discusses the use of artificial intelligence (AI) in finance, explores the combination of AI trading strategies with momentum, and reviews the literature on heterogeneous knowledge graphs. Section 3 summarizes the proposed system, discusses abnormal market situations, and reviews the application of ranking in finance. This section also describes the data used in this study. Section 4 presents the data set and experimental design, including a comparison with various baseline methods. Section 5 concludes the study with a summary of its findings.

2 Related work

This section reviews the literature on the application of AI in the financial sector, the use of AI in momentum trading strategies, and heterogeneous knowledge graphs.

2.1 Applications of AI in finance

Linear and nonlinear methods for forecasting stock prices and trends have been formulated. DL technology has gained prominence in stock price analysis, resulting in notable advancements. For example, various DL models have been employed, such as long short-term memory (LSTM), recurrent neural networks, convolutional neural networks (CNN), and stacked autoencoders (SAE). These models often incorporate sliding window techniques to forecast stock prices and capture the underlying data dynamics [10,11,12,13,14,15,16]. DL methods have been applied to a range of financial problems beyond stock price prediction [17, 18]. The advantages of DL models include their ability to identify patterns in data, extract relevant features, achieve high levels of accuracy, and generalize well outside the training data.

Multitask learning, which processes multiple targets simultaneously, improves model performance by leveraging shared features among tasks. Mahmoud et al. [19] addressed the problem of data deficiency in personalized time series prediction by proposing a multitask framework with a novel convolutional recurrent neural network. Their approach utilized transfer learning for multitask learning models. Similarly, a study conducted in the same year focused on multitask time series forecasting using shared attention mechanisms, aiming to enhance the accuracy of forecasting multiple time series concurrently [20]. Furthermore, a high degree of success has been reported for hybrid adaptive and pre-training multitask learning [21, 22].

In addition to predicting stock price movements to maximize investor returns, certain studies have concentrated on ranking the projected prices and recommending top-K stocks based on these rankings, thereby aiding in risk diversification. For instance, Hsu et al. [23] proposed a recommendation model named financial graph attention networks (FinGAT) that leverages time series data on stock prices and sector information to generate return ratios and suggest the top-K profitable stocks for investors. Their model outperformed other models in evaluation experiments. Another article combined the concept of momentum with graph-based techniques to capture inherent relationships and dynamics within financial markets [24].

Thus, DL brings numerous advantages in financial forecasting; it can handle vast, complex financial data sets and autonomously learn and extract features. This study aimed to utilize multitask DL techniques to comprehend both linear and nonlinear relationships within stock data, thereby capturing additional implicit market behaviors and financial trading patterns.

2.2 AI Trading in momentum application

In a momentous 1993 study, Jegadeesh and Titman [9] demonstrated that a simple trading strategy of buying stocks that have recently outperformed (winners) and selling those that have recently underperformed (losers) yielded substantial returns on the US stock market. This study garnered widespread attention because it challenged the prevailing notion of stock market efficiency. The then-dominant hypothesis stated that the stock market is completely efficient; this means that all market information has already been priced in, leaving no room for profitable trading opportunities. However, Jegadeesh and Titman’s findings showed that this trading strategy still yielded robust returns, indicating that the stock market is not completely efficient.

Enhanced momentum trading strategies were later proposed by Takeuchi and Lee [25], who applied DL techniques to extract features from historical stock prices to predict future returns. Kim [26] also applied DL to enhance the effectiveness of a momentum-based investment strategy within the stock market. The traditional strategy selects the top 10% of stocks (winners) and the bottom 10% (losers). Kim tested this strategy against another in which only the top 10% (winners) are selected. Additionally, Kim proposed a new strategy where, if the expected return for the holding period is positive, a purchase is made for stocks in the top 10 percentile, and conversely, a sale is made for stocks in the bottom 10 percentile if the expected return is negative. Subsequent research integrated the learning-to-rank method to enhance cross-sectional momentum portfolios [27]. The results demonstrated that the ranking model significantly enhanced the trading performance of cross-sectional strategies, outperforming traditional approaches.

Traditional momentum-based strategies often struggle to leverage the vast and intricate information within the market, resulting in insufficient responses to rapidly changing market conditions. However, DL can be used to uncover latent patterns in the market by analyzing the complex interconnections embedded in the data, and thus yield higher returns in momentum trading. Subsequent studies have continued to enhance the predictive efficiency of momentum trading approaches. This study contributes to this effort by using DL to understand the interaction between momentum trading behavior and high-heterogeneity data. A pre-trained model was used for portfolio selection. In the following section, the foundational concepts and existing literature on heterogeneous KGs are detailed.

2.3 Heterogeneous KG Embedding

KG embedding primarily involves converting data into low-dimensional vector representations. These vectors are designed to represent semantic similarity and correlations between words or phrases, thereby aiding tasks in ML and DL. KG embedding has been applied in diverse domains [28], including finance [29,30,31,32].

Given interdependencies and price comovements between stocks, numerous studies have incorporated rich, real-world information about the relationships between companies to obtain more accurate stock price predictions. This information pertains to factors such as supplier and customer relationships, industry affiliation, and geographical location.

To enhance stock price predictions, researchers have incorporated information about a target company and its relationships with other companies. They constructed a graph using real-world data, such as suppliers, customers, industry, and location. Subsequently, the model was trained to develop a distributed representation of the nodes in the graph. The results indicated that the model accurately captured the relationships between corporations [33, 34]. Kim et al. [35] proposed a hierarchical attention network for stock prediction, named HATS, which consolidates information from various types of relationships to derive valuable node representations. HATS integrates pertinent company information into its learning process, enabling predictions not only of individual stock prices but also of broader trends in financial market indexes. Other scholars have conducted related research, contributing further to this field [36,37,38].

This study incorporates heterogeneous knowledge graph embedding techniques to model the relationships between stocks and relevant factors such as corporate governance and financial domain knowledge. KG embedding positions associated knowledge points in close proximity in a low-dimensional space, which enhances the efficiency of queries and matches of related knowledge. The subsequent section details the methodology.

3 Methodology

This study introduces a momentum portfolio selection algorithm that integrates machine-learning-based ranking methods. The system is designed to effectively utilize company operations data and domain expertise in the financial market. It models numerous finance and economics-related factors through heterogeneous KG approaches. This framework employs deep neural network (DNN) and CNN technologies to extract data features. The extracted features are then used to rank stocks within a given stock pool. Based on these rankings, the top and bottom percentiles of stocks are selected to construct a portfolio, which is subsequently evaluated for performance.

3.1 Primary

In this section, the primary baseline methods compared in this study are introduced, such as the dual-classifier method by Huang et al. and the relational stock ranking (RSR) model by Feng et al.

Huang et al. [39] proposed binary dual-classifier algorithms designed to accurately model overreactions and strengthen portfolio composition using contrarian trading strategies. The method combines financial knowledge with high-dimensional nonlinear models constructed through ML techniques, enabling the identification of financial time series patterns. Initially, the model selects the best and worst-performing 10% of stocks from the current investing stock pool. These stocks are then fed into separate classifiers, and the positive and negative labels of the winners and losers are inputted into a learning algorithm. This process updates the classifiers for the winners and losers. Subsequently, the outcomes of these classifiers are used to update the winner and loser portfolios. This approach is termed the dual-classifier method, and its framework is depicted in Fig. 1.

Fig. 1 The dual-classifier model framework

This study introduces a multitask framework model for portfolio selection, which incorporates the dual-classifier concept proposed by Huang et al. and employs a ranking approach to enhance the predictive capability of the model. Unlike traditional trend prediction tasks, this selection method prioritizes stocks that are likely to sustain momentum in the future.

In 2019, Feng et al. [8] developed the RSR framework for predicting stock prices. This framework used the TGC method to capture the relationships between stocks. The RSR architecture comprises three layers: the sequential embedding layer, the relation embedding layer, and the prediction layer. Figure 2 illustrates the overall structure of the RSR framework.

Fig. 2 Relational stock ranking framework

The foundation of the RSR framework is the belief that past stock price changes significantly influence future ones. This belief is reflected in the use of a sequential model that identifies dependencies between each stock price sequence. The first layer of the model is an LSTM model, which is capable of retaining long-term sequence memory. The last hidden layer of the LSTM model serves as input for the subsequent layer. To capture the relationships between stocks, the model includes a relation embedding layer that incorporates industry attributes and supply chain relationships. The rationale is that if two companies are in the same industry or share a supply chain, their stock prices should exhibit similar trends or a transmission effect. The final prediction layer combines information from the sequential embedding layer and the relation embedding layer, concatenates them, and then outputs the predicted return ratio, which represents the ranking scores. These scores are then used to construct an investment portfolio.

The study used data from the NASDAQ and NYSE markets, covering the trading period from February 1, 2013 to August 12, 2017, and including 1026 and 1737 stocks, respectively. To evaluate the model’s performance, three metrics were employed: mean square error (MSE), mean reciprocal rank (MRR), and cumulative investment return ratio (IRR). MSE is a widely used metric for evaluating regression tasks, whereas MRR is a metric for assessing ranking performance. The average reciprocal rank was computed for the stocks selected during the testing period. IRR, which directly reflects the effect of stock investment, was determined by summing the return ratios of the selected stocks. Smaller MSE values and larger MRR values indicate superior performance.
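The three metrics can be illustrated with a minimal sketch. It assumes that, on each test day, the single stock with the highest predicted return ratio is selected (a top-1 setting); the function names and the two-day sample data are hypothetical and serve only to show how the quantities are formed.

```python
def mse(predicted, actual):
    """Mean square error between predicted and realized return ratios."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def selected_index(predicted):
    """Index of the stock with the highest predicted return ratio (top-1 pick)."""
    return max(range(len(predicted)), key=lambda i: predicted[i])


def reciprocal_rank(predicted, actual):
    """1 / (true rank of the selected stock), where rank 1 is the best realized return."""
    pick = selected_index(predicted)
    true_rank = 1 + sum(1 for a in actual if a > actual[pick])
    return 1.0 / true_rank


def evaluate(days):
    """days: list of (predicted, actual) lists, one pair per test day."""
    n = len(days)
    avg_mse = sum(mse(p, a) for p, a in days) / n
    mrr = sum(reciprocal_rank(p, a) for p, a in days) / n
    # IRR: sum of the realized return ratios of the selected stocks.
    irr = sum(a[selected_index(p)] for p, a in days)
    return avg_mse, mrr, irr


# Hypothetical two-day example with three stocks.
days = [
    ([0.02, 0.01, -0.01], [0.03, 0.00, -0.02]),  # pick is the true best stock
    ([0.00, 0.02, 0.01], [0.01, 0.00, 0.02]),    # pick is the true worst stock
]
avg_mse, mrr, irr = evaluate(days)
print(avg_mse, mrr, irr)
```

In this example the first pick has true rank 1 and the second has true rank 3, so the MRR is the average of 1 and 1/3.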

Table 1 Results of the RSR model
Table 2 RSR model application in Taiwan’s stock market

Table 1 presents the LSTM model as the baseline for RSR. Conversely, RankLSTM represents Feng et al.’s proposed RSR model. As indicated in the table, augmenting RankLSTM with industry relations yielded superior performance compared with the LSTM method alone. To explore the effect of integrating multiple relations, the RSR framework proposed by Feng et al. was augmented with multiple KG relations. Specifically, data from North American exchange-traded funds (ETFs) were introduced into the existing KG relation, comprising 1653 ETFs and 6620 stocks. ETFs are investment funds traded on exchanges that mirror an index, a commodity, a bond, or a basket of assets. They offer diversified exposure and are bought and sold like stocks, providing investors with access to various markets, sectors, or asset classes in a cost-effective and transparent manner. The outcomes, detailed in the RankLSTM+KG section of Table 1, indicate that the incorporation of multiple relations enhanced the correlation among stocks and significantly increased the IRR value.

Next, different market environments are examined. Taiwan’s stock market is relatively small and volatile, which renders it susceptible to external factors. This financial uncertainty causes various fluctuations and unknown elements in the market, often influenced by political and economic news. As a result, unpredictable stock trends emerge, ultimately impacting the effective functioning of the market and capital allocation. Furthermore, the market is vulnerable to manipulation due to the substantial percentage of stocks held by major investors, some of whom may also hold positions within the companies in which they invest. Consequently, events such as elections or leadership changes within these companies can lead to substantial fluctuations in stock prices. Success in investing in Taiwan’s stock market necessitates close monitoring of these external factors and an awareness of the potential for sudden and substantial changes.

In our study, the RSR model was employed to assess the performance of the Taiwan stock market using data from the Taiwan Economic Journal (TEJ) investment database. The same features as those employed by Feng et al. were used, including closing prices and 5-, 10-, 20-, and 30-day moving averages. These experimental tests are referred to as RankLSTM in Table 2. Additionally, fundamental corporate information for individual stocks was obtained and the effects of including industry relations were assessed, denoted as RankLSTM+KG (Industry) in the table. Data on boards of directors, senior managers, and major institutional shareholders were also introduced for assessment, denoted as RankLSTM+KG (Shareholder) in the table.

As illustrated in Table 2, the RankLSTM model proposed by Feng et al. achieved the best results in the top1, top5, and top10 evaluations. Its performance was further improved by augmenting the RankLSTM model with an industry KG beyond simply training on stock prices. Specifically, adding information on major institutional shareholders to the KG further enhanced the model’s overall performance. This improvement is likely attributable to the model’s enhanced ability to leverage intercompany information.

The experimental results indicate that incorporating more meaningful information into the KG enhances the predictive capability of the model. The experiments also suggest that major institutional shareholders tend to influence market fluctuations in shallow-plate (small and volatile) financial markets. Consequently, this study expands upon this concept by employing heterogeneous KG approaches to learn the interrelationships among stocks, using corporate governance information relevant to major institutional shareholders as well as finance and economics-related factors. The trained KG embeddings are then integrated into the research framework to further augment its predictive abilities.

Fig. 3 System overview of the proposed model

3.2 System overview

This study presents a novel and intelligent method for portfolio selection based on momentum portfolio selection, as illustrated in Fig. 3. The framework incorporates four types of data: time series data on individual stock prices, data on overall stock prices, information on stock correlations, and information on company operations represented by a KG. When extracting information for an individual stock, the system also gathers relevant data for other stocks. Each set of training data includes two individual stocks, their correlation with other stocks, the overall stock prices, and KG relations.

To process this data, a DNN is employed for analyzing time series stock price and stock correlation data. A one-dimensional convolutional neural network (1DCNN) is used to reduce the dimensionality of overall stock price data and KG embeddings. The neural network training is based on the concept of RankNet. It processes the input data, and the final score is obtained by subtracting the values of the two samples at the subtraction stage. Finally, stocks are ranked according to these scoring results.
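The pairwise scoring step can be sketched as follows. This is a minimal illustration of the standard RankNet formulation, in which the difference between two scores is passed through a sigmoid and trained with binary cross-entropy; the function names, the small `eps` guard, and the sample scores are hypothetical, and the scores stand in for the outputs of the feature-extraction networks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ranknet_probability(score_i, score_j):
    """Probability that stock i should rank above stock j (sigmoid of the score difference)."""
    return sigmoid(score_i - score_j)


def pair_label(return_i, return_j):
    """Target for a sampled pair: 1 if stock i realized the higher return, else 0."""
    return 1 if return_i > return_j else 0


def pairwise_loss(score_i, score_j, label, eps=1e-12):
    """Binary cross-entropy on the pairwise probability."""
    p = ranknet_probability(score_i, score_j)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))


def rank_by_score(scores):
    """scores: dict mapping stock name to model score; returns names, best first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for three stocks.
scores = {"A": 0.1, "B": 0.9, "C": -0.3}
print(rank_by_score(scores))
print(pairwise_loss(scores["B"], scores["A"], pair_label(0.04, 0.01)))
```

Because only the score difference enters the loss, the model learns a relative ordering rather than absolute return levels, which is what the final ranking step requires.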

3.3 KG Construction

As previously mentioned, our framework employs KG embedding to encapsulate structured information about various entities and their interrelations. This section aims to present a comprehensive overview of the KG embedding techniques used in our proposed framework, encompassing definitions, notations, and fundamental concepts.

Knowledge graphs have evolved from semantic networks, which initially served to represent knowledge in a graph format. In semantic networks, nodes represent concepts and edges represent the relationships between these concepts. The collection of all entities, denoted as \(E=\{e_1, \ldots ,e_{|E|}\}\), and the relationships between these entities, identified as \(R=\{r_1, \ldots ,r_{|R|}\}\), together constitute the knowledge base, represented as \(KG=\{E,R\}\). Additionally, a set of triplets \(S=\{(h, r, t)\}\) is generated, where h, r, and t denote the head entity, relation, and tail entity, respectively. In this study, attention was directed towards seven types of entities and eight types of relations. The seven types of entities are listed as follows:

  1. Company: This study includes 822 companies listed on the Taiwan stock market.

  2. Shareholder code: The identification codes of major institutional shareholder members for each of the 822 companies were obtained from the TEJ database.

  3. Shareholder group name: The shareholder group code and name.

  4. Address: The city where each company is located.

  5. Securities firm: The firm entrusted by each target company to handle securities sales.

  6. Accounting firm: The accounting firm tasked with managing the financial affairs of each target company.

  7. Industry category: The category to which each target company belongs.

The eight types of relations can be divided into two categories: those between companies and those related to shareholders. These relations are defined as follows:

  1. Identity: This relation between the company and shareholder code identifies the shareholders in the target company.

  2. Control category: This relation describes the ownership classification of shareholders, which includes five categories: ultimate controllers, managers, group managers, friendly groups, and external parties.

  3. Entrusted entity of: This relation between the company and accounting firm describes the accounting firm engaged by the target company.

  4. Sale of: This relation between the company and securities firm indicates the securities firm that lists the target company’s stock for sale.

  5. Located in: This relation describes the city in which the company is located.

  6. Industry of: This relation between the company and industry category denotes the industry to which the company belongs.

  7. Shareholder of: This relation between the company and shareholder code denotes the shareholders of the company.

  8. Represent: This relation delineates the different shareholder codes within a shareholder group name.

The relationships within the proposed heterogeneous KG stem from the aforementioned seven entities and eight relationships. These relationships are depicted in Fig. 4.

Fig. 4 Diagram of our heterogeneous knowledge graph

Table 3 KG triplet data used as input to the proposed method

Table 3 presents the abstract codifications of these entities and relations on the basis of their definitions. The first column describes the entities and relations as obtained from the data set, which are then mapped to corresponding triplets in the second column. These triplets help us learn the embeddings in the KG for each entity and relation, which is pivotal for the tasks discussed in the subsequent sections.
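To make the triplet representation concrete, the sketch below indexes a handful of (head, relation, tail) triplets and scores them with a translation-based distance. The text does not state which embedding model is used, so the TransE-style score here is an assumption chosen only for illustration; all entity names, the embedding dimension, and the random initialization are hypothetical, and no training loop is shown.

```python
import math
import random

# Hypothetical triplets in (head, relation, tail) form.
triplets = [
    ("company_A", "industry_of", "semiconductors"),
    ("company_B", "industry_of", "semiconductors"),
    ("company_A", "located_in", "Hsinchu"),
    ("company_A", "shareholder_of", "holder_001"),
    ("company_B", "shareholder_of", "holder_001"),
]


def build_index(triplets):
    """Map every entity and relation name to an integer id."""
    entities = sorted({h for h, _, _ in triplets} | {t for _, _, t in triplets})
    relations = sorted({r for _, r, _ in triplets})
    return ({e: i for i, e in enumerate(entities)},
            {r: i for i, r in enumerate(relations)})


def init_embeddings(count, dim, rng):
    """Random low-dimensional vectors; a real system would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(count)]


def transe_distance(h_vec, r_vec, t_vec):
    """TransE-style score (assumed model): Euclidean norm of h + r - t; lower is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(h_vec, r_vec, t_vec)))


entity_id, relation_id = build_index(triplets)
rng = random.Random(0)
entity_emb = init_embeddings(len(entity_id), 8, rng)
relation_emb = init_embeddings(len(relation_id), 8, rng)


def score(head, relation, tail):
    return transe_distance(entity_emb[entity_id[head]],
                           relation_emb[relation_id[relation]],
                           entity_emb[entity_id[tail]])


print(score("company_A", "industry_of", "semiconductors"))
```

After training, companies that share an industry or a shareholder would be expected to lie close together in the embedding space, which is the property the downstream ranking model draws on.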

3.4 Training process and detail

Before the RankNet framework is applied, our strategy selects pools of stocks that have exhibited the best and worst returns over a certain period for constructing the investment portfolio. The process starts by gathering the monthly returns of all stocks for the past 24 months. From this data, the top and bottom 10% of stocks are identified based on their returns during the portfolio formation period. Operational and financial data for these companies is then acquired, which includes stock prices, correlation coefficients, and KG embeddings containing heterogeneous information.
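The formation step can be sketched as below. The sketch assumes that formation-period performance is measured by compounding the monthly returns, which the text does not specify; the function names, the `fraction` parameter, and the synthetic 20-stock universe are hypothetical.

```python
def cumulative_return(monthly_returns):
    """Compound a sequence of monthly returns into one formation-period return (assumed measure)."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth - 1.0


def select_winners_losers(returns_by_stock, fraction=0.10):
    """Return (winners, losers): the top and bottom `fraction` of stocks by formation-period return."""
    ranked = sorted(returns_by_stock,
                    key=lambda s: cumulative_return(returns_by_stock[s]),
                    reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Hypothetical universe: 20 stocks, 24 monthly returns each.
returns_by_stock = {f"S{i:02d}": [0.001 * i] * 24 for i in range(20)}
winners, losers = select_winners_losers(returns_by_stock)
print(winners, losers)
```

The stocks in the two pools are the ones for which prices, correlation coefficients, and KG embeddings are subsequently gathered.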

The study uses four types of input data: individual sequential stock prices, overall sequential stock prices, correlation coefficients, and KG embeddings. The RankNet method is applied to rank the stocks and to construct an optimal stock portfolio based on the model’s ranking ability. The four types of input data, together with the ranking method and the prediction target, are defined as follows:

  • Individual sequential stock price (IdvSP): The daily stock prices of the target company over the past year are used to obtain the return:

    $$\begin{aligned} r_{t} = \frac{SP_{t}-SP_{t-1}}{SP_{t-1}} \end{aligned}$$
    (1)

    where t represents the current time, SP represents the stock price, and \({t-1}\) represents the time on the previous day.

  • Overall sequential stock price (AllSP): The daily stock prices of the target company and other companies over the past year.

  • Correlation coefficient (CC): The CC formula uses the Pearson correlation coefficient to obtain the correlation between the target company and other companies, where the monthly returns over the past 6 months are used as the comparison values:

    $$\begin{aligned} CC = \frac{n(\sum xy)-(\sum x)(\sum y)}{\sqrt{[n(\sum x^{2})-(\sum x)^{2}][n(\sum y^{2})-(\sum y)^{2}]}} \end{aligned}$$
    (2)

    where n represents the number of paired observations (here, the monthly returns of the two stocks), x and y represent the returns of the two stocks being compared, and \({{\sum xy}}\) represents the sum of the products of the paired values. \({{\sum x}}\) and \({{\sum y}}\) represent the sums of the x and y values, and \({{\sum x^{2}}}\) and \({{\sum y^{2}}}\) represent the sums of the squared x and y values.

  • Knowledge graph embedding (KGE): This entails learning low-dimensional representations of entities and relations in a KG while preserving semantic meaning.

  • RankNet: Our approach randomly samples two stocks and sorts them based on the size of the predicted output result.

  • Target: This paper proposes a multi-task framework model that integrates a RankNet output for ranking two samples with an additional output for predicting future trends. Monthly returns serve as the target variable for both ranking and trend prediction tasks.
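Equations (1) and (2) can be checked with a short sketch that follows the formulas term by term. The function names and sample prices are hypothetical; only the Python standard library is used.

```python
import math


def simple_returns(prices):
    """Eq. (1): r_t = (SP_t - SP_{t-1}) / SP_{t-1} for each consecutive pair of prices."""
    return [(prices[t] - prices[t - 1]) / prices[t - 1] for t in range(1, len(prices))]


def pearson_cc(x, y):
    """Eq. (2): Pearson correlation coefficient computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical price series for two stocks.
target_returns = simple_returns([100.0, 102.0, 101.0, 105.0, 104.0, 108.0, 110.0])
other_returns = simple_returns([50.0, 51.5, 50.5, 52.0, 52.5, 53.0, 55.0])
print(pearson_cc(target_returns, other_returns))
```

Note that the denominator is zero when either series is constant, so a production version would need to guard against that case.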

For IdvSP and CC, a three-layer DNN [64, 32, 16] is utilized for processing, whereas for AllSP and KGE, a three-layer 1DCNN [64, 32, 16] is employed for dimensionality reduction to obtain embeddings. During training, to ensure that data values are neither excessively high nor low, stock prices are standardized using (1) to scale values within [0, 1]. The model incorporates various heterogeneous data as input features, resulting in complex data characteristics prone to overfitting issues. To address this, two parameter settings were introduced: Dropout and Early Stopping. In order to mitigate overfitting during training, a Dropout rate of 0.5 is employed. Furthermore, "Early Stopping" is implemented to monitor the performance on the validation set and halt training prematurely in case of performance abnormalities. This measure also aids in preventing overfitting during training. In determining the learning rate value, rates of 0.1, 0.01, 0.001, and 0.0001 were experimented with. The results revealed that excessive learning rates led to unstable training, while rates that were too small resulted in the model’s slow convergence. Based on these findings, a learning rate of 0.001 was chosen for subsequent experiments in this paper.
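The early-stopping rule described above can be sketched independently of any DL library. The `patience` and `min_delta` parameters are assumptions, since the text does not report the values used; the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improvement stalls after the second epoch.
stop_epoch, best_epoch = early_stopping_epoch([0.70, 0.65, 0.66, 0.67, 0.68, 0.60])
print(stop_epoch, best_epoch)
```

In this example training halts before the final, lower loss is reached, which illustrates the trade-off: a small patience saves computation but can stop before a late improvement.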

Regarding the batch size configuration, sizes of 64, 128, 256, and 512 were tested. During the testing process, it was observed that smaller batch sizes led to slow convergence on complex features, while larger ones required more memory and computational resources. The study also found that the accuracy obtained with batch sizes of 128 and 256 was very similar, whereas the computational resource consumption of a batch size of 512 was significant. Therefore, subsequent research adopted a batch size of 256 for consistency. For the optimizer, tests were conducted on “SGD”, “Adagrad”, “RMSprop”, and “Adam”. After multiple rounds of iterative testing, “Adam” consistently delivered stable training results, surpassing “SGD” and “Adagrad”, both of which exhibited unstable loss functions and poor prediction accuracy during training. While “RMSprop” mostly achieved good results, its performance fluctuated. The adaptive adjustment of learning rates for each parameter in “Adam” made it the optimal choice, consistently producing favorable outcomes after numerous tests.

Table 4 Number of layers, configuration settings, and feature quantities in the neural network

Concerning activation functions, the model used “Sigmoid”, “ReLU”, and “Linear”. “ReLU”, which truncates negative values to zero, was suitable for the training data. “Sigmoid”, which maps inputs to values between 0 and 1, was employed in the RankNet training framework. “Linear” was used for the final output in the framework. The dying ReLU problem can be mitigated by avoiding a high learning rate and a large negative bias; the learning rate of 0.001 adopted in this paper therefore also helps to reduce the occurrence of this phenomenon. The remaining parameters are configured as follows: loss = “binary crossentropy” and metrics = “accuracy”. The detailed architecture of the network, including the number of layers, configuration settings, and feature quantities, is presented in Table 4. Subsequent sections further describe the research procedure, and detailed pseudocode is presented in Algorithm 1.

After the specified parameter settings are applied and the model is calibrated on the necessary financial data, the experimental results can be reproduced. To ensure stability and reliability, several training iterations are typically performed. Certain financial data features incorporated into the model may require authorization from the relevant institutions, which may impose minor limitations on data acquisition. Once the required data features and parameter settings are in place, the model is trained multiple times, typically around five. The results of these runs closely align with the data presented in the paper.

Algorithm 1 Training process and details of the proposed method

3.5 Market anomalies

Market anomalies frequently occur in real financial markets and are typically categorized into three main types: time series anomalies, cross-sectional anomalies, and other market pricing anomalies. Time series anomalies refer to market volatility that fluctuates predictably over time. For example, the January effect, a calendar anomaly also known as the turn-of-the-year effect, is often associated with additional stock market rises in January. Weekly patterns also exist, such as the tendency for stocks to exhibit greater movement on Fridays than on Mondays, with the market often closing higher on Fridays. Tung et al. [40] studied the constituent stocks of the S&P 500 index to explore this anomaly. They focused on the phenomenon where the daily return on Monday is typically lower than on the preceding Friday, known as the “week effect.” Their study improves upon traditional trend forecasting and observation of buying and selling signals by applying AI methods and time series modeling to identify factors that accentuate daily effects. A predictive model for the occurrence of day-of-the-week effects is constructed through a five-stage experimental process. The results indicate that the constituents of the S&P 500 index exhibit a day-of-the-week effect.

Another anomaly is the momentum and overreaction anomaly, where the market tends to overreact, leading to stock prices continuing their prior trends. Jegadeesh and Titman [9] proposed the momentum effect, which suggests that stocks will persist in their recent directional trend. Stocks exhibiting higher recent returns are therefore likely to maintain their performance going forward. Conversely, the reversal effect suggests that stocks will move opposite to their original trend in the future, marking a turning point and a contrarian response.

In addition, anomalies in financial markets can be detected using financial time series data. Cheong et al. [41] introduced the spatiotemporal convolutional neural network-based relational network (STCNN-RN) to capture intricate correlations among multiple financial time series data sets and detect anomalies. To identify outliers in specific companies, genetic algorithms are employed. Studies have often fallen short in providing comprehensive explanations of these anomalies to investors. In this study, an interpretability model is employed to shed light on the timing of these company-specific anomalies and uncover the critical factors contributing to them. Empirical evidence from our experiments supports the efficacy of the proposed model in modeling various financial time series data sets and accurately detecting anomalies within related firms.

3.6 RankNet construction

RankNet is a neural network model proposed by Microsoft Research in 2005 for learning-to-rank. It transforms the ranking problem into a binary classification problem, aiming to predict the probability that one item is ranked higher than another. RankNet calculates the probability of the correct order between two documents and thereby simplifies the process for a sequence of documents by requiring probability calculations only for pairs of adjacent documents, reducing computational complexity. RankNet, which has been employed in various applications such as search engines, recommendation systems, and stock portfolio selection, consists of a multilayer perceptron (MLP) with two inputs and one output. The inputs are feature vectors representing two items to be compared, and the output is the probability that the first item ranks higher than the second.

The RankNet architecture requires defining two types of relevance probabilities for ranking: the predicted relevance probability and the true relevance probability. These probabilities are given by (3) and (4), respectively.

$$\begin{aligned} P_{ij} = P(U_{i}>U_{j})=\frac{1}{1+e^{-\sigma (s_{i}-s_{j})}} \end{aligned}$$
(3)
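$$\begin{aligned} \overline{P}_{ij}=\frac{1}{2}(1+S_{ij}) \end{aligned}$$
(4)

\({U_{i}}\) and \({U_{j}}\) represent any sample pair. The parameter \({\sigma }\) shapes the sigmoid function. \({s_{i}}\) and \({s_{j}}\) are the respective scores output by the model. \({S_{ij}}\) is the true pairwise label, which in the original RankNet formulation takes the value 1 if \({U_{i}}\) should rank above \({U_{j}}\), -1 if it should rank below, and 0 if the two are tied.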
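The two probabilities in (3) and (4) can be sketched in a few lines of Python. The function names are illustrative, and the sketch assumes, as in the original RankNet formulation, that the pairwise label in (4) takes values in {-1, 0, 1}; the default `sigma=1.0` is likewise an assumption, as the paper does not report its value.

```python
import math


def predicted_prob(s_i, s_j, sigma=1.0):
    """Eq. (3): predicted probability that item i ranks above item j."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_i - s_j)))


def target_prob(label_ij):
    """Eq. (4): map a pair label in {-1, 0, 1} to a probability in {0, 0.5, 1}."""
    if label_ij not in (-1, 0, 1):
        raise ValueError("label_ij must be -1, 0, or 1")
    return 0.5 * (1 + label_ij)
```

Equal scores give a predicted probability of 0.5, and swapping the two items yields the complementary probability, which is the symmetry the pairwise formulation relies on.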

RankNet evaluates the quality of a ranking by analyzing the relative relationships among multiple documents. The quality is higher when fewer pairs have incorrect relative relationships; a pair is incorrect when the model ranks \({U_{i}}\) ahead of \({U_{j}}\) despite the true label indicating the opposite. RankNet aims to minimize the number of such incorrect pairs. When formulated as a cost function, RankNet incorporates the notion of probability, focusing on the likelihood P that \({U_{i}}\) is ranked ahead of \({U_{j}}\), and it aims to minimize the difference between this predicted probability and the true probability. Finally, RankNet employs the cross-entropy cost function to quantify the level of fitting. The cost function is given in (5).

$$\begin{aligned} C = -\overline{P}_{ij}\log P_{ij}-(1-\overline{P}_{ij})\log (1-P_{ij}) \end{aligned}$$
(5)
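As a small worked illustration of the cost in (5), the sketch below evaluates the standard cross-entropy form, in which the second logarithm is taken of one minus the predicted probability. The function name and the clipping constant `eps` are assumptions added for numerical safety and are not part of the paper.

```python
import math


def ranknet_cost(p_pred, p_true, eps=1e-12):
    """Cross-entropy between the true and predicted pair probabilities."""
    # Clip the prediction so that log(0) can never be evaluated.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -p_true * math.log(p) - (1.0 - p_true) * math.log(1.0 - p)
```

For a pair whose true probability is 1, an uninformative prediction of 0.5 costs log 2, and the cost shrinks as the predicted probability moves toward 1.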

To maintain consistency, RankNet ensures that if \({U_{i}}\) is more relevant than \({U_{j}}\), and \({U_{j}}\) is more relevant than \({U_{k}}\), then \({U_{i}}\) should also be more relevant than \({U_{k}}\). The true probability of \({U_{i}}\) versus \({U_{k}}\) is calculated using the true probabilities of \({U_{i}}\) versus \({U_{j}}\) and \({U_{j}}\) versus \({U_{k}}\), as expressed in (6). The cross-entropy loss function is utilized during training by RankNet to minimize the difference between the predicted and actual rankings.

$$\begin{aligned} \overline{P}_{ik} = \frac{\overline{P}_{ij}\overline{P}_{jk}}{1+2\overline{P}_{ij}\overline{P}_{jk}-\overline{P}_{ij}-\overline{P}_{jk}} \end{aligned}$$
(6)
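The consistency relation in (6) can be checked numerically. The sketch below, with illustrative function names, computes the combined probability and compares it with an equivalent view: adding the log-odds of the two known pairs and mapping the sum back through the sigmoid.

```python
import math


def combine_targets(p_ij, p_jk):
    """Eq. (6): probability for the pair (i, k) implied by (i, j) and (j, k).

    The denominator equals p_ij * p_jk + (1 - p_ij) * (1 - p_jk), so it is
    zero only for the contradictory inputs (0, 1) and (1, 0).
    """
    numerator = p_ij * p_jk
    denominator = 1.0 + 2.0 * p_ij * p_jk - p_ij - p_jk
    return numerator / denominator


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))
```

Two uninformative pairs (0.5 and 0.5) combine to 0.5, two certain pairs combine to 1, and two confident pairs such as 0.8 and 0.7 combine to a value above either input, which matches the intuition behind transitive preferences.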

4 Experimental Results

This chapter introduces the dataset and experimental design employed in this study. It then proceeds to compare various baseline methods, including traditional random choice, momentum-based, dual-classifier model, MLP, and the RSR model. Finally, the effectiveness of different methods is evaluated using performance indicators.

4.1 Dataset and experimental setup

Data from the TEJ investment database were analyzed, with a focus on companies listed on the Taiwan stock market. As of 2022, there were 971 listed companies in the Taiwan stock market. However, companies that delisted during the study period or had considerable missing data were excluded, leaving a total of 822 stocks for analysis. The period covered by the research data is from January 1, 2015, to December 31, 2022. A rolling window approach was employed, utilizing the past year’s data as learning features to predict the monthly return for the subsequent month. This approach was applied in both RankNet and future trend forecasting. The data pertain to closing stock prices, company operations, major institutional shareholders, and inter-company correlations. The closing price is used to calculate the return. The operational and shareholder information includes the company ticker (Tic), shareholder code (denoting identity and control), shareholder group name, address, securities firm, accounting firm, and industry category. This information contributes to the construction of a KG, and the monthly returns of each stock over the past 6 months are used to determine the correlation between stocks.
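The correlation between two stocks based on six monthly returns can be sketched as follows. The paper does not name the correlation measure, so the Pearson coefficient is an assumption here, and the return values are made up purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long return series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly returns for two stocks over the past 6 months.
stock_a = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
stock_b = [0.04, -0.02, 0.06, 0.00, 0.02, -0.04]  # exactly twice stock_a
cc = pearson(stock_a, stock_b)
```

Because `stock_b` is a positive multiple of `stock_a`, the coefficient is 1 up to floating-point error; a sign-flipped series would give -1.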

The data are divided into four categories for input into the neural network, which utilizes the rectified linear unit (ReLU) activation function. The details of each type of input data are as follows:

  • Individual sequential stock price (IdvSP): The return of the target company in the past year is obtained, and a DNN [64, 32, 16] is used to extract features and generate an embedding after dimensionality reduction.

  • Correlation coefficient (CC): The correlation between the target company and other companies is obtained, using the monthly return in the past 6 months as a comparison value. A DNN [64, 32, 16] is used to reduce dimensionality and generate an embedding.

  • Overall sequential stock price (AllSP): The return of the target company and other companies in the past year is obtained, and a 1DCNN [64, 32, 16] is used to extract features, followed by generating an embedding after dimensionality reduction.

  • Knowledge graph embedding (KGE): Information such as Tic, shareholder code (denoting identity and control), shareholder group name, address, securities firm, accounting firm, and industry is used to construct a KG. This graph illustrates the relationships between all companies and yields embeddings that represent these relationships. For the company with Taiwan stock market Tic 1101.tw, Table 5 lists the top 10 records used. “Shareholder code” is the identifier for major institutional shareholders. “Identity” denotes the role of these shareholders within the company. “Control” is categorized as follows: “A” for ultimate controllers, exerting the most significant influence over the company’s management and resource allocation decisions; “B” for managers; “C” for group managers; “L” for friendly groups, encompassing blood relations, marital relations, partnerships, mergers, divisions, or state ownership; and “X” for external individuals. “Shareholder group name” identifies the group to which the major institutional shareholders belong. “Address” denotes the city where the company is located. “Securities firm” identifies the entrusted securities firm, and “Accounting firm” refers to the appointed accounting firm handling the company’s financial affairs. “Industry” specifies the company’s business category. Taking the company with Tic 1101.tw as an example, the shareholder code “P000160025” indicates independent directors and supervisors in 1101.tw. For the same period, the same code “P000160025” also appears in this role for 2535.tw and 2617.tw. This study incorporates these relationships into the training model to enhance predictive performance by leveraging the interlinkages among individual stocks.
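The way shared attributes link companies in the KG can be sketched as follows. This is a simplified illustration, not the paper's KG construction: only the shareholder code “P000160025” and the three tickers that share it come from the text, while the industry rows, the ticker 9999.tw, and the function name are made-up placeholders.

```python
from collections import defaultdict
from itertools import combinations

# (company ticker, relation, value) records. Only the shareholder-code rows
# are taken from the text; the industry rows are made-up placeholders.
records = [
    ("1101.tw", "shareholder_code", "P000160025"),
    ("2535.tw", "shareholder_code", "P000160025"),
    ("2617.tw", "shareholder_code", "P000160025"),
    ("1101.tw", "industry", "INDUSTRY_A"),
    ("9999.tw", "industry", "INDUSTRY_B"),
]


def shared_attribute_edges(rows):
    """Link every pair of companies that share the same (relation, value)."""
    members = defaultdict(set)
    for ticker, relation, value in rows:
        members[(relation, value)].add(ticker)
    edges = set()
    for (relation, _value), tickers in members.items():
        # Sorting makes each undirected edge appear once, in a stable order.
        for a, b in combinations(sorted(tickers), 2):
            edges.add((a, relation, b))
    return edges


edges = shared_attribute_edges(records)
```

Here the three companies sharing the shareholder code become pairwise connected, while attributes held by a single company produce no edge. A KG embedding method would then be trained on such edges; that step is outside this sketch.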

Table 5 The company and director-supervisor information for 1101.tw

After obtaining the four mentioned input data types, samples are randomly generated for input into the RankNet architecture. RankNet aims to sort two randomly selected stocks from the samples based on the magnitude of the prediction outcome. The target of this study is to predict the monthly return for the next month.

4.2 Performance metrics

Several key indicators, including the Sharpe ratio, Calmar ratio, maximum drawdown (MDD), and Sortino ratio, are utilized in this study to evaluate performance.

The Sharpe ratio is a measure of the amount of excess return earned for each unit of risk. It is commonly used to evaluate the performance of long-term investments such as funds and asset allocations. The Sharpe ratio is defined as follows:

$$\begin{aligned} \text {Sharpe Ratio} = \frac{R_{p}-R_{f}}{\sigma _{p}} \end{aligned}$$
(7)

\({R_{p}}\) and \({R_{f}}\) represent the portfolio’s return and the risk-free rate, respectively. \({\sigma _{p}}\) represents the standard deviation of the portfolio’s excess return.

The Calmar ratio quantifies the relationship between returns and the maximum drawdown, which is calculated by dividing the annualized rate of return by the historical maximum drawdown. A higher Calmar ratio indicates more favorable fund performance. The formula for the Calmar ratio is as follows:

$$\begin{aligned} \text {Calmar Ratio} = \frac{R_{p}-R_{f}}{MDD} \end{aligned}$$
(8)

\({R_{p}}\) and \({R_{f}}\) represent the portfolio’s return and the risk-free rate, respectively. MDD represents the maximum drawdown.

The maximum drawdown measures the largest decline in an account’s net value from its peak. It represents the worst possible scenario for an investment entered at any point in time.
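The maximum drawdown and the Calmar ratio built on it can be computed as in the following sketch. The function names and the net-value path are made up for illustration, and net values are assumed to be strictly positive.

```python
def max_drawdown(net_values):
    """Largest peak-to-trough decline, expressed as a fraction of the peak."""
    if not net_values:
        raise ValueError("net_values must not be empty")
    peak = net_values[0]
    worst = 0.0
    for value in net_values:
        peak = max(peak, value)          # highest net value seen so far
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(annual_return, risk_free, mdd):
    """Excess annual return divided by the maximum drawdown."""
    if mdd == 0:
        raise ValueError("maximum drawdown is zero; ratio is undefined")
    return (annual_return - risk_free) / mdd


# Made-up net-value path: peak 120, later trough 90, so MDD = 30 / 120 = 0.25.
nav = [100.0, 120.0, 90.0, 110.0, 130.0]
mdd = max_drawdown(nav)
```

With this path, an annual return of 15% and a risk-free rate of 1% would give a Calmar ratio of 0.14 / 0.25 = 0.56.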

The Sortino ratio evaluates a portfolio’s performance and stability in fund investment or asset allocation. It quantifies the amount of return achievable per unit of risk taken.

$$\begin{aligned} \text {Sortino Ratio} = \frac{R_{p}-R_{f}}{\sigma _{d}} \end{aligned}$$
(9)

\({R_{p}}\) and \({R_{f}}\) represent the portfolio’s return and the risk-free rate, respectively. \({\sigma _{d}}\) represents the standard deviation of negative returns.

The main difference between the Sortino ratio and the Sharpe ratio lies in how they account for risk. The Sortino ratio focuses solely on downside risk, measuring the excess return per unit of downside risk (represented by the standard deviation of negative returns). This reflects the return earned for each unit of risk resulting in a loss. Conversely, the Sharpe ratio considers both the downside and upside risk and quantifies the additional return per unit of total risk, as indicated by the standard deviation of all returns.
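The difference between the two ratios can be made concrete with a short sketch. Several choices here are assumptions rather than the paper's settings: per-period returns are used without annualization, the sample standard deviation (n - 1 denominator) is used, and downside risk is taken literally as the standard deviation of the negative returns, as described above. Other conventions for downside deviation exist.

```python
import math


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return per unit of total volatility."""
    excess = [r - risk_free for r in returns]
    return (sum(excess) / len(excess)) / sample_std(excess)


def sortino_ratio(returns, risk_free=0.0):
    """Mean excess return per unit of volatility of the negative returns."""
    excess = [r - risk_free for r in returns]
    negative = [r for r in returns if r < 0]
    return (sum(excess) / len(excess)) / sample_std(negative)


# Made-up per-period returns, for illustration only.
returns = [0.02, -0.01, 0.03, -0.03, 0.04]
sharpe = sharpe_ratio(returns)
sortino = sortino_ratio(returns)
```

For this series the Sortino ratio exceeds the Sharpe ratio, because the two losses vary less than the full set of returns, so only part of the total volatility counts as risk.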

4.3 Performance evaluation

In this section, various baseline methods are compared over a rolling test period from 2018 to 2021. These methods include the random choice method, traditional momentum-based method, dual-classifier model, MLP, RSR, and our proposed method.

Before comparing these methods, stocks exhibiting momentum are identified. Tests at various time intervals, ranging from the 6th to the 24th month at 6-month intervals, are conducted for the formation and holding periods in momentum trading. The results reveal that profitability is highest with 24-month formation and holding periods; thus, these periods were used in subsequent experiments. Next, the concepts and practices of the different baseline methods are outlined. In the random choice method, 10% of stocks are randomly selected from either the positive or negative monthly return stock pools. Decisions to assume long or short positions are based on positive or negative returns, respectively, and performance is evaluated based on daily returns over a 24-month holding period. In the traditional momentum-based method, with a 24-month formation period used as a reference, individual stock returns are ranked. Subsequently, the top and bottom 10% of returns are selected to form stock pools. Long positions are taken based on positive returns, whereas short positions are based on negative returns. Finally, the performance of the investment portfolio is evaluated using daily returns over a 24-month holding period. The distinction between this method and the random choice method lies in the selection criteria: the random choice method involves selecting stocks randomly from the stock pool, whereas the momentum-based method involves selecting the best and worst-performing stocks from the pool to create long and short investment portfolios.
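The selection step of the traditional momentum-based method can be sketched as follows. The function name, the tickers, and the return values are hypothetical; only the ranking by formation-period return and the 10% cut-off follow the description above.

```python
def momentum_pools(formation_returns, fraction=0.10):
    """Rank stocks by formation-period return; return (long_pool, short_pool).

    The long pool holds the best performers and the short pool the worst,
    each containing `fraction` of the stocks (at least one).
    """
    ranked = sorted(formation_returns, key=formation_returns.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Made-up formation-period returns for 20 hypothetical tickers S00..S19.
formation = {f"S{i:02d}": (i - 10) / 100 for i in range(20)}
long_pool, short_pool = momentum_pools(formation)
```

With 20 stocks and a 10% cut-off, each pool contains two tickers: the two highest returns go to the long pool and the two lowest to the short pool. The random choice baseline would instead draw the same number of stocks at random.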

The dual-classifier model involves constructing a dual-classifier through the application of the momentum contrarian strategy. The procedure for selecting the top and bottom performing stocks mirrors that of the previous baseline methods. After the initial long-short investment portfolio is established, candidates for stock selection are chosen based on past contrarian trading results and future trend persistence as defined by binary targets using monthly return as a criterion. The classifier, which utilizes support vector machines (SVM), is tested. After the dual-classifier is trained, its results inform adjustments to the investment portfolio. In the MLP method, the SVM approach of the dual-classifier model is replaced with a four-layer neural network [32, 16, 8, 1]. The sigmoid activation function is used for the first layer, and the monthly return is the target. Stocks are sorted based on the predicted results, and the top and bottom 10% are selected to construct the investment portfolio.

The final baseline method, RSR, involves inputting sequential historical prices into LSTM and incorporating KG embedding similar to the proposed method. The experimental settings follow the guidelines of the original study, with a historical sequence length of 4 for features, 64 hidden units in LSTM, and “leaky_relu” as the final layer’s output activation, with a learning rate of 0.001. After the model makes its predictions, the best and worst-performing stock pools are selected for portfolio evaluation. For a conceptual understanding of the RSR model, please refer to Section 3.1.

As for the proposed method, consistent with the intervals used in the baseline methods, this experiment extracts monthly returns from T to T-24 as the formation period to screen the top and bottom 10% of stock pools and identify candidate stocks. The training data for these candidates include daily returns for the past 250 days, KG embedding after dimensionality reduction, similarity between two stocks calculated from monthly returns over the past 6 months, and the daily returns of all candidate stocks. All these data are input into the training model. These heterogeneous data are processed through CNN and DNN for dimensionality reduction and embedding generation to map different data types to the same dimension, achieving the goal of combining heterogeneous data. For detailed network parameter settings, refer to Section 3.4.

In the experiments of this study, data were obtained from the TEJ database. Stocks with missing data or significant anomalies were excluded. Where minor missing or anomalous data were encountered, zero values were substituted. Table 6 presents experiments conducted over different time intervals from 2018 to 2021. Each interval included two years of training samples and one year of defined targets. To prevent overfitting caused by the limited number of training samples, a large number of samples were generated from the candidate stocks. For instance, when employing a long or short strategy with 160 candidate stocks, 160 × 159 = 25,440 samples were generated, each comprising the four types of heterogeneous data previously mentioned.
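The sample count above corresponds to the number of ordered pairs of distinct candidates, which can be verified with a short sketch. The identifiers and function names are hypothetical, and the labeling rule (comparing the two stocks' next-month returns) is an assumption about how pair targets could be defined.

```python
from itertools import permutations


def ordered_pairs(candidates):
    """All ordered (i, j) pairs with i != j, one pairwise sample per pair."""
    return list(permutations(candidates, 2))


def pair_label(return_i, return_j):
    """1 if stock i should rank above stock j, -1 if below, 0 if tied."""
    return (return_i > return_j) - (return_i < return_j)


candidates = [f"stock_{n}" for n in range(160)]  # hypothetical identifiers
pairs = ordered_pairs(candidates)
```

For 160 candidates this yields 160 × 159 = 25,440 pairs, so a pool that is small in terms of stocks still provides a sizable pairwise training set.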

Table 6 Results of different testing years in 2018-2021
Fig. 5 Comparison of the Sharpe ratio results of different periods between the baseline methods and the proposed method

Table 7 Results of the ablation study in 2020

The experimental results, as illustrated in Table 6, indicate that the random choice method, which involves randomly selecting stocks based on positive and negative returns, consistently exhibited the poorest performance each year, as indicated by the lowest Sharpe ratio values. The traditional momentum-based method, which selects 10% of stocks from positive and negative return pools to construct investment portfolios, yielded a higher Sharpe ratio than the random choice method. The dual-classifier model refined the traditional momentum-based stock selection method by applying binary classification to readjust the selected stock group’s portfolio, resulting in a higher Sharpe ratio than the traditional approach.

Using the MLP method, the test initially selected the top 10% of stocks using the traditional momentum-based approach and then applied a multilayer neural network to further adjust the investment portfolio. Except for a slight decrease in the Sharpe ratio in 2018, the performance from 2019 to 2021 surpassed that of the dual-classifier model. The RSR method, which incorporated the graph relationships between stocks, enhanced the model’s predictive capabilities. The experimental findings indicate that the RSR method outperformed the previous methods in terms of Sharpe ratio from 2018 to 2021.

Figure 5 illustrates a comparison of Sharpe ratio results across different periods between baseline methods and the proposed method. The figure demonstrates that the random choice and traditional momentum-based methods yield inferior Sharpe ratio results. Conversely, our proposed method, which incorporates heterogeneous data such as company operations data and financial domain knowledge, not only enhances the model’s ranking capability but also achieves the highest performance in terms of Sharpe ratio values.

4.4 Ablation study

This section presents an ablation analysis of our proposed method. Table 7 presents the results for the test year 2020. Four inputs, namely IdvSP, CC, AllSP, and KGE, were analyzed.

Initially, when using IdvSP and AllSP as inputs for the same period, the Sharpe ratio value is positive at 0.9703, indicating robust forecasting ability from stock price data alone. Adding the correlation CC between individual stocks reveals its effect on the model, improving predictive ability and yielding a Sharpe ratio of 1.0405. Incorporating company operations information KGE further improves the Sharpe ratio to 1.1772, indicating enhanced predictive power with additional stock information. When the model incorporates IdvSP, CC, and KGE, the Sharpe ratio reaches 1.2955, indicating improved predictive ability. Including all data types, the model achieves a Sharpe ratio of 1.5120, indicating improved predictive ability with comprehensive knowledge.

Table 8 Results of the ablation study in 2021
Fig. 6 Comparison of Sharpe ratio outcomes derived from the proposed method’s ablation study

Applying the same method, Table 8 presents the test results for the year 2021. Initially, with only IdvSP and AllSP as inputs, the Sharpe ratio is 0.9105. Including CC significantly enhances predictive power, resulting in a Sharpe ratio of 1.1442. Adding KGE further improves the model, increasing the Sharpe ratio to 1.2598. Incorporating IdvSP, CC, and KGE together leads to a strong Sharpe ratio of 1.3213. Finally, with all data types as inputs, the model’s learning and predictive abilities improve further, supporting the effectiveness of the proposed method.

Figure 6 shows the results of the ablation study for 2020 and 2021. All Sharpe ratios are positive, with the highest value achieved when all four types of information are included. This implies that the model’s learning ability and predictive capabilities improve with the incorporation of more comprehensive information.

4.5 Top-N percentage evaluation

Next, the predictive ability of our model across different top-N percentages is discussed. Experiments are conducted using 10%, 15%, and 20% of stocks for the years 2020 and 2021. The experimental results are presented in Table 9. For 2020, with 10%, 15%, and 20% of stocks included, the obtained Sharpe ratios are positive, amounting to 1.5120, 1.3991, and 1.2636, respectively. These findings indicate that our model has strong predictive ability at each tested percentage of stocks. The best results are obtained when using 10% of the stocks for prediction.

Table 9 Results of extracting the top-N percent of stocks
Table 10 Results of Different Period

For 2021, with 10%, 15%, and 20% of stocks included, the obtained Sharpe ratios are positive, amounting to 1.4304, 0.0926, and 1.1973, respectively. Again, the best results are obtained when using 10% of the stocks for prediction. Based on these results, it is concluded that the experiments using 10% of the stocks consistently delivered the best performance, establishing 10% as the baseline for all experiments in this study.

4.6 Different periods of formation and holding period results

This section examines the performance of our model over different formation and holding periods. The results are presented in Table 10. Profit performance for formation and holding periods of 6, 12, 18, and 24 months was tested for both 2020 and 2021. For 2020, the Sharpe ratio for the short-term 6-month period is negative, with a value of -0.9522. However, extending the holding period to 12 months results in a positive Sharpe ratio of 0.8575. With holding periods of 18 and 24 months, the Sharpe ratios are positive, at 0.6845 and 1.3949, respectively. Applying the same experimental design to 2021, the results indicate that the short-term 6-month Sharpe ratio is negative at -0.8322. For holding periods of 12, 18, and 24 months, the Sharpe ratios are positive, with values of 0.6261, 0.7938, and 1.4304, respectively.

Based on these findings, a limited backtesting period of only 6 months for the formation and holding periods leads to unstable profit performance. For formation and holding periods of 12, 18, and 24 months, the Sharpe ratios are consistently positive. Notably, the highest Sharpe ratio is observed when the period is set to 24 months. Therefore, for all subsequent experiments, this study uses 24 months as the baseline period.

4.7 Testing different trading markets

This study selected NASDAQ as the test market to verify the model’s performance in a different trading market. NASDAQ is a prominent U.S. stock exchange known for its technology and biotech listings. It offers a wide range of securities, from traditional stocks to innovative offerings. The research data cover the period from January 1, 2015, to December 31, 2022, divided into four time intervals for testing. The input features mirror those used in testing the Taiwan stock market, comprising daily returns from the past year, monthly returns from the previous 6 months, daily returns of other stocks over the past year, and a financial information-based KG. Notably, the KG information differs because information on boards of directors and supervisors is difficult to obtain for the U.S. stock market. Hence, data were obtained from 158 ETFs, totaling 1,652 constituent stocks. Each ETF includes multiple constituent stocks, establishing associations across multiple ETFs akin to Taiwan’s board of directors and supervisors’ information.

Table 11 Results of NASDAQ testing across different years from 2018 to 2021

The training dataset exhibits a significant correlation with the experimental results. When the dataset is segmented into different time intervals or influenced by the interactive effects of the prevailing environment, uncertainty factors emerge, leading to noticeable variations in the training outcomes. Examining Table 10 across different testing periods of 6, 12, 18, and 24 months, the experimental results reveal that 24 months yield the optimal outcomes for different testing years. Conversely, the shortest period of 6 months produces the poorest results, indicating that different segmentation intervals impact the experimental outcomes.

Tables 6 and 11 present similar findings, indicating that the selection of different data features results in varying impacts. The RSR model, as shown in the tables, utilizes only stock prices, industry classification, and relationships with upstream and downstream sectors as input features. Despite achieving positive returns, its profitability slightly lags behind that of the proposed model. This study incorporates diverse data features, allowing the model to learn more information and exhibit higher sensitivity to environmental changes. Coupled with carefully selected parameter configurations, it achieves superior predictive capabilities. Across the different intervals, the proposed model yields positive results of 1.2656, 1.3588, 1.2074, and 1.6898, which exceed those of the RSR model. This highlights that more comprehensive input features provide more assistance during training, resulting in favorable outcomes.

The experimental framework, parameter settings, and network layers mirror those used in testing the Taiwan stock market. In addition, this study includes the RSR model for comparison, utilizing the same feature parameters as described in Section 3.1. As shown in Table 11, the proposed method demonstrates commendable predictive capabilities when tested in different time intervals on the NASDAQ market.

5 Conclusion

This study proposes a novel multitask supervised learning approach using a learning-to-rank algorithm to construct a portfolio selection framework in momentum trading. This framework integrates heterogeneous data, including company operations data and major institutional shareholder details, forming a relationship chain graph. This integration, coupled with time series stock price data, aims to address a gap in the literature on intelligent trading strategies, where the focus has been on individual stock price changes.

The experimental results demonstrate the strong performance of our model in ablation analysis when all stock information is included, allowing the model to obtain more knowledge and improve its predictive ability. The obtained Sharpe ratios are favorable. In the top-N percent experiment, the Sharpe ratios for 10%, 15%, and 20% are all positive, confirming the model’s robust predictive ability at any selected percentage. The most effective results occur with 10%. Our model, when its parameters are optimized, outperforms other baseline methods for the 2018-2021 period. This model can efficiently select the top-N percent of stocks demonstrating momentum behavior from a massive stock pool and recommend these stocks for portfolio inclusion. It also offers valuable portfolio adjustment suggestions to investors.

Future studies can (1) explore different KG approaches to construct embeddings using various node measurement methods and (2) investigate the use of different learning-to-rank algorithms beyond RankNet for testing.

Our belief is that combining heterogeneous data to construct momentum trading strategies, coupled with learning-to-rank algorithms for ranking, can further enhance