1 Introduction

In the 1950s, Harry Markowitz proposed Mean-Variance Analysis and Portfolio Theory, providing a solution to the discount rate issue [1]. Following this, economists represented by Sharpe, Lintner, and Mossin built upon Markowitz's theory from various perspectives and proposed the Capital Asset Pricing Model (CAPM) [2]. In the 1970s, Fama summarized hypotheses on the random fluctuations of stock prices and the fairness of the market, leading to the Efficient Market Hypothesis (EMH) [3]. He delineated three forms of market efficiency: weak-form, semi-strong-form, and strong-form efficiency. It can be said that a group of financial economists, with Markowitz at the forefront, made significant contributions to the axiomatization of financial economics [4]. It is on the foundation of these financial theories, combined with modern computer technology, that a large number of quantitative trading models have been developed.

Portfolio management involves building, overseeing, and selecting investments that meet the long-term financial goals and risk tolerance of an investor. In essence, portfolio management can be regarded as a sequential decision-making problem: investors observe the price changes of financial assets and adjust the investment proportions of the assets in real time to achieve higher returns. How to dynamically adjust the proportions of assets throughout the investment process is the core problem to be solved. The continuous iteration of new technologies has greatly promoted research in portfolio management, from the early use of Markov models to the current application of neural networks and reinforcement learning. Neural networks are a robust machine learning methodology, particularly effective at identifying and processing patterns within large and intricate datasets. In portfolio management, they are leveraged for forecasting stock prices, evaluating asset risk, and enhancing outcomes in high-frequency trading scenarios.

Recently, reinforcement learning (RL) has been applied to portfolio management research. As illustrated in Fig. 1, an RL agent can simulate an investor, dynamically adjusting asset weights based on price changes to achieve higher returns. However, traditional RL algorithms typically use a Q-table to represent the relationship between states and actions and are therefore not sufficiently expressive for complex states. Deep reinforcement learning (DRL) [5] is a newer and more powerful technology that can be used effectively even in complex environments. In a DRL-based model, complex features from heterogeneous data are extracted using a deep learning (DL) algorithm and then input into RL algorithms. AlphaGo [6] successfully used DRL to address real-world problems. Furthermore, DRL is gaining popularity for portfolio management because its reward functions, states, and actions can be adjusted to embed complex financial constraints into the model for efficient investment trading.

Fig. 1 Reinforcement learning versus portfolio management

1.1 Related Works

The existing research on portfolio management utilizing reinforcement learning is divided into value-based and policy-based models. Enhancements can be achieved through deep networks, incorporating extra features, and employing multi-agent systems. Table 1 presents the cutting-edge models in reinforcement learning.

Table 1 A description of the state-of-the-art reinforcement learning models on portfolio management

1.1.1 Value Based Algorithms and Systems

Many studies have attempted to improve the effects of recursive reinforcement learning (RRL) to build financial trading systems. Here, RRL refers to value-based reinforcement learning with a temporally recursive update of the Q-values. Moody et al. [7] trained a financial system based on RRL and used the differential Sharpe ratio as the objective function. Jangmin et al. [8] proposed a meta-strategy based on RRL to flexibly allocate risky and risk-free assets. Bertoluzzo [9] constructed an objective function for RRL using a weighted symmetric index to control the downside risk. Maringer et al. [10] proposed the TRRL trading strategy, which configures the layers of the RRL network to achieve different investment styles. They then proposed RS-RRL [11] to choose appropriate asset weights according to price volatility for effective market adaptation. Bertoluzzo et al. [12] used the value function to implement a financial trading system and introduced temporal differences and kernel functions into it, thereby providing flexibility in setting states and actions. Du et al. [13] established buying and selling operations based on price changes; their empirical study showed that RRL had superior convergence performance.

1.1.2 Policy Based Algorithms and Systems

Sutton et al. [14] proposed the policy gradient (PG) algorithm to address the problems encountered in value-based RL methods. In contrast to value-based RL methods, policy-based methods directly optimize a parameterized policy. Eilers et al. [15] used state transfer functions in the PG algorithm to transform the transaction style, which may cause the policy to fall into local optima. Stelios et al. [16] constructed a self-adaptive fuzzy financial trading system based on the actor-critic algorithm, i.e., using a value-based function as the critic and a policy-based function as the actor, in which eight states were defined considering expected returns and volatility.

1.1.3 DRL Based Algorithms and Systems

With the development of deep neural networks (DNNs), attempts have been made to use these networks to build quantitative trading models. Deng et al. [17] reduced the uncertainty associated with financial data using fuzzy learning, extracted financial characteristics using a neural network, and proposed the FRDNN financial trading system. Building on such results, more and more RL methods employ DNNs, giving rise to DRL methods, which have been widely used in the field of portfolio management. Jiang et al. [18] proposed a DRL-based portfolio framework composed of an ensemble of identical independent evaluators, a portfolio-vector memory, and online stochastic batch learning. This framework was designed using the deep deterministic policy gradient (DDPG) [19] algorithm, in which the agent used convolutional neural network (CNN) [20], recurrent neural network (RNN) [21], and long short-term memory (LSTM) [22] networks, and the experimental results showed that the framework is suitable for the cryptocurrency market. Xiong et al. [23] built a stock portfolio system using proximal policy optimization (PPO) [24] on a deep network and achieved an integrated trading strategy ensemble [25] that can adapt to different market conditions. Liang et al. [26] employed PPO, DDPG, and PG in conjunction with a residual network. Furthermore, Liu et al. [27] developed FINRL, an automatic stock-trading library built on a variety of DRL algorithms. Since only a few studies have considered the correlation between assets, Wang et al. [28] proposed the DRL investment strategy AlphaStock, wherein a self-attention mechanism was introduced to model the relationships between assets.

1.1.4 DRL Methods with Additional Features

An increasing number of studies have attempted to optimize the agent network structure using DL models, such as the Transformer [29] and the GCN [30], to solve traditional RL issues such as the insufficient learning ability of models trained on heterogeneous data. Encoding and embedding market-event information into DRL algorithms is a new research direction in this field. Ye et al. [31] proposed a DRL framework named SARL to enhance states using the encoded features of financial news. Daiya et al. [32] proposed Trans-DiCE, which extracts financial time-series features with a self-attention layer and a causal CNN and extracts news features using a Transformer encoder. Wang et al. [33] proposed the DRL model DeepTrader, which uses a market sentiment index to balance risks and returns. In DeepTrader, a temporal convolutional network (TCN) [34] is used to obtain temporal features, a spatial attention mechanism is used to obtain short-term spatial features, and a convolutional layer is used to obtain long-term spatial features [35]. The constructed asset evaluation unit is then used to extract the relationships between assets.

1.1.5 Multi-agent Based DRL Methods

Several studies have focused on multi-agent RL in quantitative trading. Lee et al. [36] used the deep Q-network (DQN) [37] algorithm to construct MAPS, a collaborative multi-agent portfolio management system that penalizes decision similarity between agents through a global loss function. Huang et al. [38] proposed a multi-agent portfolio management system to address the problems of growing asset pools and heterogeneous data inputs. Pham et al. [39] proposed a DRL-based model using collaborative multi-agent learning to automatically construct hedging strategies. Lussange et al. [40] presented a model that considers fairness among agents.

1.1.6 Portfolio Management Methods

In addition to methods based on DRL, there are also other advanced approaches used for portfolio management.

Portfolio optimization: Some researchers explore solutions using evolutionary computation methods with global optimization capability. Yaman et al. [41] proposed a hybrid approach to the cardinality-constrained portfolio selection problem based on a nonlinear neural network and a genetic algorithm. Khan et al. [42] formulated a quantum beetle antennae search algorithm to address the constrained portfolio optimization problem. Cao et al. introduced two notable approaches aimed at tackling the challenges associated with high-frequency trading: the first uses neural networks with softmax equalization [43], while the second develops a recurrent neural network accompanied by a comprehensive theoretical analysis of its convergence properties [44].

Turnover prediction on portfolio: Turnover has been shown to have a strong influence on costs, market timing, risk management, and investment strategy. Ding et al. [45] developed novel formulas for calculating turnover and leverage in mean-variance optimal long-short market-neutral portfolios. These formulas use active weights derived from a factor-model conditional mean forecast and a conditional forecast error covariance matrix that accurately reflects the risk.

Portfolio risk prevention: Risk plays a pivotal role in portfolio management, a fact underscored by numerous studies [46]. One effective strategy for mitigating risk is the implementation of cardinality restrictions, a method proven to effectively lower risk levels [47]. When combined with neural networks, this approach not only reduces risk but also enhances potential rewards, offering a more balanced risk-reward profile [48].

1.2 Main Contributions

DRL has demonstrated desirable performance in the area of portfolio management, and several newly developed DL models have also been used to achieve superior performance in this field. However, most studies have focused mainly on improving the network structure and have not considered the impact of various other factors on model performance. In these models, the states are limited to stock quotes, such as the daily opening and closing prices. However, price is the result of internal interactions in the financial market, which makes it difficult to determine the long-term trend of a stock using the existing models; thus, these models may easily overfit owing to the high uncertainty of the investment environment. Some studies, such as SARL, Trans-DiCE, and DeepTrader, have used external information to extend the states of the DRL model, which is a feasible direction that merits deeper research. Thus, we propose a multi-agent DRL model called the Collaborative Multi-agent reinforcement learning-based stock Portfolio management System (CMPS), which incorporates financial-index features that can affect stock prices and combines them with traditional financial theory to effectively learn asset correlations. The contributions of this study are as follows:

  1. A collaborative three-agent DRL model called CMPS is proposed to address the problem of portfolio management. Each agent has a DQN structure to extract different features, and all the features are then combined with a self-attention network to achieve a comprehensive reward.

  2. In the CMPS model, a data fusion method is used to obtain additional features. Specifically, the financial reports of stocks are added as additional information and fused with real-time trading information. This combination enriches the state of the DRL agent, thereby increasing its long-term prediction ability. Compared with the financial news used in SARL and the market sentiment index used in DeepTrader, financial reports more precisely reflect the long-term trend of an asset. Moreover, the correlations between stocks are embedded based on the classic Markowitz mean-variance theory to provide the model with more information for finding the relationships among financial assets.

  3. Because portfolio management aims to prevent financial risks, we propose a risk-free asset strategy in CMPS, called CMPS-RF (Risk Free), wherein a weight is used to balance risky stocks and risk-free assets. Thus, our model can effectively avoid investment risk in the stock market under large fluctuations.

  4. We collected stock data from the China Shanghai Stock Exchange (SSE) 50 Index and generated two datasets reflecting the stable and downside situations of the stock market. Five state-of-the-art (SOTA) models were used to evaluate our models based on six financial metrics. The experimental results showed that our CMPS model outperforms the other models, and our CMPS-RF model has significant advantages in avoiding risks.

1.3 Paper Organization

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of RL. Section 3 presents our two models, CMPS and CMPS-RF. Section 4 evaluates our models with experimental results and theoretical analysis. Finally, Sect. 5 concludes the study.

2 Preliminaries

The financial portfolio can be defined as follows. Consider n financial assets, such as stocks, precious metals, and futures, denoted as \(F=\{f_{1},f_{2},\cdots ,f_{n}\}\). The prices at a certain time t are \(P^{t}=\{p_{0}^{t},p_{1}^{t},\cdots , p_{n}^{t}\}\), in which \(p_{0}^{t}\) denotes the currency, whose price is always 1 at any time. Given an initial investment capital \(C_{0}\), we can adjust the investment proportions of the assets \(W^{t}=\{w_{0}^{t}, w_{1}^{t},\cdots ,w_{n}^{t}\}\) at different times and maximize the final return \(C_{T}=C_{0}\prod _{t=1}^{T}\frac{P^{t}}{P^{t-1}}\cdot W^{t}\).
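To make this definition concrete, the following minimal sketch (Python/NumPy; the function name and array layout are illustrative assumptions, not part of our model code) computes the final capital \(C_{T}\) from a matrix of prices and a sequence of weight vectors, with column 0 acting as the cash asset whose price is always 1.

```python
import numpy as np

def cumulative_value(prices, weights, c0=1.0):
    """Final capital C_T = C_0 * prod_t (P^t / P^{t-1}) . W^t.

    prices  : (T+1, n+1) array; column 0 is cash and stays at 1.0
    weights : (T, n+1) array; row t-1 holds the proportions W^t used during period t
    """
    capital = c0
    for t in range(1, prices.shape[0]):
        price_relatives = prices[t] / prices[t - 1]          # element-wise P^t / P^{t-1}
        capital *= float(price_relatives @ weights[t - 1])   # weighted growth of the portfolio
    return capital

# toy example: cash plus two risky assets over two periods
prices = np.array([[1.0, 10.0, 20.0],
                   [1.0, 11.0, 19.0],
                   [1.0, 12.0, 21.0]])
weights = np.array([[0.2, 0.4, 0.4],
                    [0.0, 0.5, 0.5]])
print(cumulative_value(prices, weights))
```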

Reinforcement learning is well suited to simulating this process, which can be expressed mathematically as a Markov decision process (MDP). In an RL method, an agent has states and actions: it interacts with the environment under the current state, obtains a reward from the environmental feedback after performing an action, and then attains the next state. In financial portfolios, the state describes a set of characteristics of the current assets; in general, the states are the market quotes. Actions refer to the interactions between the agent and the transaction environment; in portfolio management these are buying, holding, and selling operations. The transition of the agent state is accompanied by a feedback reward \(R_{t}\). When an MDP is complete, the sum of all the discounted immediate rewards is referred to as the cumulative return (CuR), denoted as \(G_{t}\), where \(\gamma \) is the discount factor.

$$\begin{aligned} G_{t}=R_{t+1}+\gamma R_{t+2} + \gamma ^{2} R_{t+3} + \dots + \gamma ^{k} R_{t+k+1} = R_{t+1}+\gamma G_{t+1} \end{aligned}$$
(1)

\(\pi (a|s)\) is used to denote a policy in RL, which refers to the probability that an action a should be taken for a given state s.

$$\begin{aligned} \pi (a|s)=P(A_{t}=a|S_{t}=s) \end{aligned}$$
(2)

The state-value function \(v_{\pi }(s)\) is the expectation of \(G_{t}\) from the current state to the end state. Bellman [49] provided this relationship by solving a dynamic programming problem. Thus, we obtain the Bellman state value function \(v_{\pi }(s)\) as follows:

$$\begin{aligned} v_{\pi }(s)=E_{\pi }(G_{t}|S_{t}=s) \end{aligned}$$
(3)

\(q_{\pi }(s,a)\) is the state-action value function, i.e., the expected return obtained after taking action a in state s, expressed through the value of the next state \(s'\). If taking an action in the current state results in a greater return, then that action has a higher value.

$$\begin{aligned} q_{\pi }(s,a) = \sum _{s',r}p(s',r|s,a)(r+\gamma v_{\pi }(s'))=r+\gamma \sum _{s',r}p(s',r|s,a)v_{\pi }(s') \end{aligned}$$
(4)

An optimization algorithm is used to obtain the optimal state value function \(v_{*}(s)\) and the optimal state-action value function \(q_{*}(s,a)\).

$$\begin{aligned} v_{*}(s)=\max _{a}q_{*}(s,a)=\max _{a}\sum _{s',r}p(s',r|s,a)(r+\gamma v_{*}(s')) \end{aligned}$$
(5)
$$\begin{aligned} q_{*}(s,a)= & {} \sum _{s',r}p(s',r|s,a)(r+\gamma v_{*}(s')) \nonumber \\= & {} \sum _{s',r}p(s',r|s,a)(r+\gamma \max _{a'}q_{*}(s',a')) \end{aligned}$$
(6)

The Q-learning algorithm [50] has been used to solve for the optimal state-action value function and the optimal policy in the case of limited samples. The algorithm iteratively updates the states and Q-values, and the target policy chooses the action with the maximum Q-value in the next state. Moreover, Q-learning adopts an \(\varepsilon \)-greedy behavior policy, i.e., a random action is generated with probability \(\varepsilon \), thereby maintaining exploration during learning. The Q-learning algorithm updates its Q-value as follows:

$$\begin{aligned} q_{\pi }(S_{t},A_{t}) = q_{\pi }(S_{t},A_{t}) + \alpha [R_{t+1} + \gamma max_{a'}(q_{\pi }(S_{t+1},a'))-q_{\pi }(S_{t},A_{t})] \end{aligned}$$
(7)
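As a reference, the following short sketch (Python/NumPy; the variable and function names are our own) implements the tabular update of Eq. (7) together with the \(\varepsilon \)-greedy behavior policy described above.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng(0)):
    """Behavior policy: with probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One step of Eq. (7): Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```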

The DQN algorithm is a DRL method based on Q-learning and uses a DNN to approximate the action-state value function \(q_{\pi }(S_{t},A_{t})\), which can be expressed as follows:

$$\begin{aligned} q_{\pi }(S_{t},A_{t}) \approx Q(s, a; \theta ) \end{aligned}$$
(8)

Here, \(Q(s,a;\theta )\) denotes a neural network with parameter \(\theta \), and the network output is the Q-value of each action. The DQN algorithm has two neural networks: the main network \(Q(\theta )\), and the target network \(\hat{Q}(\theta ^{*})\), in which \(\theta \) and \(\theta ^{*}\) indicate the parameters of these two networks. The main network can be updated by using the loss function as follows.

$$\begin{aligned} L = (R + \gamma max_{a}\hat{Q}(s',a;\theta ^{*}) - Q(s, a; \theta ))^{2} \end{aligned}$$
(9)
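The loss of Eq. (9) can be sketched as follows (PyTorch; q_net and target_net stand for any networks mapping states to per-action Q-values, and the batch layout is an assumption made only for illustration).

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error of Eq. (9) averaged over a batch of transitions (s, a, r, s')."""
    states, actions, rewards, next_states = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():                                              # target network is not updated here
        q_next = target_net(next_states).max(dim=1).values             # max_a' Q_hat(s', a'; theta*)
    return nn.functional.mse_loss(q_sa, rewards + gamma * q_next)
```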

3 CMPS Model

3.1 CMPS Framework

In this section, we present the CMPS model, which is based on DRL and sets states, actions, and rewards specific to the portfolio environment. In the CMPS model, we design a three-agent framework to handle three kinds of data: the stock market quotes, the financial indexes, and the correlations of the stocks. The rationale behind this design is as follows: the stock market quotes are the basic information indicating the short-term trend of a stock; the financial indexes can reflect the global change of the stock market; and the stock correlations represent how one stock affects another; all these features are fused to give precise guidance to the agents. This three-agent design comprehensively considers microeconomic and macroeconomic trends, so it can learn a better policy and achieve higher returns. Each agent implements an MDP by interacting with the transaction system, and the model is trained with these interactive experiences to maximize the cumulative reward. Finally, we conduct a back-test using the trained model and evaluate the comprehensive performance of our models based on standard metrics. As shown in Fig. 2, the shared states are generated by fusing all three kinds of data and are fed into the agents; new states and rewards are then obtained by interacting with the environment at each step, and the portfolio weights, i.e., the new actions, are generated by the updated agent policies. This procedure continues until the model converges.

Fig. 2 CMPS framework

3.1.1 States, Actions, and Rewards

As shown in Fig. 3, the state \(s_{t}\) of the CMPS model is represented as \(s_{t}=(X_{t},F_{t},C_{t})\). Here, \(X_{t}\) represents the tensor of market quotes within the time range \([t-l,t]\) with shape \((M+1,l,N_{p})\), where M denotes the number of stock assets. Because the SSE index is added as an extra row to represent market-level characteristics, the total number of rows is \(M+1\); l represents the size of the time window, and \(N_{p}\) is the size of the market quote, including the opening, lowest, highest, and closing prices, as well as the yield. Note that l can be adjusted; theoretically, the larger l is, the more temporal features can be extracted, but at a higher computational cost. \(F_{t}\) is the financial index tensor of the stock portfolio in the time range \([t-l,t]\), with shape \((M,l,N_{f})\), where \(N_{f}\) denotes the number of financial-index features. \(C_{t}\) represents the correlation tensor of the stocks in the time range \([t-l,t]\) with shape \((M+1,M+1)\); here, it is expressed as a yield covariance matrix.
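For illustration, the following sketch assembles a state \(s_{t}=(X_{t},F_{t},C_{t})\) with the shapes described above (Python/NumPy; the concrete sizes and the helper name are illustrative assumptions, not taken from the original implementation).

```python
import numpy as np

M, l, N_p, N_f = 34, 10, 5, 11   # assumed sizes: 34 stocks, 10-day window, 5 quote fields, 11 indexes

def build_state(quotes, fin_index, returns_window):
    """Assemble s_t = (X_t, F_t, C_t).

    quotes         : (M + 1, l, N_p) market quotes, incl. the SSE index as the extra row
    fin_index      : (M, l, N_f)     financial indexes of the M stocks
    returns_window : (l, M + 1)      daily yields used to build the covariance matrix
    """
    X_t = np.asarray(quotes, dtype=np.float32)
    F_t = np.asarray(fin_index, dtype=np.float32)
    C_t = np.cov(returns_window, rowvar=False).astype(np.float32)   # (M + 1, M + 1) yield covariance
    assert X_t.shape == (M + 1, l, N_p) and F_t.shape == (M, l, N_f)
    return X_t, F_t, C_t
```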

Fig. 3 CMPS model state tensor

The actions of the CMPS model indicate how the agent performs stock trading according to its current state; they are discrete and only involve the selling, holding, and buying of stocks, corresponding to the discrete values -1, 0, and 1, respectively. Because the CMPS model targets a stock portfolio, the action space of a single agent includes the actions for all the assets in the portfolio, whereas the overall CMPS action covers those of all the agents. Thus, the overall action \(a_{t}\) is expressed as follows.

$$\begin{aligned} a_{t}=\{a_{t}^{1},a_{t}^{2},\cdots ,a_{t}^{K}\} \end{aligned}$$
(10)
$$\begin{aligned} a_{t}^{i}=\{a_{t}^{i,1},a_{t}^{i,2},\cdots ,a_{t}^{i,M}\} \end{aligned}$$
(11)
$$\begin{aligned} a_{t}^{i,c}\in \{-1,0,1\} \end{aligned}$$
(12)

Here, K is the number of agents and M is the number of stocks. Thus, \(a_{t}^{i}\) represents the actions of agent i on the stock portfolio at time t, and \(a_{t}^{i,c}\) represents the action of agent i on stock c at time t.

The reward describes the score received after the agent takes an action in the current state. The reward of a single stock is the product of the current action and the difference between the closing prices at times t and \(t+1\). The rewards of all the stocks are summed for a single agent, and the rewards of all the agents together form the reward \(r_{t}\) of CMPS, which is denoted as follows.

$$\begin{aligned} r_{t}=\{r_{t}^{1},r_{t}^{2},\cdots ,r_{t}^{K}\} \end{aligned}$$
(13)
$$\begin{aligned} r_{t}^{i}=\sum _{c=1}^{M}r_{t}^{i,c} \end{aligned}$$
(14)
$$\begin{aligned} r_{t}^{i,c}=a_{t}^{i,c} \times R_{t}^{i,c} \end{aligned}$$
(15)

\(r_{t}^{i}\) indicates the reward of agent i at time t, and \(r_{t}^{i,c}\) is the reward of agent i to stock c at time t, in which \(R_{t}^{i,c}\) represents the daily return of stock c between times t and \(t+1\).

If we consider risk-free assets for optimizing portfolios, the reward of an agent can be divided into two parts: the price change and the risk-free asset return. The overall reward \(r_{t}\) can be optimized as follows.

$$\begin{aligned} r_{t}^{i}=(1-Rf_{w}) \times \sum _{c=1}^{M}r_{t}^{i,c}+Rf_{w}\times r_{f}\times \frac{100}{252} \end{aligned}$$
(16)

\(Rf_{w}\) is the weight of the risk-free assets generated by the agents, and \(r_{f}\) refers to the return of the risk-free assets. Following the Markowitz mean-variance theory, the interest rate of treasury bonds is used to approximate \(r_{f}\); i.e., the risk-free part of the reward is \(Rf_{w}\times r_{f}\times \frac{100}{252}\). Here, 100 refers to a bond face value of 100 RMB yuan, and 252 is the total number of stock trading days in a year; thus, \(r_{f}/252\) is the daily yield of the risk-free assets, and \(100 \times r_{f}/252\) is the daily return of investing in a bond. For convenience, trading-friction effects are not considered here.
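A minimal sketch of the per-agent reward in Eqs. (14)-(16) is given below (Python; the function name is our own and \(r_{f}=3\%\) is only an example value); setting rf_weight = 0 recovers the purely risky reward of Eq. (14).

```python
def agent_reward(actions, daily_returns, rf_weight=0.0, r_f=0.03):
    """Reward of one agent for one step.

    actions       : per-stock actions a_t^{i,c} in {-1, 0, 1}
    daily_returns : per-stock daily returns R_t^{i,c} between t and t+1
    rf_weight     : Rf_w, share of capital placed in the risk-free asset
    r_f           : annual risk-free rate (e.g., treasury bond yield)
    """
    risky = sum(a * r for a, r in zip(actions, daily_returns))            # Eqs. (14)-(15)
    if rf_weight == 0.0:
        return risky
    return (1.0 - rf_weight) * risky + rf_weight * r_f * 100.0 / 252.0    # Eq. (16)
```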

3.1.2 Agent Network Structure

The policy \(\pi _{\theta }\) of our CMPS model plays a key role in the generation of trading actions; i.e., \(a_{t}=\pi _{\theta }(s_{t})\), where \(\theta \) is a set of policy parameters. In DRL, trading actions are learned through a neural network, and the policy is optimized by the DQN algorithm so that the cumulative discount reward is maximized.

Because CMPS contains agents with different functions and different input state modules, a separate network structure is designed for each agent. According to the differences in the states, CMPS has three agents: the stock market agent, the stock financial index agent, and the stock correlation agent. The network in each of these agents is composed of a fully-connected layer, a normalization layer, and a convolutional layer; nevertheless, the parameters differ according to the specific network structure. Figure 4 shows the overall structure of the network.

Fig. 4 CMPS model agent network architecture

In the stock market and financial index agents, the convolutional layer extracts features, assisted by the fully-connected and normalization layers, to produce a score tensor. In the stock correlation agent, because the correlation of the stocks remains unchanged, we regard this agent as a static one that only extracts the features of the correlations and feeds them to the other two agents. All the generated score tensors are input into the self-attention layer to generate an overall score \(\rho \). Here, the self-attention layer improves feature extraction and avoids forgetting remote features; thus, the agents capture useful features from different weight vectors.

We also consider the problem of risk prevention in our CMPS model by generating a weight \(Rf_{w}\) for the risk-free assets. This consideration is inspired by mutual fund separation [51]; i.e., the optimal portfolio should be found on the capital market line, which is derived by recombining risk-free and risky assets. Because it is not feasible to obtain this portfolio exactly in reality, CMPS approximates it based on the current stocks and allocates the capital between the risk-free and risky assets. The weight \(Rf_{w}\) is regarded as a value selected from the continuous action space [0, 1] following the normal distribution \(N(\mu ,\sigma ^{2})\). After the self-attention layer generates \(Rf_{w}\), it is standardized to lie within the range [0, 1].

The detailed parameters are as follows. The convolution operation parameters in the stock quote agent are \((in\_channels=M,out\_channels=M-1,kernel\_size=(1,1))\), and the convolution operation parameters in the stock correlation agent network are \((in\_channels=M+1,out\_channels=M,kernel\_size=(1,1))\). As mentioned before, M refers to the number of stocks in the portfolio.

When the three agents generate their scores and pass them to the self-attention layer, an overall score \(\rho \) is output, together with the weight of the risk-free assets when risk-free assets are considered. The operation parameters used in the self-attention layer were as follows: \((dim\_in = 3,dim\_k = 16,dim\_v = 16)\).

Once the networks were constructed, better actions were identified based on the scoring tensor of the overall output; i.e.,

$$\begin{aligned} a_{t}=argmax(\rho _{t})-1 \end{aligned}$$
(17)

Here, \(\rho _{t}\) denotes the scores of the three discrete actions at time t; the index of the maximum score takes one of the values 0, 1, or 2, so subtracting 1 yields an action in \(\{-1,0,1\}\).
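To illustrate how a score tensor is turned into trading actions, the following simplified sketch (PyTorch) uses a single convolution-plus-fully-connected scorer and the rule of Eq. (17); the layer sizes, class name, and single-agent setting are illustrative assumptions that omit the cross-agent self-attention fusion of the full model.

```python
import torch
import torch.nn as nn

class ScoringAgent(nn.Module):
    """Toy per-agent scorer: maps a (M+1, window, features) state to an (M, 3) score tensor,
    one score per {-1, 0, 1} action for each stock."""

    def __init__(self, m_stocks, window, n_features):
        super().__init__()
        self.conv = nn.Conv2d(m_stocks + 1, m_stocks, kernel_size=(1, 1))  # as in the quote agent
        self.norm = nn.BatchNorm2d(m_stocks)
        self.fc = nn.Linear(window * n_features, 3)

    def forward(self, x):                       # x: (batch, M+1, window, n_features)
        h = torch.relu(self.norm(self.conv(x)))
        h = h.flatten(start_dim=2)              # (batch, M, window * n_features)
        return self.fc(h)                       # scores rho: (batch, M, 3)

def scores_to_actions(rho):
    """Eq. (17): a_t = argmax(rho_t) - 1, giving per-stock actions in {-1, 0, 1}."""
    return rho.argmax(dim=-1) - 1

agent = ScoringAgent(m_stocks=34, window=10, n_features=5)
rho = agent(torch.randn(2, 35, 10, 5))
print(scores_to_actions(rho).shape)             # torch.Size([2, 34])
```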

3.2 Model Training

During training, CMPS aimed to produce experiences \(e_{t}\) from the training dataset and store them sequentially in the empirical pool. CMPS adopted the DQN algorithm to optimize the policy; however, it needed to store the experiences to implement the off-policy mechanism. Generally, experience data can be represented as \(e_{t}=<s_{t},a_{t},s_{t+1},r_{t}>\), where \(s_{t}\) is the state of time t, \(a_{t}\) is the action of time t, \(s_{t+1}\) is the state of time \(t+1\) (i.e., the next state), and \(r_t\) is the reward after action \(a_{t}\) is taken. The interaction between the agent and environment led to experiences that aided in the completion of MDP, which in turn produced multiple experiences that were then input into the empirical pool. The experiences were then sampled from the empirical pool to perform the training process to optimize the agent network parameters.

In our CMPS model, we constructed an empirical pool in a different manner. \(s_{t}\) was divided into three categories: \(s_{t}^{p}\), \(s_{t}^{f}\), and \(s_{t}^{c}\), which represent the stock market quote state, stock financial index state, and stock correlation state, respectively. Consequently, a single experience can be expressed as follows.

$$\begin{aligned} e_{t}=<s_{t}^{p},s_{t}^{f},s_{t}^{c},a_{t}, s_{t+1}^{p},s_{t+1}^{f},s_{t+1}^{c},r_{t}> \end{aligned}$$
(18)

If all the states were placed together, each agent would have to extract its specified state when calculating the loss function; thus, the corresponding states would have to be decomposed for each sampling batch, which would increase the computational cost. To sample the experiences efficiently, we therefore stored the states in a physically decentralized manner: each state category from every experience was stored in its own list so that it could be sampled quickly even for a massive MDP.

When the number of experiences reached the batch size \(\varpi \), CMPS sampled data from the empirical pool to conduct training. Once the number exceeded the pool size, new experiences iteratively overwrote the oldest ones from the beginning of the pool.
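A sketch of such a decentralized empirical pool is shown below (Python; the class and key names are our own). Each state component of Eq. (18) is kept in its own buffer so that agent-specific batches can be sampled without unpacking joint tuples, and the oldest experiences are overwritten once the capacity is reached.

```python
import random
from collections import deque

class DecomposedReplayPool:
    """Empirical pool storing each component of Eq. (18) in its own buffer."""

    KEYS = ("s_p", "s_f", "s_c", "a", "s_p_next", "s_f_next", "s_c_next", "r")

    def __init__(self, capacity=100_000):
        self.pools = {k: deque(maxlen=capacity) for k in self.KEYS}

    def push(self, *experience):
        # experience must follow the order of KEYS
        for key, item in zip(self.KEYS, experience):
            self.pools[key].append(item)

    def sample(self, batch_size):
        idx = random.sample(range(len(self)), batch_size)
        return {k: [self.pools[k][i] for i in idx] for k in self.KEYS}

    def __len__(self):
        return len(self.pools["r"])
```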

In CMPS, the loss value is calculated based on the constructed loss function. Because the model contains three different agents, the individual losses are first calculated and then averaged to obtain the final loss, and the network parameters of each agent are updated in a gradient-descent manner. Specifically, the loss of the i-th agent is defined as follows.

$$\begin{aligned} Loss_{i}=[Q(s_{i},a_{i};\theta _{i})-r_{i}-\gamma max_{a'_{i}}\hat{Q}(s'_{i},a'_{i};\theta _{i}^{*})]^{2} \end{aligned}$$
(19)

The loss value of the i-th agent required the output of the main network Q and the target network \(\hat{Q}\), i.e., \(Q(s_{i},a_{i};\theta _{i})\) and \(\hat{Q}(s'_{i},a'_{i};\theta _{i}^{*})\), where \(\theta _{i}\) and \(\theta _{i}^{*}\) are the parameters of the main and target networks, respectively.

After the loss values \(Loss_{i}\) are calculated, the final loss value FLoss is obtained as their arithmetic average.

$$\begin{aligned} FLoss=\frac{\sum _{i=1}^{K}Loss_{i}}{K} \end{aligned}$$
(20)
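A condensed sketch of one joint update is given below (PyTorch; for brevity each agent is assumed to output a flat Q-vector per state, and optimizer handling is simplified): the per-agent TD losses of Eq. (19) are averaged into FLoss as in Eq. (20) before back-propagation.

```python
import torch

def cmps_training_step(agents, targets, batches, optimizers, gamma=0.99):
    """One update: average the K per-agent DQN losses and apply one gradient step to each agent."""
    losses = []
    for agent, target, (s, a, r, s_next) in zip(agents, targets, batches):
        q_sa = agent(s).gather(1, a.unsqueeze(1)).squeeze(1)             # Q(s_i, a_i; theta_i)
        with torch.no_grad():
            q_next = target(s_next).max(dim=1).values                    # max_a' Q_hat(s'_i, a'_i; theta_i*)
        losses.append(((q_sa - r - gamma * q_next) ** 2).mean())         # Loss_i, Eq. (19)
    f_loss = torch.stack(losses).mean()                                  # FLoss, Eq. (20)
    for opt in optimizers:
        opt.zero_grad()
    f_loss.backward()
    for opt in optimizers:
        opt.step()
    return float(f_loss)
```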

Algorithm 1 shows the training procedure of CMPS. As can be seen, there are two networks, Q and \(\hat{Q}\), representing the training policy and the target policy. In each episode, we input the state at time t into Q and output the action, then obtain the reward and the next state. These are stored in the empirical pool and used to train Q with the loss function FLoss.

Algorithm 1 CMPS model training algorithm

3.3 Computing Cost Analysis

In the context of CMPS, each agent is represented by a deep neural network. This network typically comprises various layers, including fully connected layers, batch normalization layers, convolutional layers, and a self-attention layer. The computational cost of training such a network is determined by the collective costs associated with each of these layers, which in turn depend on factors like the input and output dimensions, the kernel size in the convolutional layer, and the feature dimension in the self-attention layer.

Given K agents in the system, each undergoing E epochs of training, with each epoch consisting of P iterations, and assuming that the cost of each training iteration is O(n), the total computational cost of training the CMPS system can be represented as \(O(n \times K \times E \times P)\). This estimation prompts further exploration as follows.

Firstly, the computational load is cumulatively contributed to by each network. Since these three agents are trained concurrently, the training time can be significantly reduced. However, the computing cost will be proportional to the number of agents. Secondly, each epoch represents a complete pass through the training dataset, and increasing the number of epochs can enhance learning and convergence, albeit at the cost of additional computational time. Thirdly, iterations per epoch indicate the level of training detail within each epoch. More iterations can offer more precise adjustments to the parameters, yet they also raise the overall computational demand.

4 Experimental Studies

4.1 Data Pre-processing

We collected the stock market quotes, financial index data of 34 SSE stocks, and the SSE index from the Resset database. The data were collected from April 30, 2009 to March 01, 2021. The daily stock quotes included the opening, highest, lowest, and closing prices, as well as the yield of stocks used by the first agent.

We selected 11 financial indexes: BasicEPS, NAPS, ToloperevPS, OpeprfPS, CFPS, AvgROE, ROA, Netprfrt, NprTOR, EPSgrrt, and Netprfgrrt, which refer to the earnings per share, net assets per share (yuan/share), total operating income per share (yuan/share), operating profit per share (yuan/share), net cash flow per share (yuan/share), average return on equity (%), net asset interest rate (%), net sales interest rate (%), net profit/operating revenue, earnings per share growth rate (%), and net profit growth rate (%), respectively. The first five indexes reflect the earnings, growth, and liquidity of the listed companies; the next four reflect profitability; and the last two denote the growth ability of the stock. All these indexes were used by the second agent.

Generally, different companies release quarterly or annual reports at different times, four times a year. Clearly, stock financial index data are sparse compared to market data. To address this problem, we filled the financial index data of all the trading days between the current release date and the next release date with the indexes of the current financial report to ensure that the model obtained the features of the financial indexes according to the latest financial report data.
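The forward-filling step can be sketched as follows (Python/pandas; the dates and index values are toy data, and business days stand in for exchange trading days): every trading day between two release dates carries the indexes of the most recent financial report.

```python
import pandas as pd

# toy quarterly releases of two financial indexes
reports = pd.DataFrame(
    {"BasicEPS": [0.52, 0.61], "NAPS": [4.10, 4.35]},
    index=pd.to_datetime(["2020-04-30", "2020-08-31"]),   # report release dates
)
trading_days = pd.bdate_range("2020-04-30", "2020-09-30") # proxy for exchange trading days
daily_index = reports.reindex(trading_days).ffill()       # forward-fill the latest report
print(daily_index.loc["2020-07-15"])                      # still shows the 2020-04-30 figures
```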

According to financial theory, optimizing a stock portfolio requires the mean, variance, and covariance of the return rates, which is achieved by constructing a covariance matrix of the assets. Thus, the stock correlation data consist of the covariance matrix of the return rates of the stocks and the SSE index over the past \(l\) days. Our model added the SSE index to reflect the correlation between the individual stocks and the overall market. To evaluate the return and risk-prevention ability of our model, we built two datasets: the bull and bear datasets. The bull dataset was compiled from the data collected from April 30, 2009 to March 01, 2021, which reflect an overall up-trend of the market. The bear dataset was collected from April 30, 2009 to September 01, 2015; within it, there was a significant downward trend from September 01, 2014 to September 01, 2015, and the data in this period were used to perform the back-test to evaluate the risk-prevention ability of our models.

4.2 Experimental Configuration

4.2.1 Benchmark Model

We selected the stock quantitative trading models FINRL-PPO, FINRL-DDPG, FINRL-TD3, and FINRL-SAC proposed by Liu et al. [27], which are built on the corresponding DRL algorithms. All these models are single-agent models and use stock market quotes and technical indicators for training; i.e., no financial indexes or stock correlation data are used.

We also selected the MAPS model proposed by Lee et al. [36] in 2020. As a SOTA model, MAPS is a portfolio-management system based on collaborative multi-agent RL and the DQN algorithm. The differences between MAPS and CMPS are as follows. First, similar to FINRL, MAPS only uses the stock market quotes as training data, whereas CMPS uses additional data such as financial indexes and stock correlation data. Second, MAPS uses several agents to build local models representing independent investors, which enriches the investigation directions, whereas CMPS uses different agents to learn different types of knowledge from different types of data; i.e., only one agent represents the investor, and the other agents assist the investor with macro guidance. Third, MAPS uses a mechanism to balance local and global investors for risk management, whereas CMPS considers risk-free assets to prevent risks. In our study, we adopted two MAPS variants, MAPS@3 and MAPS@8, with three and eight agents, respectively. Note that although SARL and DeepTrader would also be natural baselines for CMPS, their datasets are too different to be reused: as presented before, SARL employs financial news and DeepTrader employs a market sentiment index, both of which are tied to their specific stock universes and are not available for ours.

We also conducted ablation studies to evaluate the impact of the individual techniques used in our model; i.e., CMPS without the financial index agent and without the stock correlation agent, denoted as CMPS-NF and CMPS-NC, respectively.

Further, we added risk-free assets to obtain CMPS-RF for the automatic selection of risk-free assets and risky stock portfolios for portfolio management.

4.2.2 Evaluation Metrics

We evaluated our models based on the following three aspects. First, we used the cumulative return (CuR) and the annualized return ratio (ARR) to evaluate the return ability. Second, we used the annualized volatility (AVol) and the maximum draw down (MDD) to evaluate risk prevention. Third, we selected the annualized Sharpe ratio (ASR) and Calmar ratio (CR) to comprehensively evaluate the return and risk.

(1) CuR

The CuR was calculated using the final total portfolio value \(p_{t}\) and the initial portfolio value \(p_{0}\), and the ratio of CuR reflected the overall return of the model. A higher CuR is preferable.

$$\begin{aligned} R_{CuR}=\left( \frac{p_{t}}{p_{0}}-1\right) \times 100\% \end{aligned}$$
(21)

(2) ARR

The CuR is converted into the geometric average return ratio per year, namely the ARR. Here, n refers to the number of days over which the back-test was performed. A higher ARR indicates a better model.

$$\begin{aligned} R_{ARR}=((1+R_{CuR})^{\frac{252}{n}}-1)\times 100\% \end{aligned}$$
(22)

(3) AVol

The volatility of a portfolio refers to the standard deviation of the return rate during the back-test period; whereas, AVol refers to the volatility in years. AVol measured risk; i.e., the higher the AVol, the higher the model’s risk.

$$\begin{aligned} \sigma _{AVol}= \sqrt{\frac{252}{n}} \times \sigma \times 100\% \end{aligned}$$
(23)

Here, \(\sigma \) refers to the standard deviation of the return rate of the portfolio during the back test period, and n indicates the number of days that the back test was performed.

(4) MDD

MDD is used to assess the worst condition that a strategy might generate; it describes the maximum loss of the portfolio from peak to trough and highlights the downside risk of the strategy. In general, a smaller MDD is preferable. It can be expressed as follows.

$$\begin{aligned} MDD = max_{\tau > t}\frac{p_{t}-p_{\tau }}{p_{t}} \times 100\% \end{aligned}$$
(24)

(5) ASR

ASR was used to evaluate the excess return ability under the unit volatility risk. It is an evaluation metric that considers strategic risks and returns. Here, \(R_{f}\) is the return on risk-free assets. A higher ASR indicates a strong model that prevents risks with a high return.

$$\begin{aligned} ASR=\frac{E(R_{ARR}-R_{f})}{\sigma _{AVol}} \end{aligned}$$
(25)

(6) CR

The CR is similar to the ASR: it also measures the risk-return ratio of the portfolio, but uses the MDD as the risk measure. As with the ASR, a higher CR is better.

$$\begin{aligned} CR=\frac{E(R_{ARR}-R_{f})}{MDD} \end{aligned}$$
(26)
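For reference, the following sketch computes all six metrics from a daily portfolio-value series, directly following Eqs. (21)-(26) as written (Python/NumPy; the function name is our own, values are returned as fractions rather than percentages, and \(r_{f}=3\%\) is only an example).

```python
import numpy as np

def backtest_metrics(values, r_f=0.03):
    """Six evaluation metrics from portfolio values p_0 ... p_n (one value per back-test day)."""
    values = np.asarray(values, dtype=float)
    n = len(values) - 1                                              # number of back-test days
    daily_returns = values[1:] / values[:-1] - 1.0
    cur = values[-1] / values[0] - 1.0                               # Eq. (21)
    arr = (1.0 + cur) ** (252.0 / n) - 1.0                           # Eq. (22)
    avol = np.sqrt(252.0 / n) * daily_returns.std(ddof=1)            # Eq. (23), as written
    running_max = np.maximum.accumulate(values)
    mdd = np.max((running_max - values) / running_max)               # Eq. (24)
    asr = (arr - r_f) / avol                                         # Eq. (25)
    cr = (arr - r_f) / mdd                                           # Eq. (26)
    return {"CuR": cur, "ARR": arr, "AVol": avol, "MDD": mdd, "ASR": asr, "CR": cr}
```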

4.2.3 Parameters

The cost of stock trading consists of commissions, stamp duties, and so on. Here, we set the transaction cost of the stocks to \(\mu =0.001\); i.e., the charge is 0.1% of the transaction amount. The learning rate, empirical pool size, and sampling batch size of CMPS were 0.0001, 100,000, and 64, respectively. Table 2 specifies the experimental parameters.

Table 2 Experimental parameters

4.3 Experimental Results

4.3.1 Performance of CMPS Over the Bull Market

The models were first trained using the bull dataset. Table 3 presents the test results. As shown, CMPS achieved better CuR and ARR than the other models. The ARR of CMPS reached 62.683%, which was higher than that of FINRL and higher than that of MAPS by approximately 4%, indicating that CMPS has a strong ability to obtain profits. Second, the ASR of CMPS was higher than that of both FINRL and MAPS, indicating that the return obtained by CMPS was better when asset volatility is used to measure the risk. Third, CMPS performed better on CR than MAPS, showing an improvement of 0.5; this indicates that CMPS can obtain excellent returns when MDD is considered.

Table 3 Back-test results on the bull dataset

In general, CMPS improved the return ability of the stock asset portfolio when considering the risks. Figure 5 shows the CuR trend in the test range for each model.

Fig. 5 CuR of models on the bull dataset

Table 3 also shows the results when the financial index agent or the stock correlation agent was removed from CMPS, i.e., CMPS-NF and CMPS-NC. Both models show reductions of about 3% in CuR and ARR compared with CMPS, indicating that the introduction of financial indexes and stock correlation increased the model returns. Moreover, the ASR of CMPS-NF and CMPS-NC decreased by 0.1 compared with that of CMPS, and the CR decreased by 0.7, indicating that adding the financial indexes and stock correlation improved the ability of CMPS to obtain returns when the risks of volatility and drawdown are considered.

Note that after the addition of risk-free assets, CMPS-RF reduced the CuR and ARR by 30%. Nevertheless, CMPS-RF showed an advantage in terms of AVol, which was approximately half of that of CMPS; this led to a higher ASR than that of CMPS, even though the ARR was lower. This shows that CMPS-RF obtained more stable profits. Furthermore, the CR of CMPS-RF was higher than that of CMPS and the other models, showing that CMPS-RF is better than CMPS when the downside risk is measured. In general, CMPS-RF avoided certain market risks by allocating wealth between risky and risk-free assets; when market volatility was high, CMPS-RF invested more in the risk-free assets. However, this does not imply that CMPS-RF is merely conservative. As an example, we set the ARR of the risk-free assets to 3%; if all the capital were invested in risk-free assets, the ARR would not exceed 3%, but the ARR of CMPS-RF reached 34.710%, because CMPS-RF flexibly allocated the weight of risk-free assets in different periods to avoid risk. As shown in Fig. 5, the CuR of CMPS-RF exhibited a stable upward trend during the back-test phase.

To further illustrate the advantages of CMPS, we computed the CuR differences between CMPS and MAPS@3, CMPS-NF, and CMPS-NC to evaluate CMPS based on the cumulative return advantage (CuRA). As shown in Fig. 6, the CuRA of CMPS in the back-test phase showed an upward trend; after single-step trading, the CuRA of CMPS over the other models reached at least 2.5%. Moreover, CMPS-NF performed better than CMPS-NC, which shows that the stock correlation contributed more information than the financial indexes in terms of returns.

Fig. 6 CuRA of CMPS on the bull dataset

4.3.2 Performance of CMPS over the Bear Market

We further trained all the models using the bear dataset. In this regard, we selected the back-test phase ranging from 2014-09-01 to 2015-09-01, because the Chinese stock market almost crashed during this period, and the SSE index also fluctuated greatly. Table 4 lists the test results.

Table 4 Back-test results on the bear dataset

We observed that, in the case of large fluctuations, CMPS achieved the best CuR and ARR, better than those of FINRL and MAPS. Additionally, CMPS showed better ASR and CR than MAPS. When risk-free assets were considered, CMPS-RF achieved the best performance on the remaining four metrics: its AVol and MDD were nearly half of those of CMPS, and its ASR and CR were about 0.2 higher. Evidently, CMPS-RF shows a strong risk-prevention ability regardless of the market conditions. Note that although CMPS and MAPS achieved an ARR of more than 52%, their AVol and MDD were too large for investors to bear, because the loss almost reached 50%; thus, CMPS-RF is the better choice under such conditions. Figure 7 shows the CuR of all the models. As can be seen, some of the models attained profits in the beginning, but the CuR of CMPS began to decrease from June 2015. In contrast, CMPS-RF showed a much smaller performance deterioration than the others.

Fig. 7 CuR of models on the bear dataset

We also investigated the CuRA of CMPS relative to MAPS@3, CMPS-NF, and CMPS-NC. As shown in Fig. 8, although CMPS yielded the best final return, the CuR of CMPS was lower than those of the other models for most of the time, and the one-day reduction could exceed 2%. When the market turned downward, this disadvantage in CuR shrank, and the CuR advantage thereafter increased. We consider the main reason to be that when market expectations exceed the true value of the assets, a large bubble forms; in such a case, CMPS, which considers the financial indexes and the stock correlations, forms relatively lower market expectations. In other words, CMPS is more conservative when facing a market bubble, which reduces its CuR relative to MAPS@3, CMPS-NF, and CMPS-NC while the bubble expands. When the bubble bursts, this disadvantage shrinks or turns into an advantage. This shows that CMPS acts decisively when bubbles form or burst and can therefore better identify market bubbles and avoid asset shrinkage in comparison with the other three models.

Fig. 8 CuRA of CMPS on the bear dataset

4.3.3 Transaction Cost Sensitivity of CMPS

The transaction cost is a key factor affecting the stock market and is adopted to prevent massive high-frequency trading. Generally, the rise in transaction costs leads to a decrease in trading, and the turnover rate in the secondary market consequently declines. If other conditions in the market remain unchanged, the CuR is bound to decline.

We studied the impact of the transaction cost \(\mu \) on CMPS, using CMPS-\(\mu \) to denote CMPS with a transaction cost of \(\mu \). As shown in Table 5, \(\mu \) was set to 0, 0.001, 0.002, and 0.003. An increase in \(\mu \) caused a decline in the CuR of CMPS, particularly when \(\mu \) changed from 0.2% to 0.3% on the bull dataset, where there was nearly a 6% reduction. The experimental results show that increasing \(\mu \) discourages market speculation. When \(\mu \) increased from 0 to 0.1% and then to 0.2%, the AVol of CMPS on the bull dataset changed little, but after the transaction cost rose to 0.3%, the AVol decreased. A similar trend was observed on the bear dataset. Thus, increasing \(\mu \) reduced the risk, although the return also decreased. Nevertheless, both the ASR and CR decreased with an increase in \(\mu \), which implies that the CMPS model abides by the usual financial rules. Moreover, the transaction cost had a greater impact on the CMPS model under downside market conditions.

Table 5 Transaction cost sensitivity of CMPS

5 Conclusions

In this study, we examined stock portfolio management using DRL. Most previous studies have used real-time stock quotes as the training data, which leads to a biased estimate of the state-transition probabilities in the MDP because insufficient features are extracted. We proposed a model called CMPS and collected heterogeneous data, including stock quotes, financial indexes, and correlations between stocks. We fused these data in our model through a three-agent structure to extract different features and used a self-attention mechanism to output comprehensive scores for the stocks, enabling more accurate action selection. We then introduced risk-free assets based on the financial theory of mutual fund separation and proposed the risk-prevention model CMPS-RF, wherein the risk-free and risky assets are balanced using a computed weight. Our experimental results showed that CMPS obtained the best returns; i.e., the CuR increased by more than 6.8% in comparison with the SOTA studies. Further, CMPS showed advantages in identifying market bubbles and avoiding the asset shrinkage caused by their bursting. CMPS-RF, which considers risk-free assets, was found to be more robust and achieved the highest annualized Sharpe ratio and Calmar ratio as well as the lowest annualized volatility and maximum drawdown.