1 Introduction

Portfolio optimization is a decision-making process that continuously allocates funds across a set of financial assets with the aim of maximizing return [1]. A key challenge is balancing multidimensional information and sometimes conflicting objectives in a noisy financial environment, and trading algorithms are expected to operate at this fine granularity. Traditionally, machine learning and deep learning methods have been used to predict future price trends and fluctuations [2,3,4]. An inherent difficulty of such price-prediction-based algorithms, however, is forecasting future stock behavior with high accuracy. Indeed, under the Efficient Market Hypothesis [5], it is nearly impossible for any trader to outperform the market and consistently produce risk-adjusted excess returns (alpha).

Lately, deep Reinforcement Learning (RL) has attracted much attention due to its remarkable achievements in playing video games [6] and board games [7]. In RL, an agent’s current behavior is tied to its future rewards through repeated interactions with its environment. This allows the agent to gradually adopt actions that maximize rewards and minimize penalties without having to predict future states. Such a learning process is natural in biological life forms and has also proven highly effective for artificial agents [8]. Incorporating deep neural networks into this reward-and-penalty learning process gives deep RL an inherent edge in many different applications.

There have been many successful attempts to apply model-free deep RL algorithms to algorithmic trading problems, using both value-based and policy-based methods. By discretizing market actions, Lucarelli and Borrotti [9] propose a value-based approach that applies recent adaptations of Q-learning, e.g., Deep Q-Network [6], Double-DQN [10], and Dueling-DQN [11], to portfolio selection problems. Although discretizing market actions is feasible, as the number of input assets grows and the action space becomes higher-dimensional, the discretized output becomes increasingly difficult for the neural network to handle.

In order to accommodate more assets, Jiang et al. [12] introduce an RL framework based on the actor-critic Deterministic Policy Gradient algorithm [13, 14], a technique that combines value-based and policy-based RL and can output continuous actions through a policy function approximated by a neural network. However, the states in [12] still depend on historical stock prices with only three features, i.e., the highest, lowest, and close prices of stocks. This is a rather simplified assumption, since the stock market is driven by far more than these three features and behaves largely independently of past performance. Indeed, the external environment, including the global economy and the companies themselves, has a significant impact on the stock market. Moreover, the strategy in [12] is deterministic, so its trading agent is highly conservative and lacks the ability to explore and capture alpha returns.

In addressing these problems, our contributions are four-fold:

  • First, we benchmark several classic RL algorithms, Deep Deterministic Policy Gradient (DDPG) [14], Twin-Delayed DDPG (TD3) [15], and Soft Actor-Critic (SAC) [16, 17], in the continuous portfolio optimization action space.

  • Second, to imitate the uncertainty in the real financial market, we propose a novel state-of-the-art stochastic reinforcement learning framework inspired by Soft Actor-Critic (SAC) and Quantile-Regression DQN (QR-DQN) [18, 19], namely Stochastic Policy with Distributional Q-Network (SPDQ), for the dynamic management of the stock market portfolio. Importantly, we create a novel structure containing a stochastic policy, modeled by Gaussian Mixtures, and a distributional critic modeled by quantile numbers.

  • Third, we enrich the state space by adding additional qualitative financial factors. Additionally, we reformulate the one-step reward by adding a risk term to the simple return.

  • Fourth, we provide an interpretation of the model’s strategy and an ablation study over different hyperparameters, both to better accommodate the diverse input states and to assess the robustness of our proposed algorithm.

The rest of this paper is organized as follows. Section 2 provides a comprehensive review of previous model-free reinforcement algorithms and their applications in portfolio optimization. Section 3 gives a mathematical definition of the portfolio optimization problem. Section 4 introduces a basic preliminary of reinforcement learning and dives into the proposed stochastic RL algorithms for the presented portfolio optimization problem. Section 5 details experimental procedures and results corresponding to the proposed algorithms. In Section 6, the conclusions for this research are given.

2 Related works

2.1 State-of-the-art RL algorithms

Previous works have used Deep Deterministic Policy Gradient (DDPG) [14] and Twin Delayed Deep Deterministic Policy Gradient (TD3) [15] to generate deterministic actions in continuous action spaces. In contrast, Soft Actor-Critic [16, 17] learns a stochastic policy via maximum-entropy reinforcement learning, with a temperature hyperparameter that controls the relative importance of return and entropy. Apart from stochastic action spaces, C51 [18] and QR-DQN [19] learn a value distribution and highlight the ways in which the value distribution impacts learning in the approximate setting. Given the uncertainty of the financial market, the built-in stochasticity of SAC [16, 17] has an edge over the Gaussian noise added to the deterministic DDPG [14] and TD3 [15] policies when it comes to exploration. Similarly, unlike the fixed rewards in video games, market returns are random, so estimating a value distribution of the cumulative market return, rather than averaging away its randomness, becomes increasingly important.

2.2 DRL applications in stock trading

Current mainstream RL uses the accumulated discounted reward as the objective function. Among the stock trading research under this discounted-reward setting, Liang et al. [20] propose two adapted versions of policy-based RL algorithms, based on Proximal Policy Optimization (PPO) [21] and Policy Gradient (PG), for portfolio management on China’s stock market. Lucarelli and Borrotti [9] implement the Deep Q-Network, Double DQN, and Dueling DQN, which are all value-based RL algorithms. For cryptocurrency portfolios, Jiang et al. [12] apply the Deterministic Policy Gradient (DPG), containing both a policy network and a value network, to solve portfolio optimization problems. Wang et al. [22] use a hierarchical structure containing a high-level RL with an entropy bonus to control the portfolio weights and a low-level RL to control selling prices and quantities within one day. Fang et al. [23] implement an oracle by distilling actions trained from perfect future stock information (Policy Distillation) [24] to guide the agent in making decisions with imperfect past stock information. For model-based approaches, Yu et al. [25] incorporate an Infused Prediction Module (IPM) into the original actor-critic style DDPG algorithm so that transition states can be predicted by the IPM. All the proposed RL algorithms claim to be profitable and to outperform classical algorithms in terms of the Sharpe ratio and geometric mean return. Importantly, the input states of these RL algorithms are limited to the open, close, high, and low price vectors.

2.2.1 Risk-aware DRL in stock trading

Apart from considering the expected accumulated discounted reward as the objective function, there is another research track in the Reinforcement Learning community that uses Conditional Value at Risk (CVaR/VaR) as the objective function, emphasizing AI safety and risk awareness. Theoretically, Chow et al. [26] propose the CVaR MDP, in which the standard risk-neutral expectation is replaced by a risk-sensitive Conditional-Value-at-Risk (CVaR) objective. Stanko and Macek [27] introduce CVaR Q-learning, a sampling version of CVaR Value Iteration [26] based on the distributional policy improvement algorithm. In the financial market, modern portfolio theory (MPT), or mean-variance analysis [28], Value-at-Risk (VaR) [29], and Conditional-Value-at-Risk (CVaR) [30] are all widely used in risk management models to reduce the maximum possible loss of a financial product under price fluctuations. However, very little research connects these risk models with reinforcement learning and successfully applies them to the financial market.

2.2.2 Function approximator in stock trading

Other stock-trading literature focuses on designing customized neural network topologies for financial features. Initially, many researchers used customized neural networks for stock price prediction tasks. For instance, Chen et al. [31] incorporate a graphical convolutional network based on quantitative data, and Ding et al. [32] embed business events according to knowledge graph information to predict stock prices. Later, inspired by these prediction tasks [31, 32], AlphaStock [33] creates a Transformer-based Cross-Asset Attention Network (CAAN) that uses multiple stock features to approximate the functions in RL algorithms. Moreover, Wu et al. [34] implement a Gated Recurrent Unit (GRU) with Deep Q-learning and Policy Gradient algorithms to extract informative financial features.

3 Problem statement

In portfolio optimization, we continuously allocate capital across a number of financial assets with the aim of maximizing the cumulative return. For an automatic trading agent, the process of obtaining daily returns by enhancing or reducing portfolio positions can be seen as a finite Markov Decision Process. This section provides a mathematical setting of the portfolio optimization problem and its connection to Reinforcement Learning.

3.1 Assumptions

In this work, we only consider back-testing, where the trading agent has no information about the future stock market. The trading agent is assumed to go back to a timestamp in the stock market history and carry out paper trading from then onward. To meet the requirements of back-testing, we make two assumptions in our experiment:

  1. Zero Slippage: The market assets are high in liquidity, so each transaction can be completed immediately after an order is placed.

  2. Zero Market Impact: The transactions made by the trading agent are insignificant, so they have no influence on the market.

In a realistic trading environment, these two assumptions hold when the trading volume in the stock market is sufficiently high.

3.2 Mathematical formalism

To formulate our portfolio model, we modify the settings of Online Portfolio Selection (OLPS) [35]. The portfolio consists of one cash asset and m stock assets. The trading time is equally divided into periods of length T, with T equal to one day in this paper. Since the back-test experiments assume that, at the beginning of period t + 1, assets can be traded immediately at the opening price of period t + 1, we may use the closing price v of period t to complete the transaction. More specifically, consider a portfolio vector wt = [w0,t,w1,t,...,wm,t]T, where the first element is the weight of cash and the ith element (i ≥ 1) represents the proportion of total capital invested in the ith stock at period t. We derive its price relative vector vt = [v0,t,v1,t,...,vm,t]T, whose ith component is the ratio of the closing price of asset i in period t to its closing price in period t − 1. Based on wt and vt, the final cumulative wealth after n periods is \( p_{f} = p_{0}{\prod}_{t=1}^{n}\mathbf{w}_{t}^{T}\mathbf{v}_{t} \), where p0 is the initial investment, w1 = [1,0,...,0]T and \({\sum}_{i=0}^{m} w_{i,t} = 1\). The tth-step exponential growth rate rt is given by \( r_{t} = \log(\mathbf{w}_{t}^{T}\mathbf{v}_{t}) \).
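To make the bookkeeping concrete, the following Python sketch (helper names and the toy numbers are ours, not from the paper) computes the price-relative vectors, the per-step exponential growth rate \(r_t = \log(\mathbf{w}_t^T\mathbf{v}_t)\), and the cumulative wealth pf for a toy portfolio of cash plus two stocks.

```python
import numpy as np

def price_relatives(close):
    """close: (n+1, m+1) closing prices incl. a constant cash column.
    Returns v_t = close_t / close_{t-1} for t = 1..n."""
    return close[1:] / close[:-1]

# Toy example: cash plus two stocks over four closing prices (three periods).
close = np.array([
    [1.0, 100.0, 50.0],
    [1.0, 102.0, 49.0],
    [1.0, 101.0, 51.5],
    [1.0, 103.0, 52.0],
])
v = price_relatives(close)                      # shape (3, 3)

# Constant weights here for simplicity; in the paper w_t is the agent's action.
w = np.tile(np.array([0.2, 0.4, 0.4]), (len(v), 1))

growth = np.log(np.einsum("ti,ti->t", w, v))    # r_t = log(w_t^T v_t)
p0 = 1.0
pf = p0 * np.exp(growth.sum())                  # p_f = p_0 * prod_t w_t^T v_t
print(growth, pf)
```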

Since transaction costs cannot be ignored, OLPS [35] adopts the proportional transaction model [36, 37], i.e., the incurred transaction cost is proportional to the wealth transferred when reallocating wt. Specifically, it introduces a transaction cost factor [37] μt, the ratio of total wealth after reallocation to wealth before reallocation, and the one-step exponential growth rate rt can be rewritten as \( r_{t} = \log(\mu_{t}\mathbf{w}_{t}^{T}\mathbf{v}_{t}), \) where \(\frac{1-\gamma_{s}}{1+\gamma_{b}} \leq \mu_{t} \leq 1\), and γs and γb are the commission fees for selling and buying stocks. When γs = γb = γ, Moody et al. [38] give an approximation to μt, i.e.,

$$ \mu_{t} = 1-\gamma\sum\limits_{i=1}^{m}\vert w'_{i,t}-w_{i,t} \vert $$
(1)

where \(w'_{i,t} = \frac {w_{i,t-1} \cdot v_{i,t}}{{\Sigma }_{j=0}^{m}w_{j,t-1}\cdot v_{j,t}}\) represents the adjusted portfolio weights due to the change in the stock price at time t.

In our work, we adopt the exponential growth rate rt with OLPS’s transaction cost and use (1) to approximate μt. Importantly, to complete our final one-step reward, we introduce an additional risk term and reformulate rt as

$$ r_{t} = \log(\mu_{t}\mathbf{w}_{t}^{T}\mathbf{v}_{t}) - \beta \text{Var}(r) $$
(2)

where β is a reward-risk adjustment hyperparameter and Var(r) is the variance of all rewards r1:t observed up to time t.
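For illustration, a minimal sketch of the risk-adjusted one-step reward in (2), using the approximation (1) for μt; the helper names, the 0.25% commission fee, and the choice β = 0.5 are our own illustrative assumptions.

```python
import numpy as np

def adjusted_weights(w_prev, v_t):
    """Portfolio weights after prices move but before reallocation (w'_{i,t})."""
    drifted = w_prev * v_t
    return drifted / drifted.sum()

def one_step_reward(w_prev, w_t, v_t, past_rewards, gamma_fee=0.0025, beta=0.5):
    """Risk-adjusted log return of (2) with the transaction-cost factor of (1).

    w_prev, w_t, v_t: arrays of length m+1 (cash first).
    past_rewards: list of rewards r_1..r_{t-1}; Var(r) uses all rewards so far.
    """
    w_adj = adjusted_weights(w_prev, v_t)
    # Equation (1): cost is charged on the m stock positions (indices 1..m).
    mu_t = 1.0 - gamma_fee * np.abs(w_adj[1:] - w_t[1:]).sum()
    r_t = np.log(mu_t * np.dot(w_t, v_t))
    risk = np.var(past_rewards + [r_t]) if past_rewards else 0.0
    return r_t - beta * risk
```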

4 Methodology

In this section, we first give a short description of the basic concepts in reinforcement learning, including value functions and loss functions, that are fundamental to subsequent algorithms proposed in Section 4.1. We then detail the novel architecture and description of the proposed reinforcement learning algorithm in Section 4.2.

4.1 Reinforcement learning: a short description of main concepts

As demonstrated in [39], reinforcement learning and control problems usually include an agent that acts in a stochastic environment by sequentially selecting actions over a sequence of time steps to maximize a cumulative reward. Generally, these problems can be formalized as discrete time stochastic Markov Decision Processes where an agent interacts with its surrounding environment in the following way: given a tuple \((\mathcal {S},\mathcal {A},\mathcal {P},\mathcal {R},\gamma )\), where

  • \(\mathcal {S}\) is a (finite) set of Markov states \(s \in \mathcal {S}\).

  • \(\mathcal {A}\) is a (finite) set of actions \(a \in \mathcal {A}\).

  • \(\mathcal {P}\) is dynamics (model-free) or an explicit transition model (model-based) for each action. For an explicit transition model satisfying the Markov property, it can be specified as \(\mathbb {P}(s_{t+1} = s'\mid s_{t}=s,a_{t}=a)\).

  • \(\mathcal {R}\) is an expected reward function under policy π and defined as \(\mathcal {R}(s_{t}=s,a_{t}=a)\triangleq \mathbb {E}_{\pi }[r_{t}\mid s_{t}=s, a_{t}=a]\).

  • γ ∈ [0,1] is a future discount factor.

For one single episode, the agent starts in a given state \(s_{0} \in \mathcal {S}\). At each time step t, it chooses an action \(a_{t} \in \mathcal {A}\) based on a policy π and receives an immediate one-step reward rt. It then keeps updating until it reaches a terminal state. Overall, our ultimate goal is to learn an optimized policy π that generates the optimal return from each state s. More details on the related definitions can be found in [39].

4.2 Proposed RL framework: from deterministic to stochastic

In this section, we first construct the deterministic actor-critic settings in Section 4.2.1. This is followed by a detailed description of how Stochastic Policy with Distributional Q-network (SPDQ) is implemented in Section 4.2.2.

4.2.1 Deterministic framework

Adapting Q-learning to continuous action spaces, Silver et al. [13] use the Deterministic Policy Gradient (DPG) to obtain the maximum return by iteratively moving in the gradient direction of Q rather than globally maximizing Q. In practice, for a deterministic policy μ𝜃, the policy parameters 𝜃k+1 are learned by ascending the gradient \(\nabla_{\theta}Q^{\mu_{k}}(s,\mu_{\theta_{k}}(s))\). As in other actor-critic style algorithms, the critic in DPG is learned by minimizing the Bellman error. Importantly, DPG [13] lays the theoretical foundation for Deep DPG [14].

In DDPG, Lillicrap et al. [14] incorporate deep neural network function approximators into DPG. In other words, for a target value function (network) \( Q^{w^{\prime }}(s,a)\) and a learned value function (network) Qw(s,a), DDPG introduces a method to slowly update the target network toward the learned network rather than directly copying the weights w to the target. Practically, the weights of the target network slowly track the trained network: \(w^{\prime} \leftarrow \tau w + (1-\tau)w^{\prime}\) with τ ≪ 1. In this way, the target value function can only change slowly, greatly enhancing the stability of learning. Additionally, in order to train the critic consistently without divergence, DDPG requires another target policy function \(\mu _{\theta ^{\prime }}\) that is also slowly updated from the learned policy function μ𝜃 in the same manner as the target value function. Another contribution of DDPG is that it adds a Gaussian noise process \(\mathcal {N}\) to the continuous actions to encourage exploration. Generally, acting based on a deterministic policy may not ensure adequate exploration and may result in sub-optimal solutions, especially in a highly volatile financial environment.
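The soft target update described above amounts to only a few lines of code; the sketch below assumes the network weights are stored as NumPy arrays, and τ = 0.005 is an illustrative value.

```python
def soft_update(target_weights, learned_weights, tau=0.005):
    """Slowly track the learned network: w' <- tau * w + (1 - tau) * w'."""
    for tw, lw in zip(target_weights, learned_weights):
        tw *= (1.0 - tau)   # in-place update of each target weight array
        tw += tau * lw
```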

4.2.2 Stochastic framework

Formally, we model the portfolio optimization problem with trading costs as a Markov Decision Process \((\mathcal {S},\mathcal {A},\mathcal {R},\gamma )\) without considering the transition probabilities. In practice, the time horizon of this MDP is the total holding time of the portfolio until the portfolio value pf reaches zero. At the beginning of period t, the trading agent generates a new portfolio weight and reallocates money to particular financial assets according to that weight. Here, we define the coupled state at time t as \(s_{t}:=\{\textbf {X}_{t},\textbf {W}_{t}\}\in \mathcal {S}\), where Xt is the historical stock features and Wt is the historical weights. In other words, we consider the previous portfolio weights to also be part of the state and concatenate them with the previous states along the feature dimension. For the non-cash assets, the jth feature Xt,i,j of asset i at time t is built from a look-back window of length l, i.e., \(\textbf {X}_{t,i,j}=\{x_{t-l,i,j},x_{t-l+1,i,j},\dots ,x_{t-1,i,j}\}\), where xt−1,i,j represents the value of feature j of asset i at time t − 1. For cash, Xt,0,j is made up of unit vectors and collectively has the same shape as Xt,i,j. In this setting, we have m + 1 assets (counting cash) and d + 1 features (the previous asset weights acting as an extra feature). This gives a coupled state \(\mathbf {s}_{t} \in \mathbb {R}^{l \times (m+1) \times (d+1)}\). At the beginning of holding period t, based on the input st, the trading agent generates a new continuous action, defined as \(\mathbf {a}_{t} := \mathbf {w}_{t} \in \mathcal {A}\), to redistribute the fund among the assets. Here \(\mathbf {w}_{t}\in \mathbb {R}^{m+1}\) satisfies \({\sum}_{i=0}^{m} w_{i,t} = 1\). For each state-action pair (st,at) at holding period t, the reward \(r_{t} \in \mathcal {R}\) satisfies (2).
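A small helper (ours, not the paper’s code) illustrates how the coupled state \(\mathbf{s}_t \in \mathbb{R}^{l \times (m+1) \times (d+1)}\) can be assembled from a look-back window of stock features and the matching window of past portfolio weights.

```python
import numpy as np

def coupled_state(features, weights, t, lookback):
    """Build s_t = {X_t, W_t}.

    features: array (T, m+1, d) of per-asset features; row 0 of the asset axis
              is the cash asset, which the paper fills with ones.
    weights:  array (T, m+1) of past portfolio weights.
    Returns an array of shape (lookback, m+1, d+1): the previous weights are
    concatenated to the features as one extra feature channel.
    """
    X_t = features[t - lookback:t]                    # (l, m+1, d)
    W_t = weights[t - lookback:t][..., np.newaxis]    # (l, m+1, 1)
    return np.concatenate([X_t, W_t], axis=-1)        # (l, m+1, d+1)
```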

4.2.3 Maximum entropy reinforcement learning

Instead of using the standard cumulative returns as the reinforcement learning objective, our goal is to learn a stochastic policy π(at|st) that maximizes the new entropy objective J(𝜃), i.e., \(J(\theta ) = \mathbb {E}_{s \backsim \rho ^{\pi _{{\theta }}}, a \backsim \pi _{{\theta }}}[r(s,a)+\alpha {\mathscr{H}}(\pi _{\theta }(\cdot \vert s))]\),where \(\rho ^{\pi _{{\theta }}}\) is the marginal state distribution, \({\mathscr{H}}(\cdot )\) represents an entropy function which is calculated as \({\mathscr{H}}(\pi _{\theta }(\cdot \vert s))=\mathbb {E}_{a \backsim \pi _{\theta }}[-\log (\pi _{\theta }(a \vert s))]\), and α stands for the temperature parameter that weighs the importance of the reward against the entropy term.

To model the diverse modality of our portfolio policy under different states, we suppose that the output of our policy network, Θ(st) := {μ𝜃(st),σ𝜃(st)}, follows a Mixture Model with K multivariate Gaussian components \(N_{i}(\mu_{i},{\Sigma}_{i}), i = 1,2,\dots,K\). Based on this formulation, we sample an action \(a_{t}^{\prime } \in \mathcal {A}^{\prime }\) from the policy network by performing the reparameterization trick. Thus, the probability density function of \(\mathcal {A}^{\prime }\) is given by \( p_{\mathcal {A}^{\prime }}(a^{\prime }) = {\sum }_{i=1}^{K}\omega _{i}N_{i}(a^{\prime };\mu _{i},{\Sigma }_{i}), \) where \({\sum }_{i=1}^{K}\omega _{i}=1\).

Subsequently, we introduce a map \(f: \mathcal {A}^{\prime } \to \mathcal {A}\) to map the original random variable \(\mathcal {A}^{\prime }\) to a simplex region that satisfies the properties in \(\mathcal {A}\). The function f is written as

$$ f(a_{i}^{\prime},\tau) = \frac{{\exp}(a_{i}^{\prime}/\tau)}{{\sum}_{j=1}^{h}{\exp}(a_{j}^{\prime}/\tau)+\delta}, $$
(3)

where \(\tau \in (0, \infty )\) is the temperature parameter that controls the weight distribution in different assets, and δ ≈ 10−9 is a small number to ensure that the map f is invertible.
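The map f in (3) is simply a temperature-scaled softmax with a small offset δ. A direct sketch follows; the default τ = 0.1 is taken from the ablation in Section 5.4 and is otherwise an illustrative choice.

```python
import numpy as np

def simplex_map(a_raw, temperature=0.1, delta=1e-9):
    """Map a raw mixture sample a' onto the portfolio simplex, eq. (3).

    Numerical safeguards (e.g. subtracting max(a_raw) before exp) are omitted
    to keep the sketch close to the formula.
    """
    z = np.exp(a_raw / temperature)
    return z / (z.sum() + delta)
```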

Consequently, the density function of \(\mathcal {A}\) after transformation is represented by \( p_{\mathcal {A}}(a) = p_{\mathcal {A}^{\prime }}(a^{\prime })\vert \det J_{f}(a^{\prime },\tau )\vert ^{-1} \), where Jf(⋅,τ) is the Jacobian of f(⋅,τ). Finally, the log-likelihood of action π(at|st) (entropy term) can be expressed as

$$ \begin{array}{@{}rcl@{}} \log\pi_{\theta}(a_{t}\vert s_{t}) &=& \log p_{\mathcal{A}^{\prime}}(a_{t}^{\prime})-\log \vert\det J_{f}(a_{t}^{\prime},\tau)\vert \\ &\geq& \log p_{\mathcal{A}^{\prime}}(a_{t}^{\prime})+h\log(\tau)-\sum\limits_{i=1}^{h}\log({a_{t}^{i}}) \end{array} $$
(4)

where \({a_{t}^{i}}\) represents the weight of the ith asset; the inequality follows from the Matrix Determinant Lemma [40] and proper scaling. It therefore gives a lower bound on \(\log \pi _{\theta }(a_{t}\vert s_{t})\) that simplifies minimizing the log-likelihood itself. A detailed proof of this inequality is given in Appendix A.

Finally, the policy parameters 𝜃 can be learned by minimizing the following equation from [17], i.e.,

$$ L_{\pi}(\theta) = \mathbb{E}_{s_{t} \backsim \mathcal{D},a_{t} \backsim \pi_{\theta}}[\alpha\log(\pi_{\theta}(a_{t}\vert s_{t}))-Q^{w}(s_{t},a_{t})] $$
(5)

where \(\mathcal {D}\) stands for the experience replay buffer. Importantly, in (4) we derive a lower bound for \(\log \pi _{\theta }(a_{t}\vert s_{t})\). Thus, to avoid gradient explosion during training, we substitute the lower bound into (5) and only minimize the resulting lower bound of Lπ(𝜃). By extending the DDPG-style policy gradient [14], we can approximate the gradient of the policy loss (5) with \(\nabla _{\theta }L_{\pi }(\theta ) = (\nabla _{a_{t}}\alpha \log (\pi _{\theta }(a_{t}\vert s_{t})) - \nabla _{a_{t}}Q(s_{t},a_{t}))\nabla _{\theta }g_{\theta }(s_{t};\epsilon _{t}) \), where g𝜃 = f ∘Θ and at is evaluated at g𝜃(st;𝜖t).
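As a sketch of how the surrogate objective is evaluated per sample, the function below plugs the lower bound from (4) into (5). In actual training the gradients flow through the reparameterized sample; taking the mean of the critic’s quantiles as Q(st,at) is our assumption.

```python
import numpy as np

def policy_loss_lower_bound(log_p_raw, action, q_value, alpha, temperature, h):
    """Per-sample surrogate for (5) with log-pi replaced by its bound from (4).

    log_p_raw:  log density of the raw mixture sample a' (before the map f).
    action:     mapped portfolio weights a_t (length-h simplex vector).
    q_value:    critic estimate of Q(s_t, a_t), e.g. the mean of the N quantiles.
    alpha:      entropy temperature.
    """
    log_pi_bound = log_p_raw + h * np.log(temperature) - np.sum(np.log(action))
    return alpha * log_pi_bound - q_value
```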

4.2.4 Distributional value function

For a policy π, instead of considering the cumulative return observed at each time t (i.e., the sum of discounted rewards Gt observed from one trajectory of states following the policy π) as a constant, we model it as a distribution to reflect the uncertainty of returns in real financial markets and denote it \({Z^{0}_{t}}\), where \({Z^{0}_{t}} = {\sum }_{i=0}^{T}\gamma ^{i} R_{i}\) and Ri represents the random variable of the ith-step reward. The return distribution, together with the entropy term, is then written as \(Z^{\pi }_{t} = {\sum }_{i=0}^{T}\gamma ^{i} R_{i}+\alpha {\mathscr{H}}(\pi )\). Based on this, the action-value function required in (5) is rewritten as

$$ Q^{\pi}(s,a) = \mathbb{E}_{s_{i} \backsim \mathcal{D}, a_{i} \backsim \pi}[\sum\limits_{i=t}^{T}\gamma^{i-t}R(s_{i},a_{i})+\alpha\mathcal{H}(\pi(\cdot\vert s_{i}))]. $$

In practice, our approximation to this value distribution aims to model the quantile numbers of the target distribution and we call it a quantile distribution. Accordingly, the output of the critic network is a vector of length N that represents N quantiles and its associated discrete cumulative probabilities are \(q_{i} = \frac {i}{N}\) for \(i=1,\dots ,N\) and q0 = 0.

Formally, let \(w:\mathcal {S} \times \mathcal {A} \to \mathbb {R}^{N}\) be the parametric model of our critic network. A probability quantile distribution \(Z^{w}:\mathcal {S}\times \mathcal {A} \to \mathcal {P}(\mathbb {R})\) maps each state action pair (s,a) to a uniform probability distribution supported in {wi(s,a)}. We write \(Z^{w}(s,a) := \frac {1}{N}{\sum }_{i=1}^{N}\delta (w_{i}(s,a))\), where δ(z) represents the Dirac function at \(z \in \mathbb {R}\).

Based on the one-step temporal difference learning, we train our critic network using Quantile Huber Regression [19] that minimizes the distance between the target distribution and the learned distribution.

The Quantile Huber Regression Loss in our problem is expressed as

$$ L_{Z}(w) = \mathbb{E}_{(s_{t},a_{t},r_{t},s_{t+1})\backsim \mathcal{D}}[\sum\limits_{i=1}^{N}\vert q_{i}-\delta(u_{i}<0)\vert \mathcal{L}_{k}(u_{i})] $$
(6)

where \(u_{i} = r(s_{t},a_{t})+\gamma \big( w_{i}^{\prime}(s_{t+1},\theta^{\prime}(s_{t+1})) - \alpha \log \pi_{\theta^{\prime}}(\theta^{\prime}(s_{t+1}) \vert s_{t+1}) \big) - w_{i}(s_{t},a_{t})\), \(\mathcal {D}\) is the experience replay buffer, and \({\mathscr{L}}_{k}\) is the Huber loss.
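A single-transition sketch of the quantile Huber regression loss (6) is given below; the entropy-augmented target ui follows the definition above, while the Huber threshold κ, the discount γ, and the temperature α are illustrative assumptions.

```python
import numpy as np

def huber(u, kappa=1.0):
    """Standard Huber loss L_k; kappa = 1.0 is an assumed threshold."""
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber_loss(pred_q, target_q, reward, log_pi_next,
                        gamma=0.99, alpha=0.2):
    """Single-transition version of (6).

    pred_q:      N learned quantiles w_i(s_t, a_t).
    target_q:    N target-network quantiles w'_i(s_{t+1}, .).
    log_pi_next: log pi(a_{t+1} | s_{t+1}) from the target policy.
    """
    N = len(pred_q)
    q = np.arange(1, N + 1) / N                       # q_i = i / N
    u = reward + gamma * (target_q - alpha * log_pi_next) - pred_q
    return np.sum(np.abs(q - (u < 0.0)) * huber(u))
```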

4.2.5 Stochastic Policy with Distributional Q-network (SPDQ)

At each time step, for any coupled state st = {Xt,Wt} of both the policy network and the critic network, where \(\textbf {X}_{t} \in \mathbb {R}^{l \times (m+1) \times d}\) is the historical stock features and \(\textbf {W}_{t} \in \mathbb {R}^{l \times (m+1)}\) is the previous asset weights, SPDQ encodes the coupled state by treating the feature dimension of st as the channel dimension and feeding st into Conv2D layers with the same padding scheme on the time dimension and the valid padding scheme on the asset dimension. Subsequently, SPDQ merges the 22 asset dimensions into a single dimension while preserving the length of the time horizon. The output is followed by an LSTM layer acting along the time dimension to capture dependencies over the long time horizon. To output a stochastic policy, the encoded state vector is fed directly into Fully Connected layers to generate the means, standard deviations, and mixture weights of the output action. For the critic network, the encoded state vector is concatenated with the predicted action from the policy network and fed into FC layers to output quantile numbers. The overall SPDQ reinforcement learning framework is shown in Fig. 1, and a schematic code sketch of the encoder follows the figure caption.

Fig. 1
figure 1

Proposed Stochastic RL Framework. Here, in the policy network, the coupled states \(\mathbf {s}_{t}\in \mathbb {R}^{l \times (m+1) \times (d+1)}\) are fed as input states into a policy encoder network (upper green block). The encoded state vector is then fed into the FC layers (upper purple block) to generate the means, standard deviations and mixture weights of the output action \(\mathbf {a}_{t} \in \mathbb {R}^{m+1}\). The sampled actions at are realized by performing a reparameterization trick. Subsequently, the newly generated action at is simultaneously fed into the critic encoder network (lower green block) and the financial environment (left blue block), with which it interacts to obtain the reward rt and generate a new state st+1. In the critic network, the encoded layer (lower green block), whose inputs are st and at, is then fed into the FC layers (lower purple block) to generate the quantile numbers of the value distribution Qt. Finally, after estimating Qt and Qt+1, the value distribution is learned using temporal differences
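Complementing Fig. 1, a minimal Keras-style sketch of the policy-side encoder is shown below; the layer widths and kernel sizes are our assumptions, and only the Conv2D-over-assets, LSTM-over-time, FC-heads topology follows the description above.

```python
import tensorflow as tf

def build_policy_encoder(lookback, n_assets, n_features, n_mixtures=3):
    """Encoder plus policy head: Conv2D over assets, LSTM over time, FC heads."""
    s = tf.keras.Input(shape=(lookback, n_assets, n_features))    # (l, m+1, d+1)
    # 'valid' padding with a kernel spanning all assets collapses the asset
    # axis, while a kernel length of 1 on the time axis preserves the horizon.
    x = tf.keras.layers.Conv2D(32, (1, n_assets), padding="valid",
                               activation="relu")(s)               # (l, 1, 32)
    x = tf.keras.layers.Reshape((lookback, 32))(x)
    x = tf.keras.layers.LSTM(64)(x)                                # temporal encoding
    mu = tf.keras.layers.Dense(n_mixtures * n_assets)(x)           # component means
    log_sigma = tf.keras.layers.Dense(n_mixtures * n_assets)(x)    # component stds
    logits = tf.keras.layers.Dense(n_mixtures)(x)                  # mixture weights
    return tf.keras.Model(s, [mu, log_sigma, logits])
```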

Practically, we initialize two neural networks each for the policy and the critic: 𝜃 as the learned policy network, \(\theta ^{\prime }\) as the target policy network, w as the learned critic network, and \(w^{\prime }\) as the target critic network. Given an input state st, we first use the policy network 𝜃 to generate a new portfolio weight at and use it to interact with the financial environment, obtaining a new reward rt and a new state st+1 and forming a one-step trajectory (st,at,rt,st+1). Subsequently, we store the trajectory in a Prioritized Experience Replay Buffer [41] and do not start training until the number of samples in the replay buffer reaches the batch size. During training, for each sampled one-step trajectory (st,at,rt,st+1), we update the policy parameters by minimizing the entropy-regularized policy loss in (5) and update the critic parameters by minimizing the temporal-difference loss in (6). The temperature hyperparameter α, which controls the importance of the entropy term, is updated by minimizing the temperature loss in (7). The loss of α is derived in [17], namely

$$ L(\alpha)=\mathbb{E}_{(s_{t},a_{t}) \backsim \mathcal{D}}[-\alpha\log\pi_{\theta}(a_{t}\vert s_{t})-\alpha\mathcal{H}_{0}] $$
(7)

where \({\mathscr{H}}_{0}\) is the minimum value of entropy.

After each gradient update of the learned networks, the weights of the target networks are slowly adjusted toward the trained networks in the same manner as in DDPG. Importantly, to reduce per-update error caused by noisy inputs, we update the policy network and its target network less frequently than the critic network. The detailed algorithm for SPDQ is summarized in Algorithm 1 and sketched in code after it.

Algorithm 1
figure h

SPDQ for portfolio optimization.
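For readability, a compressed Python sketch of the training loop in Algorithm 1 follows; `env`, `policy`, `critic`, and `buffer` are placeholder objects standing in for the components described above, and the delayed-update interval and τ are assumptions.

```python
def train_spdq(env, policy, critic, buffer,
               episodes=100, batch_size=64, policy_delay=2, tau=0.005):
    """Schematic sketch of Algorithm 1 with placeholder component interfaces."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy.sample(s)                  # reparameterized GMM action
            s_next, r, done = env.step(a)
            buffer.add(s, a, r, s_next)           # prioritized experience replay
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                critic.update(batch, policy)      # minimize quantile loss (6)
                critic.soft_update_target(tau)    # w' <- tau*w + (1 - tau)*w'
                if step % policy_delay == 0:      # delayed policy updates
                    policy.update(batch, critic)  # minimize policy loss (5)
                    policy.update_alpha(batch)    # minimize temperature loss (7)
                    policy.soft_update_target(tau)
            s = s_next
            step += 1
```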

5 Experiment

In this section, we first introduce our data processing techniques and performance metrics. Next, we benchmark the deterministic algorithms DDPG [14] and TD3 [15], and the stochastic algorithms Distributional Deterministic Policy Gradient (D3PG) [19], Proximal Policy Optimization (PPO) [21], and SAC [17], on the provided U.S. stock market data. Subsequently, we evaluate the proposed stochastic algorithm against the listed baseline algorithms. Finally, we interpret the model’s strategy and investigate the impact of different hyperparameter choices through an ablation study.

5.1 Dataset setting and preprocessing

The U.S. stock market data used in our experiments are obtained from Wind (Footnote 1). The time range of the data is from January 2005 to December 2020. This long interval covers several well-known market events, such as the crash of 2008-2009 caused by the subprime mortgage crisis [42] and the ‘meltdown’ in 2020 caused by COVID-19 [43], which diversifies the market states and enables our trading agent to learn from real-world data fluctuations. Each collected stock contains nine different features, ranging from fundamental indexes such as OPEN, CLOSE, LOW, and HIGH to technical indexes such as BOLL and MACD. Concretely, 22 stocks (Footnote 2) are chosen from the top 50 components of the S&P 500 with large trading volumes, so our trading algorithms would not influence the market price. Detailed information on stock names and feature names can be found in Supplementary Tables 4 and 5. In addition, we introduce one cash asset as a risk-free option for the trading agent. Moreover, the period of stock data used in the experiments is given in Table 1. Importantly, each feature is normalized by its value at the start of the look-back window and scaled by a positive factor c.
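A small sketch of our reading of this normalization: each feature series is divided by its value at the start of the look-back window and scaled by c (the value of c is a tuned hyperparameter and is not fixed here).

```python
import numpy as np

def normalize_window(window, c=1.0):
    """window: (lookback, n_assets, n_features) raw feature values.
    Divide each series by its first value in the window, then scale by c
    (c = 1.0 is only a placeholder)."""
    return c * window / window[0:1]
```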

Table 1 Period of stock data used in the experiments

5.2 Performance metrics

We use the following performance metrics to evaluate our algorithms (a computational sketch follows the definitions):

  • Annual Rate of Return (ARR) [44] is the annual average return rate. It is defined as

    $$ ARR = \frac{p_{f}-p_{0}}{p_{0}} \times \frac{T_{year}}{T_{all}} $$

    where pf is the final portfolio value, p0 is the initial portfolio value, Tyear represents the total number of trading days within one year, and Tall is the total number of trading days.

  • Annualized Volatility (AVOL) [44] is the annual average volatility to reflect the average risk of a strategy in a year. It is defined as

    $$ \begin{aligned} AVOL &= \text{Var}\left[\frac{p_{t}-p_{0}}{p_{0}}\right] \times \sqrt{\frac{T_{year}}{T_{all}}} \end{aligned} $$

    where pt is the portfolio value at each step.

  • Annualized Sharpe Ratio (ASR) [44] is the risk-adjusted annual return based on ARR and AVOL. It is defined as

    $$ ASR = \frac{ARR}{AVOL}. $$
  • Maximum DrawDown (MDD) [44] is the maximum loss from a climax to a dip of a portfolio, before a new climax is formed. It reflects the risk of the investment. It is defined as

    $$ MDD = {\max}_{t \in (0,T)}\left\{\frac{{\max}_{t^{\prime} \in (0,t)}\{p_{t^{\prime}}\} - p_{t}}{{\max}_{t^{\prime} \in (0,t)}\{p_{t^{\prime}}\}}\right\}. $$
  • Downside Deviation Ratio (DDR) [44] is the risk-adjusted annual return divided by the Downside Deviation which represents the potential loss that may arise from risk as measured against a Minimum Acceptable Return (MAR) such as bank interest. It is defined as

    $$ DDR = \frac{ARR}{\sqrt{\mathbb{E}[\min\{r_{t}-MAR,0\}^{2}]}}. $$
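For reference, these metrics can be computed from the daily portfolio values and one-step returns as in the sketch below; the 252-trading-day year and MAR = 0 are our assumptions.

```python
import numpy as np

def metrics(p, r, mar=0.0, t_year=252):
    """p: daily portfolio values p_0..p_T; r: daily one-step returns r_t."""
    t_all = len(p) - 1
    simple = (p - p[0]) / p[0]
    arr  = simple[-1] * t_year / t_all                   # ARR
    avol = np.var(simple) * np.sqrt(t_year / t_all)      # AVOL, as defined above
    asr  = arr / avol                                    # ASR
    peak = np.maximum.accumulate(p)
    mdd  = np.max((peak - p) / peak)                     # MDD
    downside = np.sqrt(np.mean(np.minimum(r - mar, 0.0) ** 2))
    ddr  = arr / downside                                # DDR
    return dict(ARR=arr, AVOL=avol, ASR=asr, MDD=mdd, DDR=ddr)
```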

5.3 Results

5.3.1 Experiment settings

Each algorithm, including the benchmarks, is trained by interacting with an artificial financial environment for 100 episodes. Each episode randomly selects a consecutive holding period of 500 time steps within the training period defined in Table 1. After training over one episode, our algorithm is validated on a validation set of 300 time steps to assess its generalization ability. Our algorithm is implemented in TensorFlow on Python and trained on two RTX 2080 Ti graphics cards. The results of the stochastic algorithms are averaged over five replicates to ensure reliability.

In all experiments, we use a replay buffer of size 5000 and only consider behavior policies that are parameterized by Gaussian mixtures. For all the experimented algorithms, we initialize the learning rate of the actor to 5 × 10−4 and the learning rate of the critic to 5 × 10−3, with exponential decay at a rate of 0.5 for both. To optimize α, its learning rate is 1 × 10−3 with a decay rate of 0.9. Furthermore, we use a batch size of 64 for all algorithms. The remaining hyperparameters, including the look-back window size, τ, the risk control factor β, etc., are fine-tuned on the proposed validation set.

5.3.2 Overall performance

We compare the proposed stochastic reinforcement learning algorithm, Stochastic Policy with Distributional Q-Network (SPDQ), with two popular deterministic algorithms (DDPG and TD3), three classic stochastic algorithms (SAC, D3PG, and PPO), and the overall market. In our experiments, the market value is calculated by consistently holding a uniformly weighted portfolio of these 22 stocks. Figures 2 and 3 give the cumulative wealth of the portfolio versus trading days in the U.S. market on the validation and test sets. From the plot of market value in Fig. 3, we observe that the tested period indeed covers the crash during COVID-19 [43]. The deterministic algorithm DDPG experiences the most significant decline in March, although it still outperforms the market. Surprisingly, the other deterministic algorithm, TD3, performs worse and falls far behind DDPG on both the validation and test sets. Notably, among the tested stochastic algorithms, D3PG, which has a distributional critic, is superior to the market, while PPO and SAC, which have a stochastic actor, nearly share the same cumulative wealth as the market. Importantly, the proposed SPDQ consistently beats the market and has the fastest recovery after the ‘meltdown’ in March. Consequently, by comparing SPDQ with SAC, which only uses a stochastic policy, and D3PG, which only uses a distributional Q-function, we observe that the stochastic policy and the distributional Q-function jointly contribute to the final performance.

Fig. 2
figure 2

The Cumulative Wealth in U.S. market on the validation set for different models

Fig. 3
figure 3

The Cumulative Wealth in U.S. market on the test set for different models

Additionally, we evaluate the metrics of the different algorithms in Table 2. From the table, we observe that SPDQ has the best record on the three risk-adjusted metrics. Specifically, SPDQ achieves an ARR of 63.1%, a 108% improvement over the second-ranked D3PG. It also outperforms the market in MDD by roughly 10% (25.5% versus 28.2%). Interestingly, TD3 has the lowest AVOL and MDD. However, this comes at the expense of cumulative returns: it underperforms the market in ARR (8.3% versus 9.1%). Consequently, we conclude that the proposed stochastic framework attempts to maximize cumulative returns at the cost of slightly increased volatility.

Table 2 Performance comparison between different deterministic and stochastic DRL algorithms in the U.S. market

5.3.3 Learning analysis

The learning curves on training and validation sets are given in Fig. 4. Intuitively, we observe that SPDQ has a better convergence property for the Q-Max that approximates the cumulative return on both training and validation sets. It also has a good generalization ability on the validation set. Moreover, SPDQ begins to level off after about 80 thousand training steps. On the contrary, the deterministic algorithms DDPG and TD3 fluctuate a lot during training, although the overall trend is increasing. Importantly, there is a gap between TD3’s training and validation curves, indicating that it may suffer from poor performance regarding unseen data. In summary, SPDQ evidently outperforms TD3 and DDPG on both the training and validation sets, which is consistent with the performance results on the test set.

Fig. 4
figure 4

Learning curves for Q-Max. Q-Max for TD3 and DDPG is calculated by maximizing Q-value in each batch. Q-max for SPDQ is calculated by maximizing the average of all the quantiles in each batch

5.3.4 Trading strategy interpretation

Here, we attempt to investigate the action patterns of the different strategies, i.e., how the weight of each asset changes over time. We discover that for the tested deterministic algorithms, especially DDPG, the weights converge directly to a few assets within ten episodes, after which no further large changes are observed. In other words, the weight of each asset fluctuates above or below a fixed mean that is invariant to time. The proposed stochastic algorithm, however, behaves more diversely than the deterministic ones. We observe that the trading strategy of the proposed SPDQ evolves in mainly three steps. It first distributes weights uniformly over the 22 assets, then focuses on several assets by putting more weight on them. Finally, after training for extended episodes, it converges to the assets that were previously assigned more weight. Notably, a critical reason the proposed stochastic algorithm performs better is that our trading agent excels at selecting a profitable long-term portfolio of assets and chooses to hold it consistently instead of buying and selling shares frequently at every time step.

In addition, we create attribution maps of the input financial features and interpret the long/short actions using the gradient-based methods of Integrated Gradients [45] and AlphaStock [33], which help us quantify and visualize the critical features most valued by our trained model. Specifically, we aggregate the values derived from the Integrated Gradients of the input states st over the whole test period and visualize them using a heatmap. Concretely, we pick ADBE.O, which receives the highest weight after training for 50 episodes, as a case study in Fig. 5(a). Among the nine input features, MACD has the highest score (positive gradients) during the last 15 to 10 days. Since the policy’s objective is to maximize the value function, positive gradients of MACD indicate that if a stock’s MACD keeps increasing over the last 15 to 10 days, the value function will also increase the next day. Consequently, our model considers MACD a signal of future growth of the stock price and thus puts more weight on this asset. Figure 5(b) details how the proposed model executes orders. A selling or buying point is highlighted if the turnover rate is larger than one percent. We observe that the asset’s weight fluctuates around 0.074, which is nearly twice the average weight of 0.043. This finding suggests that our proposed algorithm attaches more weight to profitable assets instead of investing in all assets uniformly.

Fig. 5
figure 5

A case study of the Adobe after training 50 episodes

5.4 Ablation study

In our ablation study, we examine the impact of the number of mixture components on the model’s final performance. As Table 3 demonstrates, when the policy is parameterized by a unimodal Gaussian (k = 1), SPDQ slightly outperforms the market on ASR and has the most stable return, reaching the lowest AVOL among the four options. When we increase the modality (k > 1), we find that this does not necessarily lead to better performance and even gives worse results when k equals four. Interestingly, a bimodal Gaussian parameterization leads to a deficit portfolio and the greatest drawdown, whereas a trimodal Gaussian parameterization produces a strategy that smoothly balances return and risk.

Table 3 Ablation on the effect of Gaussian Mixtures in the U.S. market

Furthermore, Fig. 6 provides a comprehensive ablation study of the impact of the number of mixture components, the reward-risk adjustment factor β, the temperature factor τ in (3), and the length of the look-back window on the average and standard deviation of validation rewards. According to Fig. 6(a), we verify that the trimodal Gaussian parameterization (green curve) has relatively higher average rewards on the validation set while also possessing a lower reward standard deviation. Additionally, in Fig. 6(b), when the reward-risk adjustment factor β equals 0.5, the reward standard deviation is lower than with no risk control until training exceeds 60 episodes. Nevertheless, the standard deviation becomes even higher if β is increased to 1. Moreover, the original Softmax activation function with temperature τ = 1 gives a smooth initial growth of the average rewards in Fig. 6(c); however, it falls dramatically after training for more than 60 episodes. In contrast, τ = 0.1 generates a more stable training process. Figure 6(d) demonstrates that a longer look-back window does not necessarily lead to better performance on the validation set. Instead, L = 20 gives the highest average rewards without taking on too much risk.

Fig. 6
figure 6

Reward learning curves on validation sets for different parameters

6 Conclusions

In this paper, we study continuous portfolio optimization with trading costs via deep reinforcement learning. We benchmark several classic deterministic and stochastic reinforcement learning algorithms on our artificial financial environment. Next, we propose a novel interpretable stochastic reinforcement learning framework for the portfolio optimization problem. Concretely, we build a stochastic policy parameterized by Gaussian Mixtures and a distributional critic realized by quantile numbers to interact with the noisy financial market. Finally, extensive experiments demonstrate that our proposed stochastic algorithm outperforms its deterministic counterparts in terms of controlling risk and gaining profit in the U.S. stock market.

In the future, this research can be extended in the following directions. First, incorporating Conditional-Value-at-Risk (CVaR) into the existing reinforcement learning framework and applying it to the actual financial market is a promising direction, since CVaR is better suited than mean-variance settings to safeguard a decision-maker from risky movements. Second, investigating customized exploration functions for trading agents in reinforcement learning is important and has the potential to outperform blind exploration based on Gaussian distributions.