Multi-objective reward generalization: improving performance of Deep Reinforcement Learning for applications in single-asset trading

We investigate the potential of Multi-Objective, Deep Reinforcement Learning for stock and cryptocurrency single-asset trading: in particular, we consider a Multi-Objective algorithm which generalizes the reward functions and discount factor (i.e., these components are not specified a priori, but incorporated in the learning process). Firstly, using several important assets (BTCUSD, ETHUSDT, XRPUSDT, AAPL, SPY, NIFTY50), we verify the reward generalization property of the proposed Multi-Objective algorithm, and provide preliminary statistical evidence showing increased predictive stability over the corresponding Single-Objective strategy. Secondly, we show that the Multi-Objective algorithm has a clear edge over the corresponding Single-Objective strategy when the reward mechanism is sparse (i.e., when non-null feedback is infrequent over time). Finally, we discuss the generalization properties with respect to the discount factor. The entirety of our code is provided in open-source format.


INTRODUCTION
The algorithm developed by Fontaine and Friedman [17] (see also [6,14]) is a specific variant of Multi-Objective Reinforcement Learning (RL): while we postpone rigorous details to the sections below, we can, for now, informally describe it as an RL variant which simultaneously trains all possible strategies associated with exploring a dynamic environment. Specifically, each strategy is identified by the reward mechanism associated with a given linear combination of multiple, pre-specified (and possibly conflicting) objective functions. This simultaneous learning process over all possible strategies, which gives the user the freedom to specify the combination of the objective functions after the training has been completed, makes the RL variant in [17] highly interpretable and versatile.
While the techniques in [17] have so far been successfully applied in several fields, applications of such a methodology for financial purposes are, to the best of our knowledge, still lacking. In this paper, we show the potential of such a Multi-Objective Reinforcement Learning algorithm, along with meaningful variations of it, in the context of single-asset stock and cryptocurrency applications. We validate our results by deploying the algorithm on several important assets, namely the cryptocurrency pairs BTCUSD, ETHUSDT, XRPUSDT, and the stocks and indexes AAPL, SPY, NIFTY50. Additionally, we discuss the generalization with respect to the discount factor parameter.
We now give more precise details for the terminology so far informally introduced.

Reinforcement Learning: basics
Reinforcement Learning (RL) is a subfield of Machine Learning specifically designed to handle learning processes for problems which involve a dynamic interaction with a given underlying environment. RL techniques have been deployed in many fields, including gaming [31], robotics [25], personalized recommendations [48], and resource management [30].
1.1.1 Single-Objective RL. An RL algorithm learns to use the set of observable state variables 𝒔 (describing the current state of the environment) to take the most appropriate admissible action 𝒂. For a general RL algorithm, the state-action pair (𝒔, 𝒂) and the environment's intrinsic stochasticity determine the next state of the environment that the algorithm will visit; the exact way in which this task is accomplished varies depending on the specific algorithm. For example, RL algorithms of so-called critic-only form (see [43]) aim to maximize a single, cumulative reward (accounting for all actions taken in a given episode), which is in turn based on a pre-specified state-action reward function 𝑅 (assigning a numerical reward 𝑅(𝒔, 𝒂) to every pair (𝒔, 𝒂) of given state and action undertaken). The algorithm uses several episodes (each accounting for an exploration of the environment, which ends when a pre-specified end-state is reached) to train its decision-making capabilities (by progressively updating the so-called 𝑄-values [43] throughout the episodes).
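As a toy illustration of the update of the 𝑄-values just mentioned, the following minimal tabular sketch (our own, not the Deep Q-Network used later in the paper) performs one Bellman update:

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the bootstrapped Bellman target
    # r + gamma * max_a' Q(s', a').
    target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy setting: 2 states, 2 actions, all values initialized to zero.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, reward=1.0, s_next=1)
```

Repeating such updates over many episodes progressively improves the decision-making policy.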
1.1.2 Multi-Objective RL. Multi-Objective RL is a branch of RL devoted to learning in environments with vector-valued, rather than scalar, rewards. Such a setting, which covers realistic scenarios in which conflicting metrics are present, has been gaining a lot of traction lately. Two rather comprehensive surveys on the topic that we are aware of are [38,40]. With methodologies including, among many others, Pareto front-type analysis [39,49], dynamic multi-criteria average reward reinforcement learning [34], convex hull value iteration [3], Hindsight Experience Replay techniques [2], dynamic weights computation [1], tunable dynamics in agent-based simulation [23], Deep Optimistic Linear Support Learning (DOL) for high-dimensional decision problems [33], Deep Q-network techniques [37,44,45], and collaborative agent systems [12], it is safe to say that the Multi-Objective RL paradigm is well-established in several non-finance applications.

Reinforcement Learning in finance
The problems we are interested in concern using RL for profitable, risk-reduced trading of financial instruments based on historical data. These problems are usually quite challenging to tackle using AI, mostly due to three factors: i) the high noisiness of financial data; ii) the subjective definition of the financial environment, and iii) the non-reproducibility of the financial environment (only a single copy of any given time series is available for training).
Despite these difficulties, a prolific literature is available: the summary report [16] provides an exhaustive overview of the main works associated with the three most commonly used RL paradigms (i.e., critic-, actor-, and actor/critic-based techniques) for financial applications. Among critic-based works, we find superiority of RL over standard supervised learning approaches [35], performance improvement assessments with respect to varying reward functions and hyperparameters [8], Deep Q-learning (DQL) extensions to trading systems [22], evolutionary reinforcement learning in the form of genetic algorithms [10,19], identification of seasonal effects [13,46], high-frequency analysis [41], trade execution optimization [36], dynamical allocation of high-dimensional assets [20,24], and hedging basis risk assessment [47]. For actor-based methods, we mention recurrent reinforcement learning baselines [18,32], multi-layered risk management systems [9], and high-level feature extraction [11,21]. Finally, we quote [4,7,29] as representatives of the actor/critic-based category.
1.2.1 Multi-Objective RL in finance. While the Multi-Objective RL paradigm is well-established in several non-finance applications (see the discussion in Subsection 1.1.2), it appears to still be relatively under-explored in the context of financial markets.
In this context, the most common approach to Multi-Objective RL is to indirectly embed the desired multi-reward effects in parts of the model other than the reward mechanism itself (e.g., collaborating market agents [26-28]). Another approach is intrinsically Multi-Objective, but without generalization (i.e., the reward weights are set a priori and are not part of the learning process). This is the case for the two reference works [5,42], which we summarize in Section 3.

OUR CONTRIBUTION
We use a generalized, intrinsically Multi-Objective RL strategy for stock and cryptocurrency trading. We implement this by considering extensions of the Multi-Objective Deep Q-Learning algorithm with experience replay and target network stabilization given in [17], and by deploying it on several important cryptocurrency pairs and stock datasets.

Main Results
Our main findings, which we have validated on several datasets including AAPL, SPY, ETHUSDT, XRPUSDT, BTCUSD, and NIFTY50, are summarized as follows.
• Generalization. We show that our Multi-Objective RL algorithm generalizes well across four reward mechanisms (last logarithmic return, Sharpe ratio, average logarithmic return, and a sparse reward triggered by closing positions).
• Stability of predictions. We use two metrics (Sharpe Ratio and cumulative profit) to show that the predictions of our Multi-Objective algorithm are more stable than those of the corresponding Single-Objective strategy.
• Advantage for sparse rewards. We show that the results of the Multi-Objective algorithm are significantly better than those of the corresponding Single-Objective algorithm in the case of sparse rewards.
• Discount factor generalization.We provide partial evidence of generalization of the discount factor: this parameter is the RL time-regularization parameter in (1) below.
• Impact of fees. Due to the nature of the current underlying RL algorithm (for both the Multi-Objective and Single-Objective versions), trading fees reduce every performance to zero. This behavior is to be expected; see the mitigating circumstances described in Subsections 2.3.2 and 2.3.3 below.

Validation of results
In order to show the robustness of our analysis, we adhere to the following general strategy.
• Large number of datasets. We run our experiments on a large group of single-asset financial time series: in particular, these include cryptocurrency pairs (BTCUSD, ETHUSDT, XRPUSDT) and stocks and indexes (AAPL, SPY, NIFTY50).
• Different initializations. In order to detach the impact of the chosen neural network's random initialization from the actual results, we perform several independent training runs for each dataset. These runs are used to assess the distribution of the gains of our algorithm.
• Cross-validation. We further confirm the effectiveness of our method using both a plain hold-out method and walk-forward cross-validation with anchoring (i.e., where the training set's starting date is the same for all folds).
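Anchored walk-forward folds can be sketched as follows (a simplified illustration with hypothetical fold counts and sizes, not the paper's exact configuration):

```python
def anchored_walk_forward(n, n_folds, test_frac=0.2):
    # Every fold's training window starts at index 0 (anchoring),
    # while the fold end (and hence the test window) rolls forward.
    folds = []
    fold_len = n // n_folds
    for k in range(1, n_folds + 1):
        end = k * fold_len
        split = int(end * (1 - test_frac))
        folds.append((list(range(split)), list(range(split, end))))
    return folds

folds = anchored_walk_forward(n=100, n_folds=4)
```

Each later fold extends the training window while keeping its start date fixed, which preserves the chronological order of the data.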

Practical implications and considerations
We highlight some of the most important aspects of our approach from an applied perspective.

2.3.1 Simplicity and interpretability.
We deploy an intuitive and interpretable Multi-Objective RL algorithm in order to account for several well-established reward mechanisms in single-asset trading. We view this methodology as strongly application-oriented, and as complementary to other existing Multi-Objective strategies for financial problems [5,26,42].

2.3.2 Zero-fee setting. Although our analysis is conducted in a zero-fee market context, such a context nonetheless applies to many meaningful markets. For retail traders, zero-fee spot trading is possible on multiple markets, for example BTCUSD spot trading on the cryptocurrency exchange Binance. For institutional traders, even more opportunities exist, since they can often operate in low-, zero-, or fixed-fee market contexts by working directly with market makers.

2.3.3 Underlying RL algorithm. Our main focus is to compare our generalized Multi-Objective RL methodology to the corresponding Single-Objective strategy: said differently, we do not dwell on maximizing the performance of the underlying RL strategy (in fact, we choose a rather simple critic-based RL variant), but rather on evaluating the difference between the Single- and Multi-Objective methods for the same underlying RL method.
Remark 1. The adaptation of this methodology to more sophisticated underlying RL methods is deferred to future work.

Structure of the paper
We summarize the main contributions of the related works [5,42] in Section 3. We provide the abstract setup of our proposed algorithm in Section 4, and fill in the necessary quantitative details in Section 5. After spelling out the main technical features of our code (see Section 6), we discuss our main results in Section 7. Conclusions and future outlook are given in Sections 8 and 9, respectively. The precise details concerning our chosen underlying RL algorithm are provided in Appendix A.

RELATED WORK
The works [5,42] use multi-reward scalarization to improve on the following well-established benchmark strategies in the context of price prediction in single-asset trading: i) an actor-only RL algorithm with total portfolio value as the single reward, and ii) a standard Buy-and-Hold strategy. More specifically, the authors take as rewards the mean and standard deviation from the classical definition of the Sharpe ratio, and combine them with pre-defined weights to favor risk adjustment. In another variation, the resulting scalarized metric is modified to further penalize negative volatility.
The authors use a two-block LSTM neural network to directly map the last taken action (Buy/Sell/Hold) and the available state variables to the next action. The first LSTM block is used for high-level feature extraction, and the second for reward-oriented decision-making. From the experimental results, the authors conclude that their method is superior to the two benchmarks in terms of cumulative profit and algorithm convergence, although an analysis of statistical significance is not provided.
While the scalarization approach is effective, it has the downside of requiring the balance of the individual rewards (via their weights) to be specified a priori. This introduces a human factor into the balancing of rewards, and also restricts the scope of the learning process. Driven by this, we choose not to scalarize the reward metrics, so that the weights can be included in the learning process. To the best of our knowledge, ours is the first application of Multi-Reward RL in the sense of [17] to financial data.

ABSTRACT DEFINITION OF THE MODEL
We consider variations of the Deep Q-Learning algorithm with Hindsight Experience Replay and target network stabilization [43] (DQN-HER), for both the standard Single-Reward and the Multi-Reward structure (in the sense of [17]), and apply them to single-asset trading problems.

The classical abstract setup
The basic structure of (DQN-HER) is concerned with maximizing cumulative rewards of the type

𝐺 = Σ_{𝑡 ≥ 0} 𝛾^𝑡 𝑅(𝒔_𝑡, 𝒂_𝑡),  (1)

where 𝛾 ∈ (0, 1) is the so-called discount factor. The discount factor determines the time preference of the agent and regularizes the reward as 𝑡 → ∞; a small discount factor makes short-term rewards more favorable. The algorithm fits a neural network taking the current state 𝒔_𝑡 as input and giving an estimate of the maximum cumulative reward of type (1) achievable by subsequently taking each permitted action 𝒂_𝑡. The learning process is linked to the Bellman equation update

𝑄(𝒔_𝑡, 𝒂_𝑡) ← 𝑄(𝒔_𝑡, 𝒂_𝑡) + 𝛼 [𝑅(𝒔_𝑡, 𝒂_𝑡) + 𝛾 max_𝒂 𝑄(𝒔_{𝑡+1}, 𝒂) − 𝑄(𝒔_𝑡, 𝒂_𝑡)],  (2)

for a given learning rate 𝛼 ∈ (0, 1), where 𝑅(𝒔_𝑡, 𝒂_𝑡) is the reward for taking action 𝒂_𝑡 in state 𝒔_𝑡.
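In code, the discounted sum in (1) can be computed as follows (an illustrative sketch of our own, with arbitrary reward values):

```python
def discounted_return(rewards, gamma):
    # sum_t gamma^t * r_t: a small gamma discounts future rewards
    # heavily, favoring short-term gains.
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```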
4.1.1 Multi-Reward adaptation in the sense of [17]. With respect to the previous case, the neural network's input is augmented by a reward weight vector 𝒘, which is used to compute the total reward 𝒘 • 𝑹(𝒔_𝑡, 𝒂_𝑡) (here 𝑹 is the vector of rewards, and • denotes the standard scalar product). The Single-Reward case can be seen as an instance of the Multi-Reward case with constant, suitably chosen one-hot encoding vectors 𝒘. The (DQN-HER) algorithm is summarized in Algorithm 1 for the reader's convenience.
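The scalarization 𝒘 • 𝑹 can be sketched as follows (a minimal illustration of our own; the reward ordering and values are hypothetical):

```python
import numpy as np

def total_reward(w, rewards_vec):
    # Scalarize the reward vector with weight vector w (scalar product);
    # a one-hot w recovers the Single-Reward case.
    return float(np.dot(w, rewards_vec))

R = np.array([0.02, 0.5, 0.01, 0.0])    # hypothetical reward vector
multi = total_reward(np.array([0.25, 0.25, 0.25, 0.25]), R)
single = total_reward(np.array([0.0, 1.0, 0.0, 0.0]), R)   # one-hot weights
```

Since 𝒘 is an input of the network rather than a fixed constant, the user can choose the weighting after training has been completed.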

Our abstract setup
The methodology we adopt in this paper is summarized in Algorithm 2, and stems from the underlying basic Algorithm 1. For the sake of clarity, Algorithm 2 highlights only the changes between the two algorithms. These modifications are: i) an option for a 'random access point': subject to specification by the user, each training episode may use a subset of the full price history of the training set, where the starting point is randomly sampled and the length is fixed; ii) the generalization of the discount factor 𝛾 as suggested in [17]: the neural network's input also comprises the discount factor 𝛾 (i.e., the input is augmented from (𝒔, 𝒘) to (𝒔, 𝒘, 𝛾)); iii) the specific choice of normalization spelled out in Subsection 4.2.1 below.
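The 'random access point' option of item i) can be sketched as follows (the helper name and RNG handling are our own choices):

```python
import random

def sample_episode_window(prices, episode_len, rng=None):
    # Randomly pick the starting point of a contiguous, fixed-length
    # window of the training price history.
    rng = rng or random.Random()
    start = rng.randrange(0, len(prices) - episode_len + 1)
    return prices[start:start + episode_len]

history = list(range(1000))               # stand-in for a price series
window = sample_episode_window(history, episode_len=100)
```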

SPECIFIC DETAILS OF OUR MODEL
After having laid out the general structure of our RL setup (see Algorithm 2), we give precise substance to all quantities involved.

State variables 𝒔_𝑡
We define the state variables 𝒔_𝑡 as the vector comprising both the current position in the market (which we name 𝑝_𝑡, and whose precise details are given in Subsection 5.2 below) and a fixed lookback of length ℓ over the most recent logarithmic returns of the close prices {𝑐_𝑡}_𝑡: more explicitly, we set

𝒔_𝑡 := (𝑝_𝑡, log(𝑐_{𝑡−ℓ+1}/𝑐_{𝑡−ℓ}), …, log(𝑐_𝑡/𝑐_{𝑡−1})).  (5)
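The construction of (5) can be sketched as follows (our own helper, assuming the position indicator is prepended to the last ℓ log returns of the close prices):

```python
import math

def make_state(position, closes, lookback):
    # State = current trading position followed by the last
    # `lookback` log returns of the close prices.
    log_returns = [math.log(closes[i] / closes[i - 1])
                   for i in range(len(closes) - lookback, len(closes))]
    return [float(position)] + log_returns

state = make_state(position=1, closes=[100.0, 101.0, 99.0, 102.0], lookback=2)
```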

Admissible Actions and Positions
As far as actions are concerned, we analyze two scenarios:
• Long Positions only (LP): the agent is only allowed to perform two actions (Buy/Hold), and can consequently only switch between the trading positions Long/Neutral.
• Long and Short Positions (L&SP): the agent is allowed to perform three actions (Buy/Sell/Hold), and can consequently switch between the trading positions Long/Short/Neutral.

Rewards and Profit
For a given single-asset dataset with close prices {𝑐_𝑡}_𝑡, we define the logarithmic (portfolio) return at time 𝑡 as

ℓ_𝑡 := log(𝑐_𝑡/𝑐_{𝑡−1}).  (6)

Let 𝑘 ∈ ℕ be fixed. We focus on three well-established rewards (at a given reference point in time 𝑡), namely: i) the last logarithmic return (LR), which is exactly (6); ii) the average logarithmic return (ALR), given by ALR := mean{ℓ_𝑖}_{𝑖=𝑡−(𝑘−1)}^{𝑡}; iii) the non-annualized Sharpe Ratio (SR), given by SR := mean{ℓ_𝑖}_{𝑖=𝑡−(𝑘−1)}^{𝑡} / std{ℓ_𝑖}_{𝑖=𝑡−(𝑘−1)}^{𝑡}; as well as the sparse, less conventional reward: iv) a 'profit-only-when-(position)-closed' (POWC) reward, which is non-zero only when a position is closed at time 𝑡, in which case it equals log(𝑐_𝑡/𝑐_{𝑡_𝑙}), where 𝑡_𝑙 is the time of the last trade (i.e., the last position change).
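The four rewards can be computed as sketched below (our own implementation of the definitions above; in particular, the conditional form of POWC reflects our reading of 'profit only when the position is closed'):

```python
import math
import statistics

def log_returns(closes):
    return [math.log(closes[i] / closes[i - 1]) for i in range(1, len(closes))]

def compute_rewards(closes, k, last_trade_idx, position_closed):
    lr_all = log_returns(closes)
    window = lr_all[-k:]                      # last k log returns
    LR = lr_all[-1]                           # i) last logarithmic return
    ALR = statistics.mean(window)             # ii) average logarithmic return
    SR = ALR / statistics.stdev(window)       # iii) non-annualized Sharpe ratio
    # iv) POWC: non-zero only when the position is closed at the current step
    POWC = math.log(closes[-1] / closes[last_trade_idx]) if position_closed else 0.0
    return LR, ALR, SR, POWC

LR, ALR, SR, POWC = compute_rewards([100.0, 101.0, 103.0, 102.0], k=3,
                                    last_trade_idx=0, position_closed=True)
```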

EXPERIMENTS
We substantiate all necessary components involved in the simulations of our model (given in Algorithm 2).

Codebase
For the structure of our code, we took some inspiration from two open-source repositories: the FOREX and stock trading environment at https://github.com/AminHP/gym-anytrading and the minimal Deep Q-Learning implementation at https://github.com/mswang12/minDQN .

6.1.1 Open-source directory and reproducibility. Our entire code is provided in open-source format at https://github.com/trality/fire . In particular, the instructions for reproducibility are contained in the README.md file therein, and we provide the entirety of the datasets considered in our experiments.

Datasets
We perform several runs of experiments on a variety of relevant single-asset datasets, both in cryptocurrency and stock trading.
In the interest of increasing the training capabilities of our experiments (see Subsection 7.4 below), we always include an evaluation set in addition to the train and test sets. The portions of the data associated with the training/evaluation/test sets are roughly train: 64%, evaluation: 16%, test: 20%. All datasets are of sufficient length to provide a reasonable compromise between experiment running times and significance of predictions.
6.2.1 Cryptocurrency pairs. We consider the following datasets: hourly data points for the BTCUSD pair (August 2017-June 2020); hourly data points for the ETHUSDT pair (August 2017-June 2020); hourly data points for the XRPUSDT pair (May 2018-March 2021).
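The chronological split can be sketched as follows (helper name ours; fractions from the text):

```python
def chrono_split(series, train=0.64, evaluation=0.16):
    # Chronological split (no shuffling): roughly 64% / 16% / 20%.
    n = len(series)
    i = int(n * train)
    j = int(n * (train + evaluation))
    return series[:i], series[i:j], series[j:]

tr, ev, te = chrono_split(list(range(100)))
```

Keeping the time order intact avoids look-ahead leakage between the three sets.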

6.2.2 Stocks and indexes.
We consider AAPL, SPY, and NIFTY50. The date range for AAPL and SPY is January 2000-September 2022 (daily data points), while for NIFTY50 it is March 2020-June 2020 (minute data points).
A snapshot example from the datasets BTCUSD and NIFTY50 is shown in Figure 1.

Quantities of interest and benchmarks
All our considerations will be based on the following, quite standard, quantities:
• Total Reward: the cumulative reward over the considered portion of the dataset.
• Total Profit: the cumulative gain/loss obtained by buying or selling with all the available capital at every trade.
• Sharpe Ratio: the average return per step, divided by the standard deviation of all returns.
Crucially, the results of Multi- and Single-Reward simulations are compared against each other, as well as, individually, against the Buy-and-Hold strategy.
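The latter two quantities can be computed as sketched below (a minimal implementation of our own, where step_returns denotes the per-step portfolio returns):

```python
import statistics

def sharpe_ratio(step_returns):
    # Average return per step divided by the standard deviation of all returns.
    return statistics.mean(step_returns) / statistics.stdev(step_returns)

def total_profit(step_returns):
    # Cumulative gain/loss when the whole available capital is traded
    # at every step (returns compound multiplicatively).
    capital = 1.0
    for r in step_returns:
        capital *= (1.0 + r)
    return capital - 1.0

sr = sharpe_ratio([0.01, 0.02, -0.01, 0.02])
tp = total_profit([0.10, -0.05])
```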
6.4 Measures for code efficiency
6.4.1 Basic measures. The most important measures taken in this regard are as follows. Firstly, as we are primarily interested in assessing the potential superiority of a Multi-Reward approach over a Single-Reward one (see the discussion in Subsection 2.3.3), we stick to a simple Multi-Layer Perceptron (MLP) neural network (Algorithm 1, line 2). Secondly, for the purpose of checking performance during training, we run the currently available model on the full training and evaluation sets only for an evenly distributed subset of episodes.
6.4.2 Option of random access point. If we choose walk-forward cross-validation, the training in each episode is performed on a randomly selected, contiguous subset of the full training set with pre-specified length (this reduces the overall training cost).

Vectorized computation of 𝑄-values.
Except for the trading position 𝑝_𝑡, the time evolution of the state variables vector given in (5) is entirely predictable (as the prices 𝑐_𝑡 obviously do not change in between training episodes). This implies that, given a predetermined set of actions, the algorithm can efficiently vectorize the evaluation of the neural network for each separate trading position, and then deploy the results to speedily compute the associated 𝑄-values for future steps. This method is feasible because the cardinality of the admissible values of the non-predictable state variable (i.e., the trading position) is low (three at most, in the L&SP case).
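The batching idea can be sketched as follows (with a toy stand-in for the network; only the vectorization pattern is illustrated, not the actual MLP):

```python
import numpy as np

def q_values_per_position(net, lookback_windows, positions=(-1, 0, 1)):
    # The price-derived part of the state is fully predictable, so it can be
    # precomputed once; only the position (at most 3 values) varies.  We run
    # one batched network evaluation per admissible position.
    out = {}
    for p in positions:
        batch = np.hstack([np.full((len(lookback_windows), 1), float(p)),
                           lookback_windows])
        out[p] = net(batch)               # Q-values, shape (T, n_actions)
    return out

toy_net = lambda x: x @ np.ones((x.shape[1], 3))  # stand-in for the MLP
windows = np.zeros((5, 4))                        # 5 time steps, lookback 4
q = q_values_per_position(toy_net, windows)
```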

RESULTS
The results arising from several Multi- and Single-Reward experiments (using Algorithm 2) on several datasets (BTCUSD, ETHUSDT, XRPUSDT, AAPL, SPY, NIFTY50) give us four general indications, which we discuss in detail in as many dedicated subsections below. We support our analysis using different types of plots, namely: i) the distribution of performance (box-plots for various rewards over the train/eval/test sets) obtained using several independent experiment realizations (see Figure 20 for an example); ii) cumulative rewards on the train/eval sets as functions of the epochs (see Figure 6 for an example); and iii) for each epoch, the Sharpe Ratio on the eval set of the current best model (see Figure 17 for an example).

Multi-Reward generalization properties
The first crucial conclusion is that the Multi-Reward strategy generalizes well over all the different rewards, as can be seen in essentially all plots comparing Multi- and Single-Reward runs. Firstly, the generalization can be observed in average terms, for instance in Figure 20 (Single vs. Multi for the SR metric on AAPL and BTCUSD), in Figure 4 (Single vs. Multi for the LR metric on ETHUSDT and BTCUSD with walk-forward validation), and from the comparison of Figures 2 and 3. Secondly, the generalization is also visible for cumulative rewards in Figures 15 & 16 (Single vs. Multi for the SR metric on BTCUSD and the ALR metric on NIFTY50). Thirdly, as far as predictive power is concerned, the Multi-Reward method performs as well as, and sometimes better than, its Single-Reward counterpart; see Figures 17-19.
Remark 2. The training saturation levels may differ from those of the corresponding Single-Reward simulations; this is likely caused by an apparent regularization effect of the Multi-Reward setting.
Remark 3. The performance in the case of non-null fees (see Figure 5) is poor: this is not related to the generalization (which appears to hold also in this case), but rather to the simple nature of the underlying RL algorithm; see the discussion in Subsection 2.3.3.

Multi-Reward improvement on strongly position-dependent rewards
Let us consider a trading reward which i) is strongly dependent on a specific trading position, and ii) is sparse (meaning that it might take several time steps for such a reward to return a non-zero value).
Intuition suggests that the Single-Reward RL algorithm will struggle to learn from such a reward. On the other hand, a Multi-Reward algorithm is expected to perform better, due to the influence of easier rewards with different but similar goals. Furthermore, the performance difference between the Multi-Objective and Single-Objective methods is expected to be even more pronounced when fewer trading positions are allowed (thus further restricting the Single-Reward algorithm's capability to learn). Below, we confirm these intuitions for the POWC reward, which satisfies i) and ii) above, by running the Multi-Reward RL code with all four rewards considered in Subsection 5.3 in its dictionary.
7.2.1 Case LP. When opening Short positions is not allowed, the POWC reward provides non-zero feedback only when Long positions are closed. This extremely sparse feedback is the likely explanation of the poor cumulative training performance (Figures 6 and 7), where the Single-Reward run saturates the training at a much lower level than the Multi-Reward one. In contrast, the training for the Multi-Reward algorithm is much more consistent, as it can benefit from rewards with goals similar to POWC but with more frequent feedback (e.g., LR). In the Multi-Reward case, the prediction performance exceeds the Buy-and-Hold threshold on the BTCUSD dataset in a more consistent and stable way than in the training saturation regime of the Single-Reward case. Furthermore, the average of Long positions is lower (which means less risk is taken) and is also relatively stable. The difference in strategy between the Multi- and Single-Reward algorithms can be inferred from the different convergence of Long positions. The analysis is further consolidated by the stark contrast in saturation levels in the distributional plots for XRPUSDT and ETHUSDT; see Figures 8-9.
(The average of Long positions measures the time of capital exposure to the market.)
7.2.2 Case L&SP. Apart from the training saturation levels, relevant differences are not noted. This is also visible in the distributional plots (for instance, consider SPY in Figures 2 and 10).

Random discount factor generalization
We have run several simulations with both a fixed and a randomly sampled discount factor (as suggested in [17]) on all datasets. The results are consistent across all simulations; therefore, we only show the results for the BTCUSD dataset in Figures 13 and 14 (Multi-Reward case only, for the reward SR, in both the LP and L&SP cases), as they are representative of all the remaining simulations.
We noticed the following general trends.
• Generalization. A graphical comparison of performance indicators suggests that the algorithm generalizes with respect to the value of the random discount factor.
• Training saturation levels. There is a visible difference in saturation levels, with the random discount factor version saturating at a consistently lower level than its fixed discount factor counterpart. It is plausible that the discount factor generalization serves the purpose of a neural network regularizer. The difference is more pronounced in the L&SP case, i.e., when the agent is allowed to open Short positions.
• Evaluation set saturation levels and average positions. No significant differences are noticeable between the two cases.
Taking everything into account, our simulations point in the direction of validating the discount factor generalization suggested in [17]. Nevertheless, more extensive testing is necessary in order to fully confirm this.

Consistent indications for predictability
Although a full statistical justification of the obtained results is beyond the scope of this paper, we have nonetheless obtained indications that validate the effectiveness and robustness of the Multi-Reward approach. We further detail this statement.
7.4.1 Consistent performance with respect to the Buy-and-Hold strategy. In the majority of simulations, the Multi-Reward method is capable of improving over the Buy-and-Hold benchmark (in terms of the Sharpe ratio) on the evaluation and, more importantly, on the test set (Figure 20). Our validation has primarily relied on a distributional analysis across several experiments with independent initializations (Figures 2, 3). We observe that the performance on the evaluation set (in terms of the Sharpe Ratio) is consistently good. In particular, the performance of the Multi-Reward model is at least as good as that of the Single-Reward model, while also being much more stable (further work will be needed to account for the exact impact of intrinsic RL noise). Furthermore, the profits on the test sets are higher and more consistent in the Multi-Reward case (especially for NIFTY50, see Figure 19). The results in Figures 17, 18, 19 also show that the performance on the test set is only loosely correlated with that on the evaluation set. This is likely due to the noisiness of the learning process and a marked difference between the evaluation and test sets. In any case, the performance on the evaluation set is a reliable indication of the improved stability, as it depends on the stability of the learning process.

CONCLUSIONS
Firstly, we have validated the generalization properties of a Multi-Reward Reinforcement Learning method with Hindsight Experience Replay (in the variant given by [17]) by running experiments on several important single-asset datasets (AAPL, SPY, ETHUSDT, XRPUSDT, BTCUSD, NIFTY50).
Secondly, even though a full statistical analysis would require further work, we have provided a number of consistent statistical indicators confirming the improved stability of our Multi-Reward method over its Single-Reward counterpart: distribution of performance over independent runs, convergence of prediction indicators, and profits for best performing models.
Thirdly, we have verified that the Multi-Reward method has a clear edge over its Single-Reward counterpart in the case of sparse, heavily position-dependent reward mechanisms.
Finally, we have partially validated the generalization property regarding the discount factor (as suggested by [17]), even though more work is required to consolidate the claim. The obtained results are, in nearly all occasions, subject to noise: more specifically, relevant performance indicators occasionally oscillate around, rather than approach, their limiting value.

FUTURE OUTLOOK
While we have highlighted some of the benefits that a Multi-Reward approach has over a Single-Reward approach for the predictive properties of a critic-only RL paradigm on single-asset financial data, many questions remain only partially answered, or even wide open.
Firstly, while the Multi-Reward approach can stabilize and improve the results of some rewards using the other ones, it is not clear exactly how the rewards influence each other (normalization approaches different from (3) could be investigated). Secondly, it would be interesting to conduct more thorough research into different lengths and non-uniform sampling mechanisms for the experience replay (see [15]). Thirdly, a more thorough analysis of the use of a random discount factor should be conducted. Fourthly, one might perform a sensitivity analysis on more hyperparameters. Fifthly, a more in-depth analysis of the predictive power, providing statistical evidence, is still needed.

Figure 2 :
Figure 2: Distribution of the performance of multiple experiments (9, 13, 14, 17) with different random initialization for different assets on training, evaluation and test datasets, with multi reward.

Figure 3 :
Figure 3: Distribution of the performance of multiple experiments (7, 8) with different random initialization for different assets on training, evaluation and test datasets, with single reward (SR).

Figure 4 :
Figure 4: Distribution of the performance of multiple experiments (10) with different random initialization. Results are shown for ETHUSDT and BTCUSD on training, evaluation and test datasets. We compare multi and single reward using anchored walk-forward validation.

Figure 5 :
Figure 5: Distribution of the performance of multiple experiments (8, 10) with different random initialization for ETHUSDT on training, evaluation and test datasets, with multi and single (POWC) reward including fees of 0.03%.

Figure 8 :
Figure 8: LP case: Distribution of the performance of multiple experiments with different random initialization for XRPUSDT and POWC on training, evaluation and test datasets, with multi and single reward.

Figure 9 :
Figure 9: LP case: Distribution of the performance of multiple experiments with different random initialization for ETHUSDT and POWC on training, evaluation and test datasets, with multi and single reward.

Figure 10 :
Figure 10: Distribution of the performance of multiple experiments (7) with different random initialization for different assets on training, evaluation and test datasets, with single reward (POWC).

7.4.2 Comparing performance on the training, evaluation, and test sets. A commonly used model selection strategy is to pick the best performing model on the evaluation set. In Figures 17, 18, 19, the performance of this best performing model is shown (for training, evaluation, and test) as it progresses through the episodes.

Figure 17 :
Figure 17: Top (Bottom): Best model for profits (performance for training/evaluation/test set based on SR), BTCUSD

Figure 18 :
Figure 18: Top (Bottom): Best model for profits (performance for training/evaluation/test set based on ALR), BTCUSD

Figure 19 :
Figure 19: Top (Bottom): Best model for profits (performance for training/evaluation/test set based on SR), NIFTY50

8.1 Limits of the RL setting

Figure 20 :
Figure 20: Distribution of the performance of multiple experiments with different random initialization for BTCUSD and AAPL on training, evaluation and test datasets, with multi and single reward.
Length of the Experience Replay. We exclusively use a same-age type of experience replay: more precisely, the experience replay's oldest element (measured in the number of network updates following its creation) has the same age in both the Single- and Multi-Reward cases. In the Multi-Reward case, where extra, not-visited experiences are added to the replay (Algorithm 1, line 19), this results in a longer replay.
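The same-age eviction rule can be sketched as follows (a hypothetical minimal buffer illustrating only the age bookkeeping, not the full Algorithm 1):

```python
from collections import deque

class SameAgeReplay:
    # Evict strictly by age: after each network update, drop any element
    # older than `max_age` updates.  The oldest element therefore has the
    # same age in the Single- and Multi-Reward cases; the Multi-Reward
    # buffer is simply longer, since extra experiences are inserted.
    def __init__(self, max_age):
        self.max_age = max_age
        self.buffer = deque()        # pairs (insertion_update, transition)
        self.updates = 0

    def add(self, transition):
        self.buffer.append((self.updates, transition))

    def on_network_update(self):
        self.updates += 1
        while self.buffer and self.updates - self.buffer[0][0] > self.max_age:
            self.buffer.popleft()

replay = SameAgeReplay(max_age=2)
replay.add("t0")
for _ in range(3):                   # after 3 updates, "t0" is too old
    replay.on_network_update()
```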