Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

Approximation of the value functions in value-based deep reinforcement learning induces overestimation bias, resulting in suboptimal policies. We show that when the reinforcement signals received by the agents have a high variance, deep actor-critic approaches that overcome the overestimation bias lead to a substantial underestimation bias. We first address the detrimental issues in the existing approaches that aim to overcome such underestimation error. Then, through extensive statistical analysis, we introduce a novel, parameter-free Deep Q-learning variant to reduce this underestimation bias in deterministic policy gradients. By sampling the weights of a linear combination of two approximate critics from a highly shrunk estimation bias interval, our Q-value update rule is not affected by the variance of the rewards received by the agents throughout learning. We test the performance of the introduced improvement on a set of MuJoCo and Box2D continuous control tasks and demonstrate that it considerably outperforms the existing approaches and improves the state-of-the-art by a significant margin.


I. INTRODUCTION
Policy optimization in reinforcement learning (RL) has achieved notable successes in a wide range of sequential decision-making tasks, such as robotic control [1], [2] and time-series prediction [3]. However, in the deep setting of RL, where deep neural networks approximate value functions and policies, several issues arise [4]. One difficulty originating from function approximation is the systematic estimation bias, which prevents learning agents from attaining maximum performance and limits the applicability of deep techniques to diverse real-world tasks [4], [5]. For discrete action spaces, the estimation bias in the value estimates has been widely investigated for value-based RL algorithms [6]-[10]. In addition, similar work has been done in continuous action domains with actor-critic techniques for one subtype of the estimation bias, namely, overestimation bias [4]. However, we demonstrated in our recent work [11] that actor-critic methods that overcome the overestimation bias and accumulated high variance induce an underestimation bias in the action-value estimates.
In continuous control, the estimation bias in the action-value estimates is generally examined in terms of underestimation and overestimation [11]. Overestimation bias, caused by the maximization of noisy estimates in traditional Q-learning [12], results in a cumulative estimation error in the action values (state-action values or Q-values) throughout the learning stage [4]. As deep neural networks represent the action and value functions in the deep RL setting, such function approximation noise is inevitable [4]. Due to temporal difference (TD) learning [5], this inaccuracy in the value estimation is further amplified [4]. The underestimation bias, in contrast, is an outcome of Q-learning [12] variants that focus on eliminating the accumulated overestimation bias [11]. Although a recent objective function proposal in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm [4], Clipped Double Q-learning, is shown to eliminate the overestimation bias and accumulated variance, it can nevertheless decrease an RL agent's performance by assigning low values to optimal state-action pairs and thus may result in suboptimal policies and divergent behaviors [11].
In the Clipped Double Q-learning algorithm [4], two Q-networks with identical structures and different parameters are initialized before the learning process [4]. The minimum of these critics' estimates is used to form the objective of the Q-networks during learning. Despite the decoupled actor and critics, using the minimum Q-value in learning the targets results in persistent underestimation of the state-action values [11]. Recent works, Weighted Delayed Deep Deterministic Policy Gradient (WD3) [13] and Triplet-average Deep Deterministic Policy Gradient (TADD) [14], focus on this underestimation bias in the TD3 algorithm [4] and introduce a linear combination of functions of the action-value estimates in forming the objective of the Q-networks. Although the objective functions proposed in [13] and [14] are shown to reduce the underestimation bias and improve the state-of-the-art, their theoretical assumptions on the underestimation of Q-values are either overly strong or infeasible, which prevents these approaches from being adapted to off-policy learning. Furthermore, our recent work on the underestimation bias, Triplet Critic Deep Deterministic Policy Gradient (TCD3) [11], heuristically searches for an alternative Q-network objective and proposes a combination of three approximate critics. However, maintaining three deep networks comes at a considerable computational cost compared to the TD3 algorithm [4].
In this paper, we extend our previous study [11] on the estimation bias by first examining the current strategies that aim to overcome the underestimation bias in deterministic policy gradient (DPG) [15] algorithms. We address the detrimental issues in these algorithms and explain the infeasible assumptions in their theoretical background. We then derive a closed-form expression for the estimation error yielded by the Clipped Double Q-learning algorithm [4] and our previous work TCD3 [11], without any statistical assumptions that violate the off-policy RL paradigm; this expression was not introduced in our previous work. Using the derived closed-form expressions, we introduce a new variant of Deep Q-learning [16], Stochastic Weighted Twin Critic Update, that achieves superior performance to our previous work while using only two critics and hence having 33% lower time complexity. Our approach derives a parameter-free linear combination of functions of two approximate critics, the weights of which are sampled from a bias interval that corresponds to a significantly smaller underestimation bias than the existing approaches. In addition to our previous work on the underestimation bias, the main contributions of this study can be summarized as follows:
• We first address the issues with the existing algorithms that focus on the underestimation in deterministic policy gradient [15] methods. We explain why the statistical assumptions made in those works cannot be adapted to off-policy deep RL in continuous action spaces.
• We derive a closed-form expression for the estimation bias in the Clipped Double Q-learning algorithm [4] and TCD3 [11], and theoretically show that if the rewards that the agent receives vary on a large scale, the underestimation of the action-value estimates detrimentally increases.
• Through an extensive statistical analysis of the expected error in the existing approaches and the derived closed-form expressions, we introduce a stochastic Q-network update rule in which weights are sampled from a bias interval that is substantially smaller than the expected errors in the existing approaches and TCD3 [11].
• We empirically verify our claims by comparing the actual and estimated Q-values produced by the WD3 [13] and TADD [14] algorithms and demonstrating that an increasing variance of the received reward signals increases the underestimation throughout learning.
• Our method is not affected by the variance of the reward signals as it samples the weights of the Q-networks from an interval, the lower bound of which is linearly decreased. An extensive set of empirical studies on several challenging OpenAI Gym [17] tasks reflects our theoretical claims and shows that the introduced approach dramatically outperforms the competing methods and improves on our previous study.
• The source code of our algorithm is publicly available at our GitHub repository to ensure reproducibility.

II. RELATED WORK
Prior studies on the approximation error in reinforcement learning were conducted in [18], [19] in terms of the estimation bias and the resulting high variance build-up. This paper focuses on one of the outcomes of function approximation error, namely, the underestimation of action values. In the following, we review the background of the estimation error phenomenon in deep reinforcement learning.

A. Discrete Action Spaces
The estimation error induced by the maximization of Q-values has been extensively studied in discrete action spaces. For Deep Q-learning [16], many techniques were proposed to mitigate the impact of the overestimation bias caused by function approximation and policy optimization. Van Hasselt et al. [6] address the function approximation error for discrete action spaces in their work, Deep Double Q-learning (DDQN), which is one of the successor studies to Deep Q-learning [16]. By employing two independent and identically structured Q-value approximators, DDQN [6] obtains unbiased Q-value estimates. Lan et al. [7] modify Deep Q-learning [16] through the use of multiple action-value estimators. Their approach, Maxmin Q-learning [7], uses multiple action-value estimates selected through partial maximum operators, the minimum of which constructs the Deep Q-learning target [16]. Although [7] primarily aims to eliminate the overestimation, the authors show that their method may result in underestimation [7]. Additionally, methods that employ multi-step returns are shown to overcome the estimation bias [8]-[10], [20], [21] and proven to be effective through distributed approaches [9], weighted Q-learning [20], and importance sampling for off-policy correction [8]-[10], [22]. However, these methods introduce a trade-off between the biased action-value estimates and accumulated variance, as shown by [22]. Furthermore, as solutions to the bias-variance trade-off, these approaches rely on horizons that are impractically longer than those of one-step methods [22]. Moreover, [23] proposed a one-step improvement that reduces the contribution of each erroneous estimate by reducing the discount factor in a structured manner.

B. Continuous Action Spaces
Although the estimation bias in discrete action spaces is overcome by the existing Deep Q-learning [16] variants, they cannot be adapted to the control of continuous systems due to the employment of a separate actor network [4]. As there are infinitely many actions in continuous action domains, maximizing over the Q-networks to select actions is intractable. Hence, the previously introduced methods for discrete action domains cannot be used [4]. Nonetheless, a direct and one-step solution to the overestimation and variance accumulation has been proposed by [4]. It is shown to be effective in eliminating the function approximation error for the deep setting of RL. Their research demonstrates that the deep function approximation of Q-values causes overestimation bias and cumulative variance in continuous action domains, which causes the approximate gradient of the actor network to diverge from the actual gradient. An extension of temporal difference learning [8] in the DPG [15] methods, Clipped Double Q-learning [4], on which we build our algorithm, presents a direct remedy to the overestimation problem by employing two identically structured Q-networks. On top of Clipped Double Q-learning [4], the delayed actor updates and target policy regularization through additive policy noise constitute the TD3 algorithm [4], which is shown to exhibit state-of-the-art performance. TD3 [4] overcomes the overestimation build-up by computing the target Q-value through the minimum of two approximate critics. Their introduced update rule, Clipped Double Q-learning [4], is used in many state-of-the-art continuous control algorithms.
Although the improvements introduced by Clipped Double Q-learning [4] can eliminate cumulative estimation error, the use of the minimum of two critics causes an underestimation bias in the Q-value estimates, as stated by [4] and empirically shown by [11], [13], [14]. Several techniques have been proposed to address the underestimation problem, including the use of a linear combination of the Q-value estimates by approximate critics to compute the objective for the Q-network update [13], [14]. We extensively review these approaches along with our previous work TCD3 [11] in the later sections.

III. BACKGROUND
The reinforcement learning paradigm considers an agent that interacts with its environment to learn the optimal, reward-maximizing behavior. The standard reinforcement learning problem is represented by a partially or fully observable Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively, $p$ is the transition dynamics, and $\gamma$ is the constant discount factor. At each discrete time step $t$, the agent observes its state $s \in \mathcal{S}$ and chooses an action $a \in \mathcal{A}$ according to its policy $\pi_\phi$, stochastic or deterministic, parameterized by $\phi$. Then, based on its action decision given the observed state, the agent receives the reward $r$ from a reward function corresponding to its environment and observes a next state $s'$ such that $s', r \sim p(s, a)$. The objective of the agent is to maximize the cumulative reward defined as the discounted sum of future rewards $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$, where the discount factor $\gamma \in [0, 1)$ downscales the long-term rewards to prioritize the short-term rewards.
The agent learns the optimal policy $\pi^*$ that maximizes the expected return $\mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[R_0]$. In actor-critic settings where the action space is continuous, parameterized policies $\pi_\phi$ represented by deep neural networks with parameters $\phi$ are optimized by computing the gradient of the expected return $\nabla_\phi J(\phi)$ through a policy gradient technique. In this study, we consider the deterministic policy gradient algorithm expressed by:

The expected return after taking the action $a$ given the observed state $s$ under the current policy $\pi$ is computed by the critic (action-value function or Q-function) $Q^\pi(s, a) = \mathbb{E}_{s_i \sim p_\pi, a_i \sim \pi}[R_t \mid s, a]$, which values the quality of the action decision given the observed state while following the current policy $\pi$. The critic evaluates and improves the agent's policy to obtain higher quality action choices.
In standard Q-learning [12], if the transition dynamics of the environment are accessible, the action-value function $Q^\pi$ is estimated through recursive Bellman optimization [24] given the transition tuple $(s, a, r, s')$:

For large state and action spaces, the action-value function is usually estimated by function approximators $Q_\theta(s, a)$ parameterized by $\theta$, also known as the Q-networks. In the deep setting of Q-learning [12], known as Deep Q-learning [16], the Q-network is updated through temporal difference learning [8], where a secondary frozen target network $Q_{\theta'}(s, a)$ is used to construct the objective for the behavioral Q-network:

where the next actions given the observed next state can be obtained from a separate target actor network $\pi_{\phi'}$ for actor-critic settings in continuous control. The target networks are either updated by a small proportion $\tau$ at each time step, i.e., $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, called soft update, or periodically set to exactly match the behavioral networks, called hard update.
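The one-step target construction and the soft update described above can be illustrated with a minimal sketch. It uses plain Python floats and lists in place of network outputs and parameters; the function names, the `done` flag for terminal transitions, and the default values of `gamma` and `tau` are illustrative assumptions, not taken from the original work.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step TD target: r + gamma * Q_target(s', a').

    `next_q` stands for the frozen target network's estimate at the next
    state-action pair. The `done` flag (an assumption of this sketch)
    disables bootstrapping on terminal transitions.
    """
    return reward + (0.0 if done else gamma * next_q)


def soft_update(target_params, behavior_params, tau=0.005):
    """Soft update: theta' <- tau * theta + (1 - tau) * theta'.

    Parameters are represented as flat lists of floats for illustration.
    """
    return [tau * b + (1.0 - tau) * t
            for t, b in zip(target_params, behavior_params)]


if __name__ == "__main__":
    print(td_target(reward=1.0, next_q=10.0, gamma=0.9))
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

A hard update corresponds to the special case `tau = 1.0`, applied only periodically.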

IV. THE UNDERESTIMATION BIAS IN DETERMINISTIC POLICY GRADIENTS
A. An Informative Analysis on the Existing Approaches to the Underestimation Bias
We start by explaining the current approaches to the underestimation bias in the literature. Mainly, we investigate the WD3 [13] and TADD [14] algorithms and their theoretical background. These studies extend the Clipped Double Q-learning algorithm [4] by replacing the Q-networks' objective with a fixed linear combination, as discussed. Let us first consider the WD3 algorithm [13]. In the simplest terms, Q-networks are updated as follows:

where $\tilde{a}$ is the action chosen by the target policy in the next state $s'$ perturbed by zero-mean Gaussian noise, i.e., $\tilde{a} = \pi_{\phi'}(s') + \mathcal{N}(0, \sigma)$, $\sigma$ is the standard deviation of the perturbation noise, and $J(\theta_i)$ is the loss associated with critic $Q_{\theta_i}$. Here, $\beta \in [0, 1]$ is a parameter that controls the underestimation since the minimum operator yields the underestimation of Q-values [4], [11]. Note that this additive exploratory noise does not alter the expected function approximation error since it has zero mean [4]. The TADD algorithm [14] adopts a similar approach through an additional third critic employed in Q-learning [12]. In addition, the last $K$ parameters of the third critic are stored in a critic network buffer, which is used to construct the objective for the Q-networks, particularly:

where the average of the last $K$ estimates assists in reducing the variance of the Q-value estimates [14].
In these studies, the errors of the employed Q-networks are represented by probability distributions, which is feasible as the employment of deep neural networks and bootstrapping in Q-learning introduces noise in the action-value estimates [4], [5]. Based on such a probabilistic representation, these works make two assumptions on the error distributions. First, it is stated that the error of each critic can be represented either by a zero-mean Gaussian or a zero-mean uniform distribution. Second, the error distributions of the two critics are independent and identically distributed, as shown by Theorems 1 and 2 in [13] and by Theorem 1 in [14]. Formally, we express the assumptions made in [13] and [14] as:

for some parameters $\delta$ and $\sigma$, where $Q^*$ denotes the actual Q-value of the state-action pair $(s, a)$. However, the zero-mean assumption violates the existence of the estimation bias:

The latter equation is satisfied since $Q^*(s, a)$ is the fixed point of the Bellman operator $\mathcal{T}^{\pi^*}$ [24] under the optimal policy $\pi^*$ [10]. Then, from (13), we infer that each $Q_{\theta_i}(s, a)$ is an unbiased estimator of $Q^*(s, a)$, which contradicts the existence of an estimation bias. Furthermore, the errors of the two critics cannot be entirely independent due to the employment of the opposite critic in learning the targets, as well as the use of the same replay buffer [4]. Therefore, the assumptions made in the current approaches to the underestimation bias violate the nature of Q-learning in off-policy and deterministic policy gradient [15] methods. Finally, we can conclude this section with the following remarks.
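To make the idea of a fixed linear combination concrete, the following is a minimal sketch of two β-weighted targets. It assumes, as our reading of the description above rather than a quotation of [13] or [14], that the weight β is placed on the minimum of the two target critics and 1 − β on either their average (WD3-style) or the average of the last K third-critic estimates (TADD-style). All function and argument names are hypothetical; the exact update rules are those given in the cited works.

```python
def weighted_twin_target(reward, q1_next, q2_next, beta, gamma=0.99):
    """Assumed WD3-style target: beta * min(Q1, Q2) + (1 - beta) * mean(Q1, Q2)."""
    q_min = min(q1_next, q2_next)
    q_avg = 0.5 * (q1_next + q2_next)
    return reward + gamma * (beta * q_min + (1.0 - beta) * q_avg)


def triplet_average_target(reward, q1_next, q2_next, q3_history, beta, gamma=0.99):
    """Assumed TADD-style target: the averaged term comes from the last K
    estimates of a third critic, passed here as the list `q3_history`."""
    q_min = min(q1_next, q2_next)
    q3_avg = sum(q3_history) / len(q3_history)
    return reward + gamma * (beta * q_min + (1.0 - beta) * q3_avg)


if __name__ == "__main__":
    # beta = 1 recovers the clipped (minimum) target; beta = 0 uses the average only.
    for beta in (0.0, 0.5, 1.0):
        print(beta, weighted_twin_target(0.0, 2.0, 4.0, beta, gamma=1.0))
```

Under this assumed form, a larger β pushes the target toward the pessimistic minimum, and a smaller β pushes it toward the more optimistic average.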
Remark 1. The estimation error of the two critics in the Clipped Double Q-learning algorithm [4] cannot follow a zero-mean probability distribution. If it did, the existence of an estimation bias would be contradicted.
Remark 2. Error distributions by the two critics in the Clipped Double Q-learning algorithm [4] are not independent due to the employment of the opposite critic in learning the targets and the use of the same replay buffer.

B. Derivation of the Closed-Form Expression for the Underestimation Bias
By considering Remarks 1 and 2, we begin to derive a closed-form expression for the estimation bias in the Clipped Double Q-learning algorithm [4]. The presence and effects of overestimation in actor-critic settings are highlighted in [4] through the gradient ascent in the policy updates. However, using the minimum operator to compensate for the overestimation of Q-values may result in underestimated action-value estimates. We begin by proving through basic assumptions and claims that the underestimation phenomenon exists in DPG [15] algorithms for environments with varying reinforcement signals. We follow the gradient ascent approach in [4] to show such underestimation.
In the TD3 algorithm [4], the policy is updated using the minimum value estimate of two approximate critics, $Q_{\theta_1}$ and $Q_{\theta_2}$, parameterized by $\theta_1$ and $\theta_2$, respectively. Without loss of generality, we assume that both critics overestimate the action values, and the policy is updated with respect to the first approximate critic $Q_{\theta_1}(s, a)$ through the DPG algorithm [15]. The assumption on the overestimation of both Q-networks is valid as the single Q-network in the Deep Deterministic Policy Gradient (DDPG) algorithm [25] already overestimates the Q-values, as shown by [4]. First, let $\phi_{\text{approx}}$ denote the parameters obtained from the actor update by the maximization of the first approximate critic $Q_{\theta_1}(s, a)$:

where $Z_1$ is the gradient normalizing term such that $Z_1^{-1}\lVert\mathbb{E}[\cdot]\rVert = 1$, and $\eta > 0$ is the learning rate. As the actor is optimized with respect to $Q_{\theta_1}(s, a)$ and the gradient direction is a local maximizer, there exists $\zeta$ sufficiently small such that if $\eta < \zeta$, then the approximate value of the policy, $\pi_{\text{approx}}$, by the first critic will be bounded below by the approximate value of the policy by the second critic:

Note that for the latter equation, there could be a local maximizer for which

However, such a possibility can be neglected in actor-critic algorithms that use Clipped Double Q-learning [4] since the actor is always optimized with respect to the first critic $Q_{\theta_1}$ [4]. Then, we can treat the function approximation errors of both critics as distinct Gaussian random variables:

Following (15) and Remark 1, we have $\mu_1 \geq \mu_2 \geq 0$.
As the same experience replay buffer [26] and opposite critics are used in learning the target Q-values and critics, the error Gaussians in (16) are not entirely independent according to Remark 2. Through the first moment of the minimum of two correlated Gaussian random variables [27], the expected estimation error for the Clipped Double Q-learning algorithm [4] becomes:

where $\theta := \sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}$, $\rho$ is the correlation coefficient between $N_1$ and $N_2$, and $\Phi(\cdot)$ and $\psi(\cdot)$ are the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution, respectively. Because the delayed actor updates decouple the actor from the first critic, the mean function approximation errors of the two critics are not far apart. Hence, for simplicity, we can assume that $\mu_1 \approx \mu_2$. Using this, (17) reduces to:

since then the action-value estimate will be underestimated:

From the condition $\sigma_1, \sigma_2 > \sqrt{\pi/(1 - \rho)}\,\mu_1$, if the pair of critics are highly correlated, underestimation does not exist. However, there exists a moderate correlation between the pair of critics due to the delayed policy updates, which increases the possibility of underestimation [4].
Although the improvements by [4] aim to reduce the estimation error growth, the variance of the Q-values cannot be eliminated as it is tied to the variance of the future value estimates and rewards [4]. Furthermore, the Bellman equation [24] in function approximation settings cannot be exactly satisfied [4], which results in erroneous Q-value estimates as a function of the actual TD-error expressed by (16). Then, we can show that the variance of the value estimates increases as the agent receives reward signals that vary on a large scale due to exploration [28]. As shown in [4], the Q-value estimates can be expressed in terms of the expected sum of discounted future rewards:

If the expected estimation errors of both critics are constant, varying reinforcement signals increase the variance of the Q-value estimates, resulting in an increasing underestimation bias. Since extensive exploration is a mandatory requirement for continuous action spaces [28], the variance of the reinforcement signals usually increases throughout the learning phase. Therefore, the underestimation bias in the value estimates becomes unavoidable. Moreover, in the underestimation case, the estimation error is not accumulated due to TD learning [5], [8]. Thus, the underestimation bias is far preferable to overestimated Q-values in the actor-critic setting [4]. Nevertheless, underestimated action values may discourage agents from choosing good state-action pairs for an extended period and lead agents to value suboptimal state-action pairs more frequently [11].
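The first moment of the minimum of two correlated Gaussians can be checked numerically. The sketch below uses the standard closed form E[min(N1, N2)] = μ1 Φ(d) + μ2 Φ(−d) − θ ψ(d), with d = (μ2 − μ1)/θ and θ the standard deviation of N1 − N2, and compares it against a Monte Carlo estimate. For equal means it collapses to μ − θ/√(2π), which is negative (an underestimation) whenever θ > √(2π) μ. The function names, sample size, and seed are illustrative assumptions.

```python
import math
import random


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_min(mu1, mu2, s1, s2, rho):
    """Closed-form E[min(N1, N2)] for correlated Gaussians (requires theta > 0)."""
    theta = math.sqrt(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2)
    d = (mu2 - mu1) / theta
    return mu1 * std_normal_cdf(d) + mu2 * std_normal_cdf(-d) - theta * std_normal_pdf(d)


def monte_carlo_min(mu1, mu2, s1, s2, rho, n=200_000, seed=0):
    """Monte Carlo estimate of E[min(N1, N2)] using a correlated Gaussian pair."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = mu1 + s1 * z1
        # Mix z1 into the second variable to obtain correlation rho.
        x2 = mu2 + s2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        total += min(x1, x2)
    return total / n


if __name__ == "__main__":
    args = (1.0, 1.0, 2.0, 2.0, 0.5)  # mu1, mu2, sigma1, sigma2, rho
    print("closed form:", expected_min(*args))
    print("monte carlo:", monte_carlo_min(*args))
```

Raising `rho` toward 1 shrinks θ and hence the negative correction term, which matches the observation that highly correlated critics do not underestimate.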
Remark 3. A varying set of reinforcement signals increases the variance of the Q-value estimates, which results in an increasing underestimation bias.
We can show the existence of the underestimation bias in practice by comparing the true and estimated Q-values while an agent under the TD3 algorithm [4] is learning on a set of OpenAI Gym [17] continuous control tasks over a training duration of 1 million time steps. The simulation results are reported in Fig. 1. We randomly select 1000 state-action pairs at every step and obtain the estimated Q-values from the first Q-network. The true Q-values are obtained every 100,000 time steps by computing the discounted sum of rewards starting from 1000 randomly sampled states and following the current policy. Monte Carlo simulation [29] is used over the randomly selected states and state-action pairs to obtain the average true and estimated Q-values.
From Fig. 1, we observe an apparent underestimation bias throughout the learning phase such that the estimated Q-values are smaller than the true ones except for a small proportion of the initial time steps. The underestimation bias arises depending on the environment and either grows or settles to a fixed level. These simulation results verify our claims: the approximate critics overestimate the actual Q-values at the initial steps. However, when the agent starts exploring the environment and encounters varying rewards, the variance of the value estimates increases, and the underestimation bias starts growing. For BipedalWalker and LunarLanderContinuous, the underestimation bias becomes fixed after some time. This is due to the span of the reward space. If the agent encounters a sufficiently large subspace at the beginning of learning, the underestimation bias cannot become larger. However, if the agent does not encounter a sufficiently large subspace, the underestimation bias keeps growing even with the delayed target and actor updates, as in the rest of the environments. Although the continuous, multi-dimensional, and large state-action spaces contribute to the growth of error, the scale of the current RL benchmarks is still very small compared to real-world tasks [30]. Hence, the underestimation bias will be more detrimental and inevitable when larger-scale tasks are introduced.
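The comparison of true and estimated Q-values can be sketched as follows. The true value of a state-action pair is approximated by the discounted sum of rewards collected along a rollout of the current policy, and the bias is the difference between the mean estimate and the mean Monte Carlo return. The function names and the list-based data layout are illustrative assumptions; environment interaction and the critic itself are omitted.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence, accumulated backwards in time."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mean_estimation_bias(estimated_q, rollout_rewards, gamma=0.99):
    """Mean estimated Q-value minus mean Monte Carlo return.

    `estimated_q` holds critic outputs for the sampled state-action pairs and
    `rollout_rewards` holds one reward sequence per sampled start state.
    A negative result indicates underestimation on average.
    """
    true_q = [discounted_return(rs, gamma) for rs in rollout_rewards]
    return sum(estimated_q) / len(estimated_q) - sum(true_q) / len(true_q)


if __name__ == "__main__":
    rollouts = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(mean_estimation_bias([1.0, 2.0], rollouts, gamma=0.5))
```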
To overcome the shown underestimation bias, we first start by deriving the expected error induced by the update rule in TCD3 [11], and reduce the number of Q-networks to two while obtaining the same expected error.Then, through an extensive analysis of the WD3 [13] and TADD [14] algorithms, we introduce our novel, hyperparameter-free modification on the target Q-value update that can further reduce the underestimation bias while preventing the overestimation.

A. Methodology
First, we consider the Q-network update rule in our previous work, the TCD3 algorithm [11]:

where we employed an additional third critic $Q_{\theta_3}$ with the corresponding estimation error distribution $N_3 \sim \mathcal{N}(\mu_3, \sigma_3)$. As the first critic is used to optimize the policy and due to the randomness in transition sampling, the same probability distribution can represent the errors corresponding to the second and third critics, i.e., $N_3 \sim \mathcal{N}(\mu_2, \sigma_2)$. We previously showed that this update rule can upper- and lower-bound the Q-value estimates by taking the minimum of the maximum of the first two critics and the third critic. Now, let us derive the expected function approximation error induced by the Q-value target expressed by (21). First, expand $\min(\max(N_1, N_2), N_3)$ in terms of the maximum of the error Gaussians:

It is not trivial to compute the expectation of the latter term on the right-hand side of (22). However, we can rewrite (22) in terms of the maximum of three correlated Gaussians and use the derivation of its expectation for the equal-means case from [31]. For this purpose, let

Then, the expected value of (22) can be expressed as:

Under the assumption made in Section IV-B that $\mu_1 \approx \mu_2 = \mu_3$, let us define $\mu := \mu_1 = \mu_2 = \mu_3$. Now, we can directly import the special case for the expectation of the maximum of correlated Gaussians from [31]. The equal-means case states that if $N_i \sim \mathcal{N}(\mu, \sigma_i)$, then the expected value of the maximum of three Gaussians can be expressed as:

where $\theta_{i,j} := \sqrt{\sigma_i^2 + \sigma_j^2 - 2\rho\sigma_i\sigma_j}$. Due to the same experience replay [26] used in updating the Q-networks and the decoupled actor and first critic, without loss of generality, we can further assume that $\theta := \theta_{1,2} = \theta_{1,3} = \theta_{2,3}$. Then, (25) reduces to:

Furthermore, using the exact expression for $\mathbb{E}[\max(N_1, N_2)]$ from [27], similar to (17), we have:

Using the assumptions made, we can simplify (27) into:

Inserting (26), (28), and $\mathbb{E}[N_3] = \mu$ into (24), we derive:

Replacing $\mu$ with $\mu_2$, we can express the expected function approximation error for $\min(\max(Q_1, Q_2), Q_3)$ in terms of the expected error for Clipped Double Q-learning [4] denoted by (17) as:

This expected estimation bias is slightly less than the average of the underestimation in TD3 [4] and the overestimation in the DDPG algorithm [25]. As the variance of the value estimates of two correlated critics is greater than the expected function approximation error, (30) is still an underestimation. We can further reduce this underestimation by replacing $\mu_2$ with $\mu_1$ in (30) as $\mu_1 \geq \mu_2 \geq 0$:

Observe that the expected values of (21) and (31) are the same. We eliminate the computational burden introduced by the employment of the third Q-network while attaining the same expected error. Hence, the computational complexity is reduced by 33%. Now, let us show the expected error of the WD3 [13] and TADD [14] algorithms. The update rules in these methods were previously expressed in (4) and (6), respectively. Using the error Gaussian distributions in (16) and the expectation of the minimum of two correlated Gaussians in (18), the expected error of WD3 [13] is expressed as:

Note that (32) is satisfied as $\mu \approx \mu_1 \approx \mu_2 = \mu_3$. Similarly, TADD [14] yields the following expected error:

where again, the latter equations are satisfied by $\mu \approx \mu_1 \approx \mu_2 = \mu_3$. Essentially, from (32) and (33), we observe that the estimation bias in the WD3 [13] and TADD [14] algorithms is the same. Moreover, by (30), we can infer that the following equations hold:

where the compared quantity denotes the estimation bias of each method. We highlight our theoretical findings in the following remarks.
Remark 4. The Q-network update rules in the WD3 [13] and TADD [14] algorithms yield the same estimation bias.
Although the WD3 [13] and TADD [14] approaches violate Remarks 1 and 2, using a $\beta$ parameter enables control of the underestimation bias. However, having a fixed $\beta$ is a task-specific greedy approach that cannot prevent the increasing underestimation bias as the variance of the reward signals increases throughout learning. To overcome this issue, we uniformly sample the $\beta$ parameter from an interval, the lower bound of which linearly decreases throughout learning, consistent with the increasing variance of the reinforcement signals.
To specify the upper and lower bounds of this sampling interval, we leverage the findings in our previous work [11]. In [11], we showed that the estimation error of the Triplet Critic Update remains an underestimation, the absolute value of which is significantly smaller than that of Clipped Double Q-learning [4]. Although the estimation bias is not completely eliminated, the remaining yet significantly decreased underestimation could dramatically improve the performance since underestimation is preferable to overestimation [4]. As our previous work [11] corresponds to $\beta = 0.5$ in (32) and (33), we set both the upper and lower bounds of the interval to 0.5 at the beginning of learning. Then, we linearly decrease the lower bound so that the contribution of an increasing variance of the rewards is also decreased throughout learning. As we do not know the exact values of $\mu_i$ and $\theta$, yet we are sure that $\beta = 0$ yields overestimation, the final lower bound cannot be 0 but should be a small number, slightly larger than 0.
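The expansion of the triplet error used in the derivation can be verified numerically. The sketch below checks the identity min(max(N1, N2), N3) = max(N1, N2) + N3 − max(N1, N2, N3) and estimates its expectation by Monte Carlo. As a simplifying assumption of this sketch (not a quotation of the elided equations), all three errors share the same mean μ, standard deviation σ, and non-negative pairwise correlation ρ; in that case our own calculation gives μ − θ/(2√(2π)) with θ = σ√(2(1 − ρ)), i.e., half the correction term of the two-critic minimum. Function names, sample size, and seed are illustrative.

```python
import math
import random


def triplet_error(n1, n2, n3):
    """Error of the min(max(., .), .) target applied to three critic errors."""
    return min(max(n1, n2), n3)


def triplet_error_expanded(n1, n2, n3):
    """Same quantity written through maxima only."""
    return max(n1, n2) + n3 - max(n1, n2, n3)


def closed_form_triplet(mu, sigma, rho):
    """Expected triplet error under equal means, variances and correlations."""
    theta = sigma * math.sqrt(2.0 * (1.0 - rho))
    return mu - theta / (2.0 * math.sqrt(2.0 * math.pi))


def monte_carlo_triplet(mu, sigma, rho, n=100_000, seed=0):
    """Monte Carlo estimate using a shared factor to induce correlation rho >= 0."""
    rng = random.Random(seed)
    shared, own = math.sqrt(rho), math.sqrt(1.0 - rho)
    total = 0.0
    for _ in range(n):
        z0 = rng.gauss(0.0, 1.0)
        n1, n2, n3 = (mu + sigma * (shared * z0 + own * rng.gauss(0.0, 1.0))
                      for _ in range(3))
        total += triplet_error(n1, n2, n3)
    return total / n


if __name__ == "__main__":
    print("closed form:", closed_form_triplet(1.0, 2.0, 0.5))
    print("monte carlo:", monte_carlo_triplet(1.0, 2.0, 0.5))
```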
For this, we set the final lower bound of the bias interval to a small number α = 0.05. Formally, we obtain the β parameter as:

$$\beta^{(t)} \sim \mathcal{U}\big(\underline{\beta}^{(t)},\, 0.5\big), \qquad \underline{\beta}^{(t)} = 0.5 - (0.5 - \alpha)\,\frac{t}{T},$$

where $\beta^{(t)}$ is the sampled β value at time step t, $\underline{\beta}^{(t)}$ is the lower bound of the sampling interval at time step t, and T is the number of total training iterations. One concern with this update rule is that, as the exact estimation error cannot be known in theory, it may result in overestimation for some time steps. In addition, estimation error accumulates through subsequent updates in which Q-values are overestimated [4]. Nevertheless, the accumulated error will be clipped once a β value that yields underestimation is sampled. Therefore, due to the randomness, the estimation error does not accumulate over a significant number of time steps throughout learning, and the RL agents can tolerate such slightly overestimated Q-values [11]. This forms our parameter-free update rule, Stochastic Weighted Twin Critic Update. As a result, our modification offers accurate Q-value estimates without introducing additional hyper-parameters or networks. We summarize our introduced approach in Algorithm 1, and the resulting algorithm built on the TD3 algorithm [4], Stochastic Weighted Twin Delayed Deep Deterministic Policy Gradient (SWTD3), in Algorithm 2.

Remark 6. Due to the decreased lower bound of the β sampling interval, and hence the decreased mean of the β distribution, the introduced Q-network update rule is not affected by the increasing reward variance as much as when β is fixed.

Remark 7. The estimation error induced by Stochastic Weighted Twin Critic Update may result in overestimation for some training iterations, especially in the later stages of learning, since the lower bound of the β interval becomes very small. However, if a β value corresponding to underestimation is sampled, the overestimation will be clipped. Hence, estimation error does not accumulate over a significant number of time steps in the SWTD3 algorithm throughout learning due to its stochastic nature.
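The sampling scheme described in this subsection can be sketched in a few lines. The snippet below is a minimal illustration, assuming a linear interpolation of the lower bound from 0.5 to α = 0.05 over T training iterations and a fixed upper bound of 0.5, as described in the text. The function names `beta_lower_bound` and `sample_beta` are hypothetical and are not taken from the authors' implementation.

```python
import random


def beta_lower_bound(t, total_steps, start=0.5, alpha=0.05):
    """Lower bound of the sampling interval, decayed linearly from `start` to `alpha`."""
    progress = min(max(t / total_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return start - (start - alpha) * progress


def sample_beta(t, total_steps, rng=random, upper=0.5, alpha=0.05):
    """Draw beta uniformly from [beta_lower_bound(t), upper]."""
    lower = beta_lower_bound(t, total_steps, start=upper, alpha=alpha)
    return rng.uniform(lower, upper)


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 1_000_000  # assumed to match the 1 million time steps used in training
    for t in (0, total_steps // 2, total_steps):
        print(t, beta_lower_bound(t, total_steps), sample_beta(t, total_steps, rng))
```

At t = 0 the interval collapses to the single point 0.5, which corresponds to the setting attributed to [11]; as t approaches T, the interval widens to roughly [0.05, 0.5], so smaller β values become increasingly likely. The sampled β would then weight the two critics' estimates in the target computation of Algorithm 1.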

B. Algorithmic and Complexity Comparison with the Existing Strategies
We investigate how our method differs from the previously examined approaches to the underestimation bias. First, we derive our method by assuming positively biased Q-value estimators and dependence of the approximate critics, which are mandatory and realistic in practice. These requirements were previously summarized in Remarks 1 and 2, respectively. Second, our method does not introduce any hyper-parameter to be tuned, in contrast to the WD3 [13] and TADD [14] algorithms, which require tuning the β parameter that controls the underestimation.
As we explained previously, the TD3 [4] and WD3 [13] algorithms maintain two critics while TADD [14] trains three critics. Although the Q-network objective computation requires only the estimates of the target Q-networks, the behavioral Q-networks must still be maintained, as a soft or hard update is used to update the corresponding target networks. Moreover, the TADD algorithm [14] uses the estimates of K target Q-networks in constructing the Q-network objective. Nevertheless, the time complexity of backpropagation through a network either matches or exceeds that of forward propagation. Hence, we measure the time complexity only by the number of backpropagated Q-networks. Therefore, our method, TD3 [4] and WD3 [13] match in terms of run time and are bounded by the time complexity of TADD [14]. The following Remarks conclude this comparison.

Remark 8. Our method introduces an analytical solution to the underestimation bias for deterministic policy gradients by considering biased Q-value estimators and dependence of the Q-networks in Clipped Double Q-learning [4], contrasting with the WD3 [13] and TADD [14] studies.

Remark 9. Our method does not introduce any hyper-parameter to be optimized, in contrast to WD3 [13] and TADD [14], in which the underestimation control parameter β must be tuned for each continuous control task.
Remark 10. The time complexities of the TD3 [4], WD3 [13] and SWTD3 algorithms match and are bounded by the time complexity of the TADD algorithm [14].

VI. EXPERIMENTS
We evaluate the performance of our estimation bias correction approach by first comparing the estimated and actual Q-values of SWTD3 against those of TD3 [4], WD3 [13] and TADD [14]. Then, we evaluate the learning performance of RL agents under the SWTD3, TD3 [4], WD3 [13] and TADD [14] algorithms in MuJoCo [32] and Box2D [33] continuous control tasks interfaced by OpenAI Gym [17]. We also consider our previous work, TCD3 [11], in our comparative evaluations for discussion. For reproducibility and a fair evaluation procedure, we directly follow the same set of tasks from MuJoCo [32] and Box2D [33] with no modifications to the environment dynamics.

A. Implementation Details and Experimental Setup
To implement the TD3 algorithm [4], we use the author's GitHub repository. The implementation of TD3 [4] is the fine-tuned version of the algorithm. This version of TD3 [4] differs from the one introduced in [4] in that the number of hidden units in all networks is reduced to 256, the batch size is increased from 100 to 256, the learning rates for the behavioral actor and critic Adam optimizers [34] are decreased from $10^{-3}$ to $3 \times 10^{-4}$, and a purely exploratory policy is employed for 25000 time steps in all environments. Furthermore, we build our modification on the TD3 [4] implementation such that the target Q-value computation is replaced by Algorithm 1. To ensure stability over updates and for consistency with our theoretical approach, the actor in SWTD3 is always optimized with respect to the first critic, as in the TD3 [4] and TCD3 [11] algorithms.
To implement the baseline algorithms, WD3 [13] and TADD [14], we use the TD3 algorithm's repository. We follow the same parameter, network, and Q-value update structures in [13] and [14] such that we replace the target Q-value computation and initialize an additional Q-network if required. For the pre-defined weight parameter β, we use the values for the environments presented in the respective papers. For the rest of the environments, we manually fine-tune the β value over a training duration of 1 million time steps for ten random seeds. The values with the highest average of the last ten evaluation returns over ten random seeds are chosen to train the WD3 [13] and TADD [14] algorithms. Table I presents the environment-specific weight parameter β values used for the WD3 [13] and TADD [14] algorithms. The values that we fine-tuned and those presented in [13] and [14] are marked accordingly.
Each task in the Q-value comparisons is run for 1 million time steps, and the curves are derived through the same procedure explained in Section IV-B. For the performance evaluations, we run the algorithms on every task for 1 million time steps and, every 1000 time steps, evaluate the agent in a distinct evaluation environment without exploration noise or learning. Each reported evaluation is the average of ten episode rewards. The results are reported over ten random seeds of the Gym [17] simulator, network initialization, and code dependencies.
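The selection criterion and reporting procedure above can be sketched as follows. This is a minimal illustration, assuming each run is stored as a list of evaluation returns (one entry per evaluation point, each already averaged over ten episodes) and that the spread across seeds is the sample standard deviation; the text does not state which estimator is used. The helper names `final_score` and `summarize_over_seeds` are hypothetical.

```python
import statistics


def final_score(eval_returns, last_k=10):
    """Mean of the last `last_k` evaluation returns of a single run (one seed)."""
    if len(eval_returns) < last_k:
        raise ValueError("run has fewer evaluation points than last_k")
    return statistics.fmean(eval_returns[-last_k:])


def summarize_over_seeds(runs, last_k=10):
    """Return (mean, standard deviation) of the final scores across seeds."""
    scores = [final_score(run, last_k) for run in runs]
    mean = statistics.fmean(scores)
    # Sample standard deviation is an assumption; a single run has no spread.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    total_steps, eval_every = 1_000_000, 1000
    print("evaluation points per run:", total_steps // eval_every)
    # Two toy runs standing in for two seeds; the values are made up for illustration.
    toy_runs = [[float(i) for i in range(20)], [float(i) + 1.0 for i in range(20)]]
    print(summarize_over_seeds(toy_runs))
```

Under this scheme, a candidate β for WD3 or TADD would be scored by `summarize_over_seeds` on its ten runs, and the value with the highest mean would be kept.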

B. Discussion
1) Q-value Comparisons: Actual and estimated Q-value comparisons for our approach versus TD3 [4], WD3 [13] and TADD [14] over six OpenAI Gym [17] continuous control tasks are reported in Figs. 1, 2 and 3, respectively. SWTD3 obtains more accurate Q-value estimates than TD3 [4] and the baseline algorithms in all of the environments tested. Our empirical findings point to two observations. First, in the baseline Q-value estimations, the underestimation increases as the variance of the received reward signals grows throughout learning, reflecting Remark 3. Second, although our method obtains fairly accurate Q-value estimates and is not affected by an increasing reward variance, the Q-values are overestimated in the initial steps. This is due to the large β values sampled at the beginning of learning. However, as we discussed, such overestimated Q-values are tolerated by the agent, and the estimation errors reduce to a negligible margin, which verifies our claim in Remark 3.
Furthermore, we fine-tune the β value for the environments that are not reported in [13] and [14], as stated previously. Our fine-tuning results show that the corresponding β values in these environments are the same for WD3 [13] and TADD [14] since the expected function approximation error is also the same, as highlighted in Remark 4. As a result, the mean estimation errors in these environments, namely BipedalWalker, HumanoidStandup, Humanoid, and LunarLanderContinuous, are practically the same for WD3 [13] and TADD [14]. For the environments that are reported in [13] and [14], WD3 [13] obtains more accurate Q-value estimates than TADD [14] since β = 0.95 is used in the TADD algorithm [14], which corresponds to a significant underestimation error due to the large contribution of the negative reward variance, as specifically shown in (33).
Our method attains substantially more accurate Q-value estimates than the competing approaches. It overcomes the effects induced by the increasing variance of the received reinforcement signals by sampling from an estimation error interval, the lower bound of which is steadily decreased, verifying Remark 6.
2) Evaluation: Table II reports the evaluation results in terms of the average of the last ten evaluation rewards over ten random seeds. Additionally, Fig. 4 depicts the corresponding learning curves. From our experimental results, we observe that our method either matches or outperforms TD3 [4] and the baseline algorithms in terms of the learning speed and highest evaluation return. In environments such as BipedalWalker, Humanoid, and LunarLanderContinuous, where our algorithm and the competing approaches converge to approximately the same highest evaluation returns, Fig. 4 demonstrates that SWTD3 converges faster by largely shrinking the underestimation bias and overcoming the increasing reward variance. Moreover, we do not observe a significant performance difference in trivial environments, e.g., InvertedDoublePendulum, InvertedPendulum, and Reacher, as they do not require complex solutions [35].
We observe that TCD3 [11], WD3 [13] and TADD [14] exhibit better performance than TD3 [4]. However, in the environments reported by [14], where β = 0.95, the performance of TADD [14] is very similar to that of TD3 [4], as β = 1.0 corresponds to the same expected error as in TD3 [4]. Furthermore, from our discussion in Remarks 4 and 5 and the theoretical analysis in (30), (32) and (33), we infer that TCD3 [11], WD3 [13] and TADD [14] yield approximately the same performance for β = 0.5, which is depicted in the BipedalWalker environment. In addition, when the β value of WD3 [13] is smaller than that of TADD [14], WD3 outperforms TADD [14] since a small fixed β value often corresponds to a decreased underestimation error. In contrast, the two algorithms exhibit the same performance when their β values are the same. Overall, these results are consistent with our Q-value comparisons and reflect the theoretical insights made in this study.
Finally, some methods exhibit worse performance than reported in the original articles. This is due to the stochasticity of the environment dynamics; that is, the dependencies used, the hardware, and the random seeds have a large effect on the performance of reinforcement learning algorithms [35]. Nevertheless, we use the same set of seeds for all algorithms in our experiments, and the evaluation results would be consistent if we used different seeds, which suffices for a fair evaluation procedure [35]. This also holds for the resulting performances when the same β value is used for WD3 [13] and TADD [14]. The algorithmic differences alter the pseudorandom number order in the environment dynamics and cause the performances to differ slightly even under the same β value. Nonetheless, the overall performances are practically the same.

VII. CONCLUSION
In this paper, we focus on the underestimation of the Q-values in deterministic policy gradient [15] methods. We extend our previous work on the underestimation by theoretically addressing the infeasible assumptions in the existing approaches that prevent them from adapting to off-policy actor-critic algorithms. We support our claims through Remarks and show that receiving reward signals that vary on a large scale increases the underestimation of the action-value estimates. Then, through an extensive analysis of the estimation bias induced by the existing approaches, we introduce our novel Deep Q-learning [16] variant that forms a linear combination of two Q-value approximators, with weights that are sampled from a shrunk estimation bias interval. Combining our statistical analysis with an extensive set of empirical studies, we demonstrate that the introduced approach notably outperforms the existing methods and improves upon our previous study. We also provide the exact implementation of the introduced algorithm in a GitHub repository for reproducibility.

Fig. 1. Measuring estimation bias of fine-tuned TD3 versus SWTD3 while learning on MuJoCo and Box2D environments over 1 million time steps. Estimated and true Q-values are computed through Monte Carlo simulation for 1000 samples.

Fig. 2. Measuring estimation bias produced by WD3 versus SWTD3 while learning on MuJoCo and Box2D environments over 1 million time steps. Estimated and true Q-values are computed through Monte Carlo simulation for 1000 samples.

Fig. 3.

Fig. 4. Learning curves for the set of OpenAI Gym continuous control tasks. The shaded region represents half a standard deviation of the average evaluation over ten trials. Curves are smoothed uniformly with a sliding window of size 10.

TABLE I
WD3 AND TADD ENVIRONMENT-SPECIFIC WEIGHT VALUES.

TABLE II
AVERAGE OF LAST 10 EVALUATION RETURNS OVER 10 TRIALS. BOLDFACE REPRESENTS THE MAXIMUM IN EACH TASK. ± DENOTES THE SINGLE STANDARD DEVIATION OVER TRIALS. THE WD3 AND TADD ALGORITHMS USE THE BETA VALUES GIVEN IN TABLE I.