The Impact of Data Distribution on Q-learning with Function Approximation

We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis of how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution on the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of data distribution properties on the performance of two offline Q-learning-based algorithms under different environments. According to our results: (i) high entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to the optimal policy) are jointly desirable for offline learning.

The interplay between the data distribution and the outcome of the learning process is one potential source of instability of Q-learning-based algorithms (Kumar, Gupta, & Levine, 2020; Sutton & Barto, 2018). Different lines of research shed some light on how the data distribution impacts algorithmic stability. For example, some works provide examples that induce unstable behavior in off-policy learning (Baird, 1995; Kolter, 2011); some theoretical works derive error bounds on the performance of Q-learning-related algorithms (Chen & Jiang, 2019; Munos, 2005; Munos & Szepesvári, 2008); yet other studies investigate the stability of RL methods with large capacity approximators (Fu et al., 2019; Kumar, Gupta, & Levine, 2020).
We center our study around the following research question: which data distributions lead to improved algorithmic stability and performance? In the context of this work, we refer to the data distribution as the distribution used to sample experience or the distribution induced by a dataset of transitions. We investigate how different data distribution properties influence performance in the context of Q-learning-based algorithms with function approximation. We add to previous works by providing a systematic and comprehensive study that connects different lines of research, as well as validating and extending previous results, both theoretically and empirically. We primarily focus on offline RL settings with discrete action spaces (Levine, Kumar, Tucker, & Fu, 2020), in which an RL agent aims to learn reward-maximizing behavior using previously collected data without additional interaction with the environment. Nevertheless, our conclusions are also relevant in online RL settings, particularly for algorithms that rely on large-scale replay buffers. Our conclusions contribute to a deeper understanding of the influence of data distribution properties on the performance of approximate dynamic programming (ADP) methods.
We start by presenting some background and the notation used throughout the paper in Sec. 2. We connect our work with previous lines of research in Sec. 3. Then, we investigate how the data distribution impacts the performance of ADP methods. In Sec. 4.1, we review bounds on the performance of ADP methods; we highlight the close relationship between different properties of the data distribution and the tightness of the bounds and motivate high entropy distributions from a game-theoretical point of view. In Sec. 4.2, we propose a novel four-state MDP specifically tailored to highlight how the data distribution impacts algorithmic performance, both online and offline. Finally, in Sec. 5, we empirically assess the impact of the data distribution on the performance of offline Q-learning-based algorithms with function approximation under different environments, connecting the obtained results with the discussion presented in Sec. 4. According to our results: (i) high entropy data distributions are well-suited for offline learning; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to the optimal policy) are jointly desirable for offline learning. The results in Sec. 5 are one of the main contributions of the paper. Our conclusions appear in Sec. 6.

Background
In RL (Sutton & Barto, 2018), the agent-environment interaction is modeled as an MDP, formally defined as a tuple (S, A, p, p_0, r, γ), where S denotes the state space, A denotes the action space, p : S × A → P(S) is the state transition probability function with P(S) being the set of distributions on S, p_0 ∈ P(S) is the initial state distribution, r : S × A → R is the reward function, and γ ∈ (0, 1) is a discount factor. At each step t, the agent observes the state of the environment s_t ∈ S and chooses an action a_t ∈ A. Depending on the chosen action, the environment evolves to state s_{t+1} ∈ S with probability p(s_{t+1}|s_t, a_t), and the agent receives a reward r_t with expectation given by r(s_t, a_t). A policy π is a mapping π : S → P(A) encoding the preferences of the agent. We denote by P^π the |S| × |S| matrix with elements P^π(s_t, s_{t+1}) = E_{a∼π(a|s_t)}[p(s_{t+1}|s_t, a)]. A trajectory, τ = (s_0, a_0, s_1, a_1, ...), comprises a sequence of states and actions. The probability of a trajectory τ under a given policy π is given by P_π(τ) = p_0(s_0) ∏_{t=0}^∞ π(a_t|s_t) p(s_{t+1}|s_t, a_t).
The discounted reward objective can be written as J(π) = E_{τ∼P_π}[∑_{t=0}^∞ γ^t r_t]. The objective of the agent is to find an optimal policy π* that maximizes the objective function above such that J(π*) ≥ J(π), ∀π. RL algorithms usually involve the estimation of the optimal action-value function, Q*, satisfying the Bellman optimality equation:

Q*(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p(·|s_t,a_t)}[max_{a_{t+1}∈A} Q*(s_{t+1}, a_{t+1})]. (1)

The optimal value function, V*, can be computed from Q* as V*(s_t) = max_{a_t∈A} Q*(s_t, a_t). Q-learning (Watkins & Dayan, 1992) allows an agent to learn directly from raw experience in an online, incremental fashion, by estimating optimal Q-values from observed trajectories using the temporal difference (TD) update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α (r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t)).

Convergence to the optimal policy is guaranteed if all state-action pairs are visited infinitely often and the learning rate α is appropriately decayed. Usually, in order to adequately explore the S × A space, an ε-greedy policy is used. An ε-greedy policy chooses, most of the time, an action that has maximal Q-value for the current state, but with probability ε selects a random action instead.
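The TD update rule above can be sketched as a short tabular implementation. The two-state chain at the end is our own toy example (not one of the paper's environments), used only to exercise the update:

```python
import numpy as np

def q_learning(P, R, gamma=0.9, alpha=0.1, eps=0.1, episodes=2000, horizon=50, seed=0):
    """Tabular Q-learning with an eps-greedy behavior policy.

    P: (S, A, S) transition probabilities; R: (S, A) expected rewards.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(episodes):
        s = 0  # fixed initial state for this sketch
        for _ in range(horizon):
            a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next = rng.choice(S, p=P[s, a])
            # TD update: move Q(s,a) toward the bootstrapped target
            target = R[s, a] + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Toy two-state chain: action a_1 in state s_0 leads to state s_1, which pays reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0; P[1, :, 1] = 1.0
R = np.array([[0.0, 0.0], [1.0, 1.0]])
Q = q_learning(P, R)
```

With sufficient exploration, the learned greedy action in state 0 is the one that reaches the rewarding state, matching the convergence conditions stated above.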
In ADP, Q-values are approximated by a differentiable function Q_φ, where φ denotes the learnable parameters of the model. ADP algorithms, such as the well-known deep Q-network (DQN) algorithm (Mnih et al., 2015), usually interleave two phases: (i) a sampling phase, where parameters φ are kept fixed and a behavior policy (e.g., an ε-greedy policy) is used to collect transitions (s_t, a_t, r_t, s_{t+1}) and store them into a replay buffer, denoted by B; and (ii) an update phase, where transitions are sampled from B and used to update parameters φ such that the loss

L(φ) = E_{(s_t,a_t,r_t,s_{t+1})∼B}[(r_t + γ max_{a∈A} Q_{φ−}(s_{t+1}, a) − Q_φ(s_t, a_t))²]

is minimized, where φ− denotes the parameters of the target network, Q_{φ−}, a periodic copy of the behavior network. The pseudocode of a generic Q-learning algorithm with function approximation can be found in Appendix A.
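The update-phase objective can be written down directly. A minimal sketch, using tabular arrays as stand-ins for the behavior and target networks (the names are ours, not the paper's):

```python
import numpy as np

def dqn_loss(Q, Q_target, batch, gamma=0.99):
    """Mean squared TD error over a minibatch of transitions.

    Q, Q_target: (S, A) arrays standing in for Q_phi and Q_phi_minus.
    batch: list of (s, a, r, s_next) tuples sampled from the replay buffer.
    """
    err = 0.0
    for s, a, r, s_next in batch:
        # Bootstrap the target from the frozen copy of the network
        target = r + gamma * np.max(Q_target[s_next])
        err += (target - Q[s, a]) ** 2
    return err / len(batch)
```

Keeping Q_target fixed between periodic copies is what decouples the regression target from the parameters being updated.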
Offline RL (Levine et al., 2020) aims at finding an optimal policy with respect to J(π) using a static dataset of experience. The fundamental problem of offline RL is distributional shift: out-of-distribution samples lead to algorithmic instabilities and performance loss, both at training and deployment time. The conservative Q-learning (CQL) (Kumar, Zhou, Tucker, & Levine, 2020) algorithm is an offline RL algorithm that aims to estimate the optimal Q-function using ADP techniques, while mitigating the impact of distributional shift. Precisely, the algorithm avoids the overestimation of out-of-distribution actions by considering an additional conservative penalty term of the type

L_c(φ) = E_{s∼B, a∼ν(a|s)}[Q_φ(s, a)] − E_{(s,a)∼B}[Q_φ(s, a)],

which the algorithm aims to minimize. Distribution ν(a|s) is chosen adversarially such that it selects overestimated Q-values with high probability, e.g., by maximizing L_c.
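The conservative penalty can be sketched in a tabular setting. Here we take a softmax over the current Q-values as one concrete choice of the adversarial distribution ν (an assumption of this sketch; the full CQL objective also includes the TD loss):

```python
import numpy as np

def cql_penalty(Q, states, actions):
    """Conservative penalty L_c: push down Q-values under an adversarial
    action distribution nu (here a softmax over Q, which concentrates on
    high, possibly overestimated Q-values) and push up Q-values of the
    actions that actually appear in the dataset.

    Q: (S, A) array standing in for Q_phi; states/actions: dataset columns.
    """
    q_s = Q[states]                                   # (N, A)
    nu = np.exp(q_s - q_s.max(axis=1, keepdims=True))
    nu /= nu.sum(axis=1, keepdims=True)               # softmax approximates argmax of L_c
    pushed_down = (nu * q_s).sum(axis=1)              # E_{a~nu}[Q(s, a)]
    pushed_up = Q[states, actions]                    # Q at dataset actions
    return float(np.mean(pushed_down - pushed_up))
```

The penalty is zero when ν matches the dataset actions and grows when out-of-dataset actions carry large Q-values, which is what discourages their overestimation.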

Related Work
We now review and discuss several lines of research that, in one way or another, relate to the problem of studying the impact of the data distribution on Q-learning-related algorithms. We organize our discussion around three main topics: (i) studies that derive error bounds on the performance of Q-learning-related algorithms (Sec. 3.1); (ii) works that analyze unstable behavior in off-policy learning (Sec. 3.2); and (iii) studies that investigate the stability of deep RL methods and propose algorithms for offline RL (Sec. 3.3).

Error propagation in ADP
On the theoretical side, a number of works analyze error propagation in ADP methods, deriving error bounds on the performance of approximate policy iteration (Kakade & Langford, 2002; Munos, 2003) and approximate value iteration (Munos, 2005; Munos & Szepesvári, 2008) algorithms. Munos (2003) provides error bounds for approximate policy iteration using quadratic norms, as well as bounds on the error between the performance of the policies induced by the value iteration algorithm and the optimal policy as a function of weighted L_p-norms of the approximation errors (Munos, 2005). Munos and Szepesvári (2008) develop a theoretical analysis of the performance of sampling-based fitted value iteration, providing finite-time bounds on the performance of the algorithm. Yang, Xie, and Wang (2019) establish algorithmic and statistical rates of convergence for the iterative sequence of Q-functions obtained by the DQN algorithm. Chen and Jiang (2019) further improve the bounds of the previous studies while considering an offline RL setting. Common to all these works is the dependence of the derived bounds on concentrability coefficients that heavily depend on the data distribution. In this work, we review concentrability coefficients in Sec. 4.1, providing a motivation for the use of high entropy data distributions through the lens of robust optimization. We also analyze our experimental results, presented in Sec. 5, in light of the theoretical results from these previous articles.

Unstable behavior in off-policy learning
Several early studies analyze the unstable behavior of off-policy learning algorithms and the harmful learning dynamics that can lead to the divergence of the function parameters (Baird, 1995; Kolter, 2011; J. Tsitsiklis & Van Roy, 1997; J. N. Tsitsiklis & van Roy, 1996). For instance, Baird (1995), Kolter (2011), and J. N. Tsitsiklis and van Roy (1996) provide examples that highlight the unstable behavior of ADP methods. Kolter (2011) provides an example that highlights the dependence of the approximation error of the algorithm on the off-policy distribution. In Sec. 4.2, we propose a novel four-state MDP that highlights the impact of the data distribution on the performance of ADP methods. We further explore how off-policy algorithms are affected by data distribution changes, under diverse settings. We add to previous works by considering both offline settings comprising static data distributions, and online settings in which data distributions are induced by a replay buffer.

The stability of deep and offline RL algorithms
Several works investigate the stability of deep RL methods (Fu et al., 2019; Kumar, Fu, Tucker, & Levine, 2019; Kumar, Gupta, & Levine, 2020; Liu, Kumaraswamy, Le, & White, 2018; van Hasselt et al., 2018; Wang, Wu, Salakhutdinov, & Kakade, 2021; Zhang et al., 2021), as well as the development of RL methods specifically suited for offline settings (Agarwal, Schuurmans, & Norouzi, 2019; Levine et al., 2020; Mandlekar et al., 2021). For example, Kumar, Gupta, and Levine (2020) observe that Q-learning-related methods can exhibit pathological interactions between the data distribution and the policy being learned, leading to potential instability. Fu et al. (2019) investigate how different components of DQN play a role in the emergence of the deadly triad. In particular, the authors assess the performance of DQN with different sampling distributions, finding that higher entropy distributions tend to perform better. Agarwal et al. (2019) provide a set of ablation studies that highlight the impact of the dataset size and diversity in offline learning settings. Wang et al. (2021) study the stability of offline policy evaluation, showing that substantial error amplification can occur even under relatively mild distribution shift. In Sec. 5, we provide a systematic study on how different properties of the data distribution impact the performance of deep offline RL algorithms by directly controlling the dataset generation process, which allows us to rigorously control different dataset metrics and systematically compare our experimental results. We validate and extend previous results, as well as discuss our experimental findings in light of existing theoretical results.
Finally, Schweighofer et al. (2021) study the impact of dataset characteristics on offline RL. More exactly, the authors study the influence of the average dataset return and state-action coverage on the performance of different RL algorithms while controlling the dataset generation procedure. Despite some similarities with the experiments in Sec. 5, we present a much broader picture regarding the impact of the data distribution on the stability of general off-policy RL algorithms, from both theoretical and experimental points of view. Additionally, the experimental methodologies of the two works differ in several aspects, such as the calculation of the dataset metrics used for the presentation of the experimental results or the types of environments used.

Data distribution matters
In this section, we show that the data distribution plays an important role in regulating the performance of Q-learning-based algorithms, offering theoretical and empirical evidence to support this claim. In Sec. 4.1, we review and analyze the role played by concentrability coefficients in upper error bounds of ADP methods. Then, in Sec. 4.2, we propose a four-state MDP designed to highlight the impact of the data distribution on the performance of ADP methods.

Concentrability coefficients
As discussed in Sec. 3.1, several works analyze error propagation in ADP (Chen & Jiang, 2019; Munos, 2003, 2005; Munos & Szepesvári, 2008; Yang et al., 2019). Specifically, the aforementioned works provide upper bounds of the type

lim sup_{k→∞} ||V* − V^{π_k}||_{p,ρ} ≤ (2γ / (1 − γ)²) C^{1/p} sup_k ||ε_k||_{p,µ} + E,

where ε_k denotes the approximation error at iteration k. Intuitively, the bounds correspond to ρ-weighted L_p-norms (p ≥ 1) between V*/Q* and the value/action-value function induced by the greedy policy π_k with respect to the estimated value/action-value function at the k-th timestep. Such bounds comprise, in general, three key components:

1. A concentrability coefficient, C, that quantifies the suitability of the sampling distribution µ ∈ P(S) or µ ∈ P(S × A).
2. A measure of the approximation power of the function space, F, which reflects how well the function space is aligned with the dynamics and reward of the MDP.
3. A coefficient E that captures the sampling error of the algorithm, i.e., the error that accumulates due to limited sampling and iterations.
From the three components above, we focus our attention on the study of the concentrability coefficient, as it captures the impact of the data distribution on the tightness of the upper bound. Munos (2003) introduces the first version of this data-dependent concentrability coefficient, which is related to the density of the transition probability function. Specifically, the author defines the coefficient

C_1 = sup_{s,s'∈S, a∈A} p(s'|s, a) / µ(s'), (2)

with µ ∈ P(S) and the convention that 0/0 = 0, and C_1 = ∞ if µ(s') = 0 and p(s'|s, a) > 0. We use this convention for all upcoming coefficients. Intuitively, the noisier the dynamics of the MDP, the smaller the coefficient C_1 and the tighter the bound. Munos (2005) introduces a different concentrability coefficient related to the discounted average concentrability of future states in the MDP. Specifically, coefficient C_2 ∈ R_+ ∪ {+∞} is defined as

C_2 = (1 − γ)² ∑_{m≥1} m γ^{m−1} c(m), (3)

with

c(m) = sup_{π_1,...,π_m∈Π, s∈S} (ρ P^{π_1} P^{π_2} ... P^{π_m})(s) / µ(s), (4)

where µ, ρ ∈ P(S), Π denotes the space of all possible policies, and ρ reflects the importance of various regions of the state space and is selected by the practitioner. Intuitively, coefficient C_2 expresses some smoothness property of the future state distribution with respect to µ for an initial distribution ρ. Munos (2005) and Munos and Szepesvári (2008) note that the assumption that C_1 < ∞ is stronger than the assumption that C_2 < ∞. Farahmand, Szepesvári, and Munos (2010) and Yang et al. (2019) replace the supremum norm of (4) with a weighted norm of the type

c(m) = sup_{π_1,...,π_m∈Π} || (ρ P^{π_1} P^{π_2} ... P^{π_m}) / µ ||_{2,µ}, (5)

with µ, ρ ∈ P(S × A). We let C_3 ∈ R_+ ∪ {+∞} denote the coefficient defined by (3) and (5).
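For a small MDP with known dynamics, C_1 can be computed exhaustively. A minimal sketch on a hypothetical two-state MDP (our own example), illustrating that broader coverage yields a smaller coefficient:

```python
import numpy as np

def c1(P, mu):
    """C_1 = sup_{s,a,s'} p(s'|s,a) / mu(s'), with the conventions
    0/0 = 0 and C_1 = inf if mu(s') = 0 while p(s'|s,a) > 0."""
    # nan marks entries with p > 0 but mu = 0, i.e., an unbounded coefficient
    ratios = np.where(P > 0, P / np.where(mu > 0, mu, np.nan), 0.0)
    return np.inf if np.isnan(ratios).any() else float(ratios.max())

P = np.zeros((2, 2, 2))        # p(s'|s,a), indexed as P[s, a, s']
P[0, 0] = [1.0, 0.0]           # (s_0, a_0): stay in s_0
P[0, 1] = [0.1, 0.9]           # (s_0, a_1): mostly move to s_1
P[1, :] = [0.0, 1.0]           # s_1 is absorbing
```

Under a uniform µ = (0.5, 0.5) the coefficient is 2; under a skewed µ = (0.9, 0.1) it is 10; and any µ that misses a reachable state drives C_1 to infinity, matching the convention above.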
Unfortunately, although the concentrability coefficients above attempt to quantify distributional shift, they have limited interpretability. Specifically, it is hard to infer from the coefficients above which exact sampling distributions should be used. For example, if we consider coefficients C_2 and C_3, even if we know which parts of the state space are relevant according to the distribution ρ, the computation of the coefficient in (3) still depends on the complex interactions between ρ and the dynamics of the MDP under any possible policy. What can be concluded is that the concentrability coefficient will depend on all states that can be reached by any policy when the starting state distribution is given by ρ. However, it is not obvious which exact target distribution we should aim at when selecting the sampling distribution µ, especially when we do not have access to the full specification of the underlying MDP or to which regions of the state space are of interest. In the face of such uncertainty, previous works assume sufficient coverage of the state (and action) space, thus using upper bounded concentrability coefficients to analyze the performance of the algorithms (Chen & Jiang, 2019; Farahmand et al., 2010; Munos & Szepesvári, 2008; Yang et al., 2019).
More recently, Chen and Jiang (2019) revisit the assumption of a bounded concentrability coefficient and formally justify the necessity of mild distribution shift via an information-theoretic lower bound. Precisely, the authors show that, under a bounded concentrability coefficient defined using (4), polynomial sample complexity is precluded if the MDP dynamics are not restricted. In subsequent work, Xie and Jiang (2020) break the hardness conjecture introduced by Chen and Jiang (2019), albeit under a more restrictive concentrability coefficient similar to that of (2). Both works highlight the dependence of the concentrability coefficient on the properties of the MDP.
Other works (Amortila, Jiang, & Xie, 2020; Wang, Foster, & Kakade, 2020; Zanette, 2020) have also proved hardness results for offline RL, however, under an even weaker form of concentrability than that induced by Eqs. (2) and (4). For example, Wang et al. (2020) show that good coverage over the feature space is not sufficient to sample-efficiently perform offline policy evaluation with linear function approximation, and that significantly stronger assumptions on distributional shift may be needed. We refer to Xie and Jiang (2020) for a detailed discussion on the relation between the different proposed concentrability coefficients and hardness results for offline RL.
We conclude by noting that offline policy evaluation allows for more interpretable and amenable quantification of distributional shift; under such setting, distributional shift can be more accurately quantified in terms of a statistical distance between sampling distribution µ and the stationary distribution of the target policy, as recently proposed by Duan and Wang (2020).
In the next section, we provide an interpretation of C 3 as an f -divergence and give a new motivation for the use of maximum entropy sampling distributions from a game-theoretical point of view.

Motivating Maximum Entropy Distributions
Letting β = ρ P^{π_1} P^{π_2} ... P^{π_m}, we can rewrite (5) as

c(m)² = sup_{π_1,...,π_m∈Π} (D_f(β||µ) + 1), with f(t) = t² − 1,

where D_f denotes the f-divergence (in this case, the χ²-divergence). Optimizing (5) over the distribution µ is hard because we actually want to minimize D_f(β||µ) with respect to a large set of different β distributions, due to the supremum in (5) as well as the summation in (3). Furthermore, we usually do not know the transition probability function. Therefore, we analyze the problem of picking an optimal µ distribution as a robust optimization problem. Specifically, we formulate a minimax objective in which the minimizing player aims at choosing µ to minimize D_f(β||µ) and the maximizing player chooses β to maximize D_f(β||µ).
Proposition 1 Let P(S × A) represent the set of probability distributions over S × A, and let L_µ : P(S × A) → R be given by L_µ(β) = D_f(β||µ). The solution µ* to

arg min_{µ∈P(S×A)} max_{β∈P(S×A)} L_µ(β)

is the maximum entropy distribution over S × A. Proof in Appendix B.
As stated in Proposition 1, the maximum entropy distribution is the solution to the robust optimization problem. This result provides a theoretical justification for the benefits of using high entropy sampling distributions, as suggested by previous works (Kakade & Langford, 2002; Munos, 2003): in the face of uncertainty regarding the underlying MDP, high entropy distributions ensure coverage over the state-action space, thus contributing to keep concentrability coefficients bounded.

Fig. 1: Four-state MDP, with states {s_1, s_2, s_3, s_4} and actions {a_1, a_2}. State s_1 is the initial state and states s_3 and s_4 are terminal (absorbing) states. All actions are deterministic except for the state-action pair (s_1, a_1), where p(s_3|s_1, a_1) = 0.99 and p(s_2|s_1, a_1) = 0.01. The reward function is r(s_1, a_1) = 100, r(s_1, a_2) = −10, r(s_2, a_1) = −35 and r(s_2, a_2) = 30.
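The minimax intuition behind Proposition 1 can be checked numerically on a finite support: for the χ²-divergence, the inner maximum over β is attained at a point mass on the atom with least µ-mass, so the worst case equals 1/min_i µ(i) − 1, which the uniform (maximum entropy) distribution minimizes. A small sketch of our own, not from the paper:

```python
import numpy as np

def worst_case_chi2(mu):
    """max_beta D_f(beta||mu) with f(t) = t^2 - 1 (chi-square divergence).
    The objective sum_i beta_i^2 / mu_i - 1 is convex in beta, so its max
    over the simplex sits at a vertex: a point mass on argmin_i mu_i."""
    return 1.0 / float(mu.min()) - 1.0

n = 4
uniform = np.full(n, 1.0 / n)
rng = np.random.default_rng(1)
# Every other distribution has a smaller minimum mass, hence a worse worst case.
assert all(worst_case_chi2(uniform) <= worst_case_chi2(rng.dirichlet(np.ones(n)))
           for _ in range(1000))
```

For n atoms the uniform distribution attains worst case n − 1, and any deviation from uniformity strictly increases it.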
The bounds surveyed in this section suggest that high coverage over the state-action space and high entropy distributions are beneficial. However, it is important to note that the significance of the previous results is highly dependent on the actual tightness of the bound (Munos, 2005); rather loose bounds can trivially upper bound the error but be of little help in understanding algorithmic behavior. Thus, it is important to understand, from a practical point of view, whether the properties suggested by the surveyed bounds contribute to improved algorithmic performance. Therefore, we investigate, from an experimental point of view, how the data distribution impacts performance: (i) in the next section (Sec. 4.2), under our four-state MDP; and (ii) in Sec. 5, under high dimensional environments and two Q-learning-based algorithms.

Four-state MDP
We now study how the data distribution influences the performance of a Q-learning algorithm with function approximation under the four-state MDP (Fig. 1). We show that the data distribution can significantly influence the quality of the resulting policies and affect the stability of the learning algorithm. Due to space constraints, we focus our discussion on the main conclusions and refer to Appendix B for an in-depth discussion.
We focus our attention on the non-terminal states s_1 and s_2 and set γ = 1. In state s_1 the correct action is a_1, whereas in state s_2 the correct action is a_2. We consider a linear function approximator Q_w(s_t, a_t) = w^T φ(s_t, a_t), where φ is a feature mapping (see Appendix B for the full specification). The capacity of the function approximator is limited and there exists a correlation between Q_w(s_1, a_1) and Q_w(s_2, a_1). This will be key to the results that follow.

Offline Learning
We consider an offline RL setting and denote by µ the distribution over S × A induced by a static dataset of transitions. We focus our attention on probabilities µ(s_1, a_1) and µ(s_2, a_1), since these are the probabilities associated with the two partially correlated state-action pairs. Fig. 2 displays the influence of the data distribution, namely the proportion between µ(s_1, a_1) and µ(s_2, a_1), on the number of correct actions yielded by the learned policy. We identify three regimes: (i) when µ(s_1, a_1) ≈ 0.5, we learn the optimal policy; (ii) if µ(s_1, a_1) < (≈ 0.48) or (≈ 0.52) < µ(s_1, a_1) < (≈ 0.65), the policy is only correct at one of the states; (iii) if µ(s_1, a_1) > (≈ 0.65), the policy is wrong at both states. The results above show that, due to the limited power of and correlation between features, the data distribution impacts performance, as the number of correct actions is directly dependent on the properties of the data distribution. As our results show, due to bootstrapping, it is possible that under certain data distributions neither action is correct.

Online Learning with Unlimited Replay
Instead of considering a fixed µ distribution, we now consider a setting where µ is dynamically induced by a replay buffer obtained using ε-greedy exploration. Figure 3 shows the results when α = 1.2, under: (i) an ε-greedy policy with ε = 1.0; and (ii) an ε-greedy policy with ε = 0.05. We consider a replay buffer with unlimited capacity. We use a uniform data distribution as baseline. As seen in Fig. 3, the baseline outperforms all other data distributions, as expected given our discussion in the previous section. Regarding the ε-greedy policy with ε = 1.0, the agent is only able to pick the correct action at state s_1, featuring a higher average Q-value error in comparison to the baseline. This is due to the fact that the stationary distribution of the MDP under the fully exploratory policy is too far from the uniform distribution to retrieve the optimal policy. Finally, for the ε-greedy policy with ε = 0.05, the performance of the agent further deteriorates. Such an exploratory policy induces oscillations in the Q-values (Fig. 3 (b)), which eventually damp out as learning progresses. The oscillations are due to an undesirable interplay between the features and the data distribution: exploitation may cause abrupt changes in the data distribution and hinder learning.

Online Learning with Limited Replay Capacity
Finally, we consider an experimental setting where the replay buffer has limited capacity and study the impact of its size on the stability of the algorithm. Figure 4 displays the results obtained with the ε-greedy policy with ε = 0.05, while varying the capacity of the replay buffer. As can be seen, as the replay buffer size increases, the oscillations in the Q-value errors become smaller. The undesirable interplay previously observed under the infinitely-sized replay buffer repeats. However, the smaller the replay buffer capacity, the more the data distribution induced by the contents of the replay buffer is affected by changes to the current exploratory policy, i.e., exploitation leads to more abrupt changes in the data distribution, which, in turn, drive abrupt changes to the Q-values. For the infinitely-sized replay buffer, the amplitude of the oscillations is dampened because previously stored experience contributes to make the data distribution more stationary, which is not as easily achieved by smaller buffers.
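The mechanism can be illustrated with a toy FIFO buffer: a capacity-limited buffer tracks only the most recent (greedy) data, while a large buffer retains earlier exploratory experience, keeping the induced distribution more stationary. The transition stream below is hypothetical, used only to show the effect:

```python
import collections

def buffer_distribution(stream, capacity):
    """Empirical state-action distribution induced by a FIFO replay buffer
    of a given capacity after consuming a stream of transitions."""
    buf = collections.deque(maxlen=capacity)  # old entries are evicted first
    for sa in stream:
        buf.append(sa)
    counts = collections.Counter(buf)
    total = sum(counts.values())
    return {sa: c / total for sa, c in counts.items()}

# Hypothetical stream: exploratory data first, then greedy data only.
stream = [("s1", "a1")] * 500 + [("s2", "a1")] * 500 + [("s1", "a1")] * 500
small = buffer_distribution(stream, capacity=200)    # only recent greedy data
large = buffer_distribution(stream, capacity=5000)   # retains early experience
```

The small buffer collapses onto the most recent state-action pair, while the large buffer still assigns a third of its mass to the earlier exploratory data.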

Discussion
We presented a set of experiments using a four-state MDP that show how the data distribution can influence the performance of the resulting policies and the stability of the learning algorithm. First, we showed that, under an offline RL setting, the number of optimal actions identified is directly dependent on the properties of the data distribution due to an undesirable correlation between features. Second, not only does the quality of the computed policies depend on the data collection mechanism, but an undesirable interplay between the data distribution and the function approximator can also arise: exploitation can lead to abrupt changes in the data distribution and hinder learning. Finally, we showed that the replay buffer size can also affect the learning dynamics. Despite the fact that we study a four-state MDP, we argue that the example presented here is still relevant under more realistic settings; we further elaborate on this point in Appendix B.

Assessing the Impact of Data Distribution in Offline RL
In this section, we experimentally assess the impact of different data distribution properties on the performance of offline DQN (Mnih et al., 2015) and CQL (Kumar, Zhou, et al., 2020). We evaluate the performance of the algorithms under six different environments: the grid 1 and grid 2 environments are standard tabular environments with highly uncorrelated state features, the multi-path environment is a hard-exploration environment, and the pendulum, mountaincar and cartpole environments are benchmarking environments featuring a continuous state-space domain. All reported values are calculated by aggregating the results of different training runs. The description of the experimental environments and the experimental methodology, as well as the complete results, can be found in Appendix C. The developed software can be found at https://github.com/PPSantos/rl-data-distribution-public. We also provide an interactive dashboard with all our experimental results at https://rldatadistribution.pythonanywhere.com/.
In this section, we denote by µ the data distribution over state-action pairs induced by a static dataset of transitions. We consider two types of offline datasets: (i) ε-greedy datasets, generated by running an ε-optimal policy on the MDP, i.e., a policy that is ε-greedy with respect to the optimal Q-values, with ε ∈ [0, 1]; and (ii) Boltzmann(T) datasets, generated by running a Boltzmann policy with respect to the optimal Q-values with temperature coefficient T ∈ [−10, 10]. Additionally, we artificially enforce that some of the generated datasets have full coverage over the S × A space. We do this by running an additional procedure that ensures that each state-action pair appears at least once in the dataset. We chose not to use publicly available datasets for offline RL (Fu, Kumar, Nachum, Tucker, & Levine, 2020; Gülçehre et al., 2020; Qin et al., 2021) in order to have complete control over the dataset generation procedure, which allows us to rigorously control different dataset metrics and systematically compare our experimental results. Nevertheless, our results are representative of a diverse set of discrete action-space control tasks. Two aspects are worth highlighting. First, in all environments, the sampling error is low due to the highly deterministic nature of the underlying MDPs. Thus, a single next-state sample is sufficient to correctly evaluate the Bellman optimality operator (Eq. (1)). Second, the function approximator has enough capacity to correctly represent the optimal Q-function, a property known as realizability (Chen & Jiang, 2019).
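The dataset generation procedure can be sketched as follows. Function and variable names are our own, and the Boltzmann sketch assumes a positive temperature (the paper's T range also covers other values):

```python
import numpy as np

def boltzmann_dataset(Q_star, P, R, T, size, seed=0):
    """Generate an offline dataset by running a Boltzmann(T) policy with
    respect to the optimal Q-values (a sketch; assumes T > 0)."""
    rng = np.random.default_rng(seed)
    S, A = Q_star.shape
    data, s = [], 0
    for _ in range(size):
        logits = Q_star[s] / max(T, 1e-8)
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # softmax over optimal Q-values
        a = rng.choice(A, p=p)
        s_next = rng.choice(S, p=P[s, a])
        data.append((s, a, R[s, a], s_next))
        s = s_next
    return data

def enforce_coverage(data, P, R, rng=np.random.default_rng(0)):
    """Ensure every (s, a) pair appears at least once in the dataset."""
    S, A, _ = P.shape
    seen = {(s, a) for s, a, _, _ in data}
    for s in range(S):
        for a in range(A):
            if (s, a) not in seen:
                s_next = rng.choice(S, p=P[s, a])
                data.append((s, a, R[s, a], s_next))
    return data
```

The coverage-enforcement pass mirrors the procedure described above: it only adds the state-action pairs missing from the generated dataset, leaving the rest of the distribution untouched.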

High entropy is beneficial
We start our analysis by studying the impact of the dataset distribution entropy, H(µ), on the performance of the offline RL algorithms. Figure 5 displays the average normalized rollouts reward for datasets with different normalized entropies. As can be seen, under all environments and for both offline RL algorithms, high entropy distributions tend to achieve increased rewards. In other words, distributions with a large entropy appear to be well-suited for offline learning settings. Such an observation is in line with the discussion in Sec. 4.1 and works such as (Kakade & Langford, 2002; Munos, 2003): high entropy distributions contribute to increased coverage, keeping concentrability coefficients bounded and, thus, mitigating algorithmic instabilities.
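Concretely, the dataset entropy we refer to can be computed from the empirical state-action counts; a short sketch with our own naming:

```python
import numpy as np
from collections import Counter

def normalized_entropy(pairs, num_states, num_actions):
    """Entropy of the empirical state-action distribution, normalized by
    the entropy of the uniform distribution over S x A (so it lies in [0, 1])."""
    counts = Counter(pairs)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    H = -(p * np.log(p)).sum()
    return H / np.log(num_states * num_actions)
```

A dataset touching every state-action pair equally scores 1, while a dataset concentrated on a single pair scores 0.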
Importantly, we do not claim that high entropy distributions are the only distributions suitable to be used.As can be seen in Fig. 5, certain lower-entropy distributions also perform well.In the next sections, we investigate which other properties of the distribution are of benefit to offline RL.

Dataset coverage matters
We now study the impact of dataset coverage, i.e., the diversity of the transitions in the dataset, on the performance of the offline agents. In order to keep the discussion concise, in this section we focus our attention on ε-greedy datasets, and refer to Appendix C for the complete results. We start by focusing on the offline DQN algorithm. Figure 6 (a) displays the average normalized rollouts reward under ε-greedy datasets with dataset coverage not enforced. As can be seen, DQN struggles to achieve optimal rewards for low values of ε, i.e., even though the algorithm is provided with optimal or near-optimal trajectories, it is unable to steadily learn under such a setting. However, as ε increases, the performance of the algorithm increases, eventually decaying again for high ε values. Such results suggest that a certain degree of data coverage is required by DQN to robustly learn in an offline manner, despite being provided with high quality data (rich in rewards). On the other hand, the decay in performance for highly exploratory policies under some environments can be explained by the fact that such policies induce trajectories that are poor in reward (this is further explored in the next section). Figure 6 (b) displays the obtained experimental results under the exact same datasets, except that we enforce coverage over S × A. We note a substantial improvement in the performance of DQN across all environments, supporting our hypothesis that data coverage plays an important role in regulating the stability of offline RL algorithms.
The CQL algorithm appears to perform more robustly than DQN. In particular, as seen in Fig. 6 (a), CQL is able to robustly learn with low ε values, i.e., using optimal or near-optimal trajectories that feature low coverage. Additionally, no substantial performance gain is observed for the offline CQL agent when dataset coverage is enforced (Fig. 6 (b)).
The finding that data coverage appears to play an important role in regulating the performance of DQN, even when considering high quality, near-optimal trajectories, is in line with the discussion presented in Sec. 4.1. One could argue that we are only interested in correctly estimating the Q-values along an optimal trajectory; however, due to the bootstrapped nature of the updates, errors in the estimated Q-values of adjacent states can erroneously affect the estimated Q-values along the optimal trajectory. The argument above is suggested by concentrability coefficients. If we take the distribution ρ from (4) or (5) to be uniform over the states of the optimal trajectory and zero elsewhere, the concentrability coefficient given by (3) still depends on states other than those of the optimal trajectory. Precisely, the coefficient depends on all the states that can be reached by any policy when the starting state is sampled according to ρ (the importance of each state decays geometrically with its distance to the optimal trajectory). Therefore, in order to keep the concentrability coefficient low, it is important that such states are present in the dataset. On the other hand, CQL is still able to robustly learn from high quality trajectories independently of data coverage because of its pessimistic nature. Since the algorithm penalizes the Q-values of actions that are underrepresented in the dataset, the error for adjacent states does not propagate during the execution of the algorithm.
In this section, we considered datasets that are, in general, aligned with and close to those induced by optimal policies. What happens if our data is collected by arbitrary policies? We investigate the impact of trajectory quality in the next section.

Closeness to optimal policy matters
We now investigate how offline agents are affected by the quality of the trajectories contained in the dataset. More precisely, we study how the statistical distance between the distribution µ and the distribution induced by one of the optimal policies of the MDP, d π*, affects offline learning.
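As an illustration of this quantity, the sketch below uses the total-variation distance and takes the minimum over all optimal-policy distributions; TV is one possible instantiation of "statistical distance", chosen here for simplicity rather than taken from the paper:

```python
import numpy as np

def total_variation(p, q):
    """Total-variation distance between two distributions over S x A,
    represented as flat probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.abs(p - q).sum())

def distance_to_optimal(mu, optimal_dists):
    """Distance from mu to the *closest* distribution induced by one of
    the (possibly many) optimal policies of the MDP."""
    return min(total_variation(mu, d) for d in optimal_dists)
```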
The obtained experimental results are portrayed in Fig. 7, which shows the average normalized rollouts reward for different distances between µ and the closest distribution induced by one of the optimal policies, d π*. We consider a wide spectrum of behavior policies, from optimal to anti-optimal (i.e., Boltzmann policies with low T values), as well as from fully exploitatory to fully exploratory. As can be seen, as the statistical distance between µ and the closest optimal-policy distribution increases, the obtained rewards decrease, irrespective of the algorithm. We also observe an increase in obtained rewards when dataset coverage is enforced (Fig. 7 (b)) in comparison to when it is not (Fig. 7 (a)).
At first sight, our results appear intuitive if we focus on Fig. 7 (a), where dataset coverage is not enforced: if the policy used to collect the dataset is not good enough, it will fail to collect trajectories rich in rewards, which are key to learning reward-maximizing behavior. As an example, if the policy used to collect the data is highly exploratory, the agent will likely not reach high-rewarding states and the learning signal may be too weak to learn an optimal policy.
However, the results displayed in Fig. 7 (b), in which dataset coverage is enforced, reveal a rather less intuitive finding: despite the fact that all datasets feature full coverage over S × A, if the statistical distance between the two distributions is high, we observe a deterioration in algorithmic performance. In other words, even though the datasets contain all the information that can be retrieved from the environment (including transitions rich in reward), offline learning can still struggle if the behavior policy is too distant from the optimal policy. This observation can be explained by the fact that distributions far from the optimal policy prevent the propagation of information, namely Q-values, during the execution of the offline RL algorithm.
Given the experimental results presented in this section, it is important for the data distribution to be aligned with that of optimal policies, not only to ensure that trajectories are rich in reward, but also to mitigate algorithmic instabilities. Our experimental results suggest that the assumption of a bounded concentrability coefficient, as discussed in Sec. 4.1, may not be enough to robustly learn in an offline manner and that more stringent assumptions on the data distribution are required. Wang et al. (2020) reach a similar conclusion from a theoretical perspective.

Discussion
This section experimentally assessed the impact of different data distribution properties on the performance of offline Q-learning algorithms with function approximation, showing that the data distribution greatly impacts algorithmic performance. In summary, our results show that: (i) high entropy data distributions are well-suited for learning in an offline manner; (ii) a certain degree of data diversity (data coverage) is desirable for offline learning; and (iii) a certain degree of data quality (closeness to distributions induced by optimal policies) is desirable for offline learning.
Finding (i) is aligned with the discussion in Sec. 4.1: in the absence of detailed information regarding the underlying MDP, high entropy distributions contribute to high coverage over the state-action space, thus yielding bounded concentrability coefficients (an assumption widely adopted by the works surveyed in Sec. 4.1). However, as our experiments in Sec. 5.3 show, full coverage (equivalent to having bounded concentrability coefficients) is not enough to learn optimal policies. Thus, we hypothesize that the advantages of using high entropy distributions come not only from the fact that they yield high coverage over the state-action space, but also because they induce smooth distributions that mitigate information bottlenecks during algorithm execution, allowing Q-values to propagate easily according to the MDP dynamics. This hypothesis is supported by the fact that, as we show in Proposition 1, the maximum entropy distribution is the one that minimizes the statistical distance to all other possible distributions.
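The coverage side of this argument can be checked numerically with Dirichlet-sampled distributions, in the spirit of Fig. B1. The sketch below uses assumed support size, dataset size, and Dirichlet parameters rather than the paper's exact experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy of a probability vector (0 log 0 = 0)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_coverage(p, m):
    """Expected fraction of the support hit at least once by m i.i.d.
    draws from p: mean over outcomes of 1 - (1 - p_i)^m."""
    return float((1.0 - (1.0 - p) ** m).mean())

# Small Dirichlet alpha -> peaked, low-entropy distributions;
# large alpha -> near-uniform, high-entropy distributions.
peaked = rng.dirichlet(np.full(20, 0.1))
smooth = rng.dirichlet(np.full(20, 100.0))
```

Under this setup, the high-entropy distribution yields datasets with higher expected coverage, matching the trend in Fig. B1 (a).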
Regarding finding (ii), a certain degree of coverage is necessary, even when the data is collected by an optimal policy, due to bootstrapping, as already discussed in Sec. 5.2. However, it is interesting to note that, according to our results (Table 1), full coverage over the state-action space is not necessary to learn optimal behavior. This finding is supported by the discussion in Sec. 4.1: as suggested by Eqs. (3) and (4), the states that have positive probability under the distribution induced by the optimal policy are those we care most about; all other states are exponentially less important depending on their distance to the states belonging to the optimal trajectory (in terms of the number of MDP transitions). This contrasts with the theoretical works surveyed in Sec. 4.1 that, in the absence of knowledge regarding the exact underlying MDP and the quality of the offline trajectories provided, assume bounded concentrability coefficients, i.e., full coverage over the state-action space.
Finally, according to finding (iii), it also appears important that the distribution induced by the dataset is not too far from the distribution induced by one of the optimal policies of the MDP, even if all state-action pairs are present in the dataset. Again, we hypothesize that this finding can be explained by the fact that certain distributions prevent the propagation of Q-values throughout the iterations of the algorithm. Further investigations should be carried out to understand whether this problem can be circumvented by more sophisticated sampling techniques such as prioritized replaying. We leave this research direction for future work.

On the optimism vs pessimism tradeoff
As seen in Fig. 7, the performance of DQN and CQL depends on whether coverage is enforced. When dataset coverage is not enforced (Fig. 7 (a)), CQL outperforms DQN, especially for distributions close to those of optimal policies (as can also be seen in Fig. 6 (a)). However, when coverage is enforced (Fig. 7 (b)), DQN outperforms CQL, especially for distributions that are more distant from those of optimal policies. This is because the two algorithms balance the tradeoff between optimism and pessimism in different ways. DQN is very optimistic and fails under low-coverage settings, since it propagates erroneous Q-values during the execution of the algorithm. However, due to its optimistic nature, it outperforms CQL when coverage is enforced, taking advantage of information that is underrepresented in the dataset. CQL, on the other hand, outperforms DQN under low-coverage settings since its pessimistic nature prevents the propagation of erroneous Q-values. However, when valuable but underrepresented information is present in the dataset, the pessimism of CQL prevents learning, and CQL is outperformed by DQN.
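The differing behavior follows from the two objectives. The sketch below contrasts a plain Bellman-error objective (offline DQN) with the CQL(H)-style pessimism term, which pushes down Q-values of actions under-represented in the dataset. This is a simplified tabular-batch illustration, not the exact implementation used in the experiments:

```python
import numpy as np

def td_targets(q_next, r, gamma):
    """One-step bootstrapped targets r + gamma * max_a' Q(s', a')."""
    return r + gamma * q_next.max(axis=1)

def dqn_loss(q_sa, targets):
    """Plain (optimistic) mean squared Bellman error used by offline DQN."""
    return float(((q_sa - targets) ** 2).mean())

def cql_loss(q_all, actions, targets, alpha=1.0):
    """DQN loss plus a CQL(H)-style pessimism term,
    alpha * mean(logsumexp_a Q(s, a) - Q(s, a_data)),
    which is nonnegative and penalizes out-of-dataset actions."""
    q_sa = q_all[np.arange(len(actions)), actions]
    lse = np.log(np.exp(q_all).sum(axis=1))   # logsumexp over actions
    penalty = float((lse - q_sa).mean())
    return dqn_loss(q_sa, targets) + alpha * penalty
```

Since logsumexp over actions upper-bounds the Q-value of the dataset action, the penalty is always nonnegative: CQL trades off fitting the targets against staying pessimistic outside the data.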

On the impact of the sampling error, approximation capacity, and generalization hardness
In our experiments, we consider environments featuring low sampling error, i.e., the environments have almost deterministic transitions. We also consider large-capacity function approximators, i.e., approximators with enough capacity to represent the optimal Q-function exactly. Naturally, we expect our results to change if these assumptions were to change. Namely, more samples per state-action pair would be required as the stochasticity of the environment increases. If the capacity of the approximator decreases, we expect the approximator to need to focus on the subset of state-action pairs most relevant to optimal behavior rather than correctly estimating the Q-values for all state-action pairs. Finally, in order to better study the impact of different data distribution properties on the performance of ADP-related algorithms, we considered three environments, namely the grid 1, grid 2, and multi-path environments, which feature highly uncorrelated features. We expect the performance of ADP-related algorithms to be less affected by changes to the data distribution under smoother features. Nevertheless, our main findings appear to be consistent for the remaining environments, namely the pendulum, cartpole, and mountain car environments, which comprise features that more easily generalize across the state-action space.

Conclusion
In this work, we investigate the interplay between the data distribution and Q-learning-based algorithms with function approximation. We analyze how different properties of the data distribution affect performance in both online and offline RL settings. We show, both theoretically and empirically, that: (i) high entropy data distributions contribute to mitigating sources of algorithmic instability; and (ii) different properties of the data distribution influence the performance of RL methods with function approximation. We provide a thorough experimental assessment of the performance of both the DQN and CQL algorithms under several types of offline datasets, connecting our experimental results with the theoretical findings of previous works.
The experimental results presented herein provide useful insights for the development of improved data processing techniques for offline RL, which should be valuable for future research. For example, our results suggest that maximum entropy exploration methods (Hazan, Kakade, Singh, & Soest, 2018) can be well suited for the construction of datasets for offline RL, that naive dataset concatenation can lead to a deterioration in performance, and that, by simply reweighting or discarding training data, it is possible to substantially improve the performance of offline RL algorithms.

As the last equality states, the f-divergence above corresponds to a particular type of divergence known as the χ²-divergence (Liese & Vajda, 2006).
Below, we prove Proposition 1. We restate the proposition before the proof for ease of reference.
Proposition 1. The distribution µ ∈ P(S × A) that minimizes the maximum statistical distance to any other distribution over P(S × A) is the uniform distribution over the state-action space.
Proof. Suppose that, as discussed, the maximizing player (adversary) has access to a distribution µ chosen by the minimizing player. The adversary's objective is therefore to maximize Lµ over the probability simplex. We begin by characterizing the solution of this maximization. First, Lµ, being a norm, is a convex real function; Lµ is also continuous. Additionally, the probability simplex P(S × A) is a compact set, since it is both closed and bounded. Under these three conditions, Bauer's maximum principle guarantees that the solution of the maximization lies in the set of extreme points of the admissible region. In the case of the probability simplex, this is the set of probability distributions with a singleton support set. Equivalently, the adversary chooses a distribution β such that, for some pair (s, a), β(s, a) = 1 and β is zero otherwise. Finally, the minimizing player's best choice of µ is the one minimizing the resulting worst-case quantity, which is attained by the uniform distribution.

Figure B1 displays the relationship between the expected entropy of the sampling distribution µ and: (i) the coverage of different datasets constructed using µ; and (ii) the mean χ²-divergence to all other distributions. The plots are computed using randomly sampled distributions, which we sample from different Dirichlet distributions. We control the expected entropy of the resulting distributions by varying the Dirichlet parameter α. As seen in Fig. B1 (a), irrespective of the dataset size, higher entropy distributions lead to datasets featuring higher coverage over the distribution support. As seen in Fig. B1 (b), higher entropy distributions yield, on average, a lower χ²-divergence to all other sampled distributions in comparison to lower entropy distributions.
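The minimax structure of the proof can be checked numerically. Assuming the distance is the χ²-divergence discussed above, the adversary's best response against µ is a point mass, with value 1/min_i µ_i − 1, which is minimized by the uniform distribution. The dimension and the sampling of alternative distributions below are arbitrary choices for illustration:

```python
import numpy as np

def chi2(beta, mu):
    """chi^2-divergence chi^2(beta || mu) = sum_i (beta_i - mu_i)^2 / mu_i
    (mu assumed to have full support)."""
    return float(((beta - mu) ** 2 / mu).sum())

def worst_case(mu):
    """Adversary's best response: a point mass on a single (s, a) pair,
    as given by Bauer's maximum principle; equals 1/min_i mu_i - 1."""
    n = len(mu)
    return max(chi2(np.eye(n)[i], mu) for i in range(n))
```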

B.2 Supplementary Materials for Section 4.2
In this section, we study how the data distribution can influence the performance of Q-learning-based RL algorithms with function approximation under a four-state MDP (Fig. B2). The main objective of this section is to show that the data distribution plays an active role in regulating algorithmic stability. We show that the data distribution can significantly influence the quality of the resulting policies and affect the stability of the learning algorithm. We consider both online and offline RL settings. We finish this section by summarizing the key insights and discussing how the example presented here, as well as the respective findings, can generalize to larger and more realistic MDPs.
As can be seen, due to the choice of feature mapping, the capacity of the function approximator is limited and the features of Q_w(s1, a1) and Q_w(s2, a1) are correlated. This will be key to the results that follow.

B.2.1 Offline Oracle Version
We start by assuming that we have access to an oracle providing us with the exact optimal Q-function, and we write the loss of the function approximator as the µ-weighted squared error, L(w) = Σ_{(s,a)} µ(s, a) (Q_w(s, a) − Q*(s, a))².

The number of correct actions at states s1 and s2 of the four-state MDP for different data distributions. The number of correct actions is calculated using Eqs. B7 for different µ(s1, a1) and µ(s1, a2) proportions.
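A minimal sketch of this oracle setting: weighted least squares under µ, with a hand-picked feature matrix whose shared feature mimics the correlation between Q_w(s1, a1) and Q_w(s2, a1) discussed above. The features and Q* values below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def oracle_fit(phi, q_star, mu):
    """Minimize L(w) = sum_i mu_i * (phi_i . w - q_star_i)^2 via the
    weighted least-squares normal equations."""
    d = np.diag(mu)
    return np.linalg.solve(phi.T @ d @ phi, phi.T @ d @ q_star)

# Rows correspond to (s1, a1), (s1, a2), (s2, a1); the first and third
# rows share weight w1, so their Q-estimates are tied.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 0.0]])
q_star = np.array([1.0, 0.5, 0.0])  # illustrative optimal Q-values
```

Shifting probability mass between (s1, a1) and (s2, a1) pulls the shared weight w1 between their conflicting targets, so the greedy action at s1 can flip with µ even though Q* is fixed.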
Akin to the oracle version (Eqs. B7), the data distribution plays a key role in the performance of the resulting policies. However, for the present temporal difference version, since we now wrongly estimate Q_w(s1, a1) < Q_w(s1, a2), exploitation leads to an increase in the probabilities µ(s1, a2) and µ(s2, a1), driving weight w1 down, until µ(s1, a1) is low enough that we correctly estimate Q_w(s1, a1) > Q_w(s1, a2) again. As can be seen in Fig. B6 (c), after an initial increase, we observe a decrease in the probability µ(s1, a1), as well as an increase in the probability µ(s2, a1), between episodes 2500 and 5000; the increase in the probability µ(s2, a1) is driven by the fact that Q_w(s2, a1) > Q_w(s2, a2) always holds (Fig. B7 (b)); (v) the described interplay repeats. However, the oscillation is now dampened by the fact that the replay buffer is already partially filled with previous experience.

B.2.4 Online Temporal Difference Version with Limited Replay Capacity
Lastly, we consider an experimental setting where the replay buffer has limited capacity. Figures B8 and B9 display the experimental results obtained with the ε-greedy exploratory policy with ε = 0.05 while varying the size/capacity of the replay buffer. As can be seen in Fig. B9, as the replay buffer size increases, the amplitude of the oscillations in the µ distribution gets smaller. Moreover, as the replay buffer size increases, the oscillations in the Q-values and Q-value errors are smaller, as seen in Fig. B8. Given the previous discussion, these results are expected. The undesirable interplay between the function approximator and the data distribution repeats as previously discussed for the infinitely-sized replay buffer. However, as the replay buffer gets smaller, the data distribution induced by the contents of the replay buffer is more strongly affected by changes to the current exploratory policy, in this case the ε-greedy exploratory policy. Therefore, for smaller replay buffers, exploitation leads to steeper changes in the µ(s1, a1) and µ(s2, a1) probabilities, as seen in Fig. B9 (a). Such changes in the µ distribution drive abrupt changes in weights w1 and w2, as well as in the estimated Q-values, as seen in Fig. B8 (a). Similarly to the infinitely-sized replay buffer, the agent keeps alternating between phases where it estimates Q_w(s1, a1) > Q_w(s1, a2) and phases where it estimates the opposite. However, the period at which phases switch is longer for smaller replay buffer sizes. Whereas for the infinitely-sized replay buffer the amplitude of the oscillations is dampened by the fact that previously stored experience makes the data distribution more stationary, this is not as easily achieved with smaller replay buffers. As our results suggest, the size of the replay buffer influences the stability of the data distribution, which can, in turn, affect the stability of the learning algorithm and the quality of the resulting policies.

Fig. B6: Four-state MDP experiments for different exploratory policies with an infinitely-sized replay buffer. The plots display the data distribution probabilities µ(s1, a1) and µ(s2, a1), as induced by the contents of the replay buffer, throughout episodes.
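The capacity effect discussed above can be sketched with a simple FIFO buffer: as capacity shrinks, the empirical distribution induced by the buffer tracks the most recent behavior policy more closely. This is a minimal illustration, not the experimental implementation:

```python
from collections import Counter, deque

class ReplayBuffer:
    """FIFO replay buffer over (s, a, r, next_s) transitions; a full
    buffer evicts its oldest transition on every insertion."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):
        self.buf.append(transition)

    def distribution(self):
        """Empirical distribution mu over (s, a) induced by the buffer."""
        counts = Counter((t[0], t[1]) for t in self.buf)
        n = len(self.buf)
        return {sa: c / n for sa, c in counts.items()}
```

With a small capacity, a shift in the exploratory policy quickly displaces old transitions, so µ swings with the policy; a large capacity averages over many past policies, making µ more stationary.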

B.2.5 Discussion
In summary, this section presented a set of experiments under a four-state MDP that show how the data distribution can greatly influence the performance of the resulting policies and the stability of the learning algorithm. First, we showed that, under offline RL settings, the number of correct actions at states s1 and s2 depends strongly on the data distribution.

Fig. 2: The number of correct actions at states s 1 and s 2 for different data distributions (α = 1.25).

Fig. 7: Average normalized rollouts reward; the x-axis encodes the statistical distance between µ and the closest distribution induced by one of the optimal policies, d π*.
Fig. B1: The relationship between the expected entropy of the distribution µ and: (i) the dataset coverage; and (ii) the χ²-divergence to all other distributions.

Fig. B5: Four-state MDP experiments for different exploratory policies with an infinitely-sized replay buffer.
Four-state MDP experiments for the ε-greedy exploratory policy with ε = 0.05 and an infinitely-sized replay buffer. The plots display the estimated weights and Q-values throughout episodes.

Fig. B8: Four-state MDP experiments for the ε-greedy exploratory policy with ε = 0.05 under different replay buffer sizes. The plots display the estimated Q-values throughout episodes.

Table 1:
Performance metrics for the grid 1 and mountain car environments under the DQN algorithm (coverage not enforced). For reference, the maximum average Q-values errors recorded across all tested dataset types under the grid 1 and mountain car environments are, respectively, 2.45 × 10^5 and 50.95.