1 Introduction

Recently, deep reinforcement learning (DRL) has received increasing interest in RS due to its capability in capturing users’ dynamic interests [1]. Current DRL-based RS can be generally categorized into three streams: value-based methods, policy-based methods, and hybrid methods. A representative value-based method is Deep Q-learning (DQN), which [2] introduced into news recommendation. However, deep Q-learning-based methods require the “maximize” operation over the action space (i.e., all the candidate items), which is not tractable and may cause the agent to get stuck [3]. Policy-gradient methods can mitigate such problems but suffer from high variance, as the optimization is based on the last step’s trajectory, which could be distinct from previous trajectories [4]. Hybrid methods combine policy-gradient and value-based methods, aiming to reduce the variance of policy gradients by introducing a value function [5], and have gained increasing attention [6,7,8,9].

However, user-item interactions are commonly sparse, hindering policy optimization from both finding rewards via exploration and maximizing performance via exploitation. Specifically, as DRL relies on carefully engineered environment rewards that are extrinsic to the agent, such sparsity rarely provides dense reward signals (i.e., most of the reward signals may be missing because of the highly incomplete interactions and user feedback). Hence, new exploration strategies are a natural choice to encourage agents to discover a wider range of states and formulate richer interaction trajectories [10]. Recent literature [11, 12] shows that exploration is effective in reducing model uncertainty in regions of sparse rewards or user interactions. Moreover, most existing works in DRL RS apply \(\epsilon \)-greedy as the exploration strategy, where the agent explores randomly with probability \(\epsilon \) [1]. However, random exploration increases training time and uncertainty and may not explore enough informative interaction trajectories. It also costs a considerable number of trials, making it infeasible for the highly sparse user feedback in recommender systems, since a significant number of trials is also required for exploitation, which is known as the exploration and exploitation trade-off.

From a different perspective, several attempts have been made to relieve the sparsity via data augmentation. Experience replay is widely used in DRL methods, empowering agents to learn by reusing past interaction trajectories. However, experience replay can only promote certain trajectories to be replayed [13], and the policy learning process may be harmed if the generated trajectories are not informative. Recent studies also investigate equipping data augmentation with causality to govern the generation of informative trajectories. For instance, [14] designs a simple counterfactual method that measures the embedding change to generate new user sequences.

Moreover, [15] leverages causality to split the embedding into two parts, dispensable and indispensable items with respect to the final recommended items. By replacing dispensable items, it can generate more user sequences with the same performance. The main limitation of these approaches, however, is that they assume an embedding is fixed and never changes, while the embedding should be dynamic and updated after each interaction. Moreover, the agent never knows the ground truth (i.e., the user’s final choice) during online interactions. Hence, it is impractical to leverage the ground truth to determine the embedding difference or indispensable items as existing works have done.

It should be noted that although various causal data augmentation techniques have been successfully implemented for sequential or traditional recommendation problems, none have yet been applied to DRL-based methods. In DRL, the agent learns policies from previously collected trajectories, but the collection process is often a costly bottleneck. Given the potential benefits of causal data augmentation in recommender systems, it is worth exploring its applicability to DRL methods. However, the existing methods cannot be directly transferred to DRL, as they are not designed for the same learning paradigm. Two fundamental reasons for this are: (i) existing methods focus on embeddings, whereas traditional DRL methods do not involve any embedding or similar concept; although some works on RS treat the state representation as an embedding to construct the entire pipeline, this approach requires an individual representation network, which can be computationally inefficient and expensive as extra hyperparameters are introduced; and (ii) existing methods assume embeddings are constant and do not change, whereas user interests in DRL are dynamic and may shift after the system provides recommendations or over time.

Furthermore, most existing DRL-based methods utilize random exploration [1], which may not be well-suited for recommendation tasks. This is due to two reasons: (i) traditional DRL algorithms are intended for countable state spaces with a limited number of potential states, whereas in recommender systems the state space is uncountable [1], and random exploration may fail to reach useful states in certain episodes; and (ii) random exploration can generate a multitude of uninformative trajectories, which are stored in the replay buffer and could impede the training process, since DRL methods heavily rely on the replay buffer to acquire knowledge from previous interactions [16].

In order to address the above issues, we propose a new end-to-end model, namely Intrinsically Motivated Reinforcement Learning with Counterfactual Augmentation (IMRL), which tackles the problem from two aspects: augmenting informative trajectories and introducing a new exploration strategy. We design a novel empowerment-based exploration strategy to encourage the agent to explore potentially informative interaction trajectories in the sparse environment. Moreover, we develop a new counterfactual data augmentation method for DRL RS to augment those newly explored informative trajectories so that they have a higher exposure probability, thus boosting the final performance.

In summary, we make the following contributions in this paper:

  • We propose a novel DRL method, IMRL, which can augment trajectories that are causally valid but never seen by the agent to alleviate the data sparsity problem. Moreover, we also introduce an adaptive threshold to dynamically control the boundary of the informative trajectories as the learning process in DRL is evolutionary.

  • We design an empowerment-driven exploration strategy for IMRL to help explore unexplored but potentially informative interaction trajectories. Our experiments show that the designed exploration strategy can boost the final performance on the online simulation platforms.

  • We have conducted extensive experiments in both offline and online settings and shown the superiority of IMRL. We conducted offline experiments with six well-known datasets and online experiments in three public simulation platforms.

Differences with our previous work [17]. Building upon our previous work [17], we present a deeper investigation of the mechanism underlying our proposed data augmentation method in this study. Specifically, we introduce a novel threshold parameter \(T_{\max }\) to enable adaptive control over the data augmentation process, as we have discovered that informative trajectories depend on the training process. Additionally, we provide a comprehensive analysis of each component of our proposed method. To further validate the efficacy of our approach, we conduct experiments on two additional offline datasets and two online environments.

2 Background

In this section, we will provide some background on the proposed work, which can be divided into two parts: DRL RS problem formulation and causality. Firstly, we will briefly introduce the problem formulation using MDP. Secondly, we will introduce local causal models as an extension of the commonly used structural causal models.

2.1 Problem formulation

Reinforcement learning-based recommender systems learn from interactions through a Markov Decision Process (MDP). Given a recommendation problem consisting of a set of users \(\mathcal {U} = \{u_0,u_1,\cdots ,u_n\}\), a set of items \(\mathcal {I} = \{i_0,i_1,\cdots ,i_m\}\), and users’ demographic information \(\mathcal {D}=\{d_0,d_1,\cdots ,d_n\}\), the MDP can be represented as a tuple \((\mathcal {S},\mathcal {A},\mathcal {P},\mathcal {R},\gamma )\), where each component is defined as follows:

  • \(\mathcal {S}\) denotes the state space, which is the combination of the subsets of \(\mathcal {I}\) and \(\mathcal {D}\). It represents the user’s previous interactions and demographic information. Based on that, it can be written in a compositional form: \(\mathcal {S} = \mathcal {S}^1 \oplus \mathcal {S}^2 \oplus \cdots \oplus \mathcal {S}^n\) for a fixed n, which represents the dynamic count of components [18];

  • \(\mathcal {A}\) is the action space, which represents the agent’s selection during recommendation based on the state space \(\mathcal {S}\). Similarly, it can also be written in a compositional form: \(\mathcal {A} = \mathcal {A}^1 \oplus \mathcal {A}^2 \oplus \cdots \oplus \mathcal {A}^n\);

  • \(\mathcal {P}\) is the set of transition probabilities for state transfer based on the action received, which also refers to users’ behavior probabilities. It is worth mentioning that \(\mathcal {P}\) will not be estimated in this study as we are using a model-free reinforcement learning approach;

  • \(\mathcal {R}\) is a set of rewards received from users, which are used to evaluate the actions taken by the recommendation system, with each reward being a binary value indicating a user’s click;

  • \(\gamma \in [0,1]\) is the discount factor for balancing future and current rewards.

Given a user \(u_0\) and the state \(s_0\) observed by the agent (or the recommendation system), which includes a subset of the item set \(\mathcal {I}\) and the user’s demographic information \(d_0\), a typical recommendation iteration for user \(u_0\) goes as follows: First, the agent takes an action \(a_0\) based on the recommendation policy \(\pi _0\) under the observed initial state \(s_0\) and receives the corresponding reward \(r_0\). Then, the agent generates a new policy \(\pi _1\) based on the received reward \(r_0\) and determines the new state \(s_1\) based on the probability distribution \(p(s_{new}\vert s_0,a_0)\in \mathcal {P}\). The cumulative discounted reward over the interaction is as follows (Figure 1):

$$\begin{aligned} r_c = \sum _{k=0}^{\infty } \gamma ^{k}r_k. \end{aligned}$$
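To make the interaction loop concrete, the following minimal sketch rolls out one recommendation episode and accumulates the discounted return defined above. It assumes a gym-style environment `env` and an `agent` object exposing `act` and `update`; these names are illustrative and not part of the actual implementation.

```python
def run_episode(env, agent, gamma=0.99, max_steps=100):
    """Roll out one recommendation episode and return the discounted return r_c."""
    state = env.reset()                                   # s_0: user's history and demographics
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(state)                         # a_t: recommended item(s)
        next_state, reward, done, _ = env.step(action)    # r_t: e.g. click / no click
        agent.update(state, action, reward, next_state)   # policy improvement step
        discounted_return += discount * reward            # accumulate gamma^t * r_t
        discount *= gamma
        state = next_state
        if done:
            break
    return discounted_return
```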

2.2 Local causal models

Structural Causal Models (SCMs) [19] can be represented as a tuple \(\mathcal {M}_t(V_t,U_t,F)\) based on the state and action composition form at timestamp t. An SCM is normally represented as a directed acyclic graph (DAG) \(\mathcal {G}\) with the following components:

  • \(V_t = \{s_t^1, s_t^2, \cdots , s_t^n, a_t^1,\cdots ,a_t^m, s_{t+1}^1,\cdots ,s_{t+1}^n\}\), which represents the nodes in the DAG \(\mathcal {G}\).

  • \(U_t = \{u^1, \cdots , u^{2n+m}\}\) is a set of noise variables, one for each node in \(V_t\). It is determined by the initial state, past actions, and the environment. We assume that the noise variables are time-independent, which implies that \(U_t = U\) for all timestamps t.

  • F is a set of functions that map \(U_t \times \text {Parentage}(V_t) \rightarrow V_t\), where \(\text {Parentage}(\cdot )\) denotes the parent nodes of its argument.

We assume that the dynamic count of states is n and the dynamic count of actions is m. The state observed at timestamp t is written as \(s_t\), which is the composition of \(\{s_t^1, s_t^2, \cdots , s_t^n\}\). Local causal models are an extension of SCMs that only consider the local causal effect [20]. A local causal model can be represented as \(\mathcal {M}_t^\mathcal {L}(V_t^\mathcal {L},U_t^\mathcal {L},F^\mathcal {L})\), with the DAG \(\mathcal {G}^\mathcal {L}\) derived from the global causal model \(\mathcal {M}_t\) in the subspace \(\mathcal {L}\), having the same components with the following additional constraints:

$$\begin{aligned}&\text {Parentage}(V_t^\mathcal {L}) = \text {Parentage}(V_t\vert (s_t,a_t)\in \mathcal {L}),\end{aligned}$$
(1)
$$\begin{aligned}&\text {Parentage}(U_t^\mathcal {L}) = \text {Parentage}(U_t\vert (s_t,a_t)\in \mathcal {L}). \end{aligned}$$
(2)

Moreover, the local causal model requires the set of edges in \(\mathcal {G}\) to be structurally minimal [21].

3 Methodology

Fig. 1 Illustration of a typical interaction between a user and a DRL agent in DRL RS. The red line represents the user’s information flow, and the grey line represents the recommender’s information flow. The states are sampled from the environment

In this section, we will briefly explain the proposed approach for reinforcement learning-based recommendation, Intrinsically Motivated Reinforcement Learning with Counterfactual Augmentation (IMRL), which can address the sparse interactions problem in DRL RS. We are addressing this problem from two aspects based on the aforementioned challenges: i) using a novel adaptive data augmentation method to generate more potentially informative interaction trajectories by employing counterfactual reasoning; and ii) designing a new exploration strategy by introducing an intrinsic reward signal to encourage the agent to explore. In contrast to our conference version, this study offers a more thorough exploration of the mechanism that underlies our proposed data augmentation method. Our focus is on introducing a novel threshold parameter, denoted as \(T_{\max }\), which enables adaptive control over the data augmentation process. This parameter is crucial because we have found that informative trajectories are dependent on the training process.

Hence, the proposed IMRL consists of two main components: counterfactual data augmentation and intrinsically motivated exploration.

3.1 Counterfactual data augmentation

Formally, consider an arbitrary trajectory \(\tau :(s,a,r,s')\) sampled from the replay buffer, where r is the reward signal received by the agent when action a is executed in state s. When the candidate item set is large, most trajectories are not informative and result in zero rewards. As a result, it is challenging to sample informative trajectories since the number of non-informative trajectories is significantly larger. A straightforward solution is to augment the informative trajectories so as to increase the likelihood of sampling them. We assume that the state \(s_{t+1}\) satisfies the SCM:

$$\begin{aligned} s_{t+1} = f(s_t,a_t,U_{t+1}), \end{aligned}$$
(3)

where \(f(\cdot )\) represents the causal mechanism, \(a_t\) is the action taken at timestamp t, and \(U_{t+1}\) is the noise term that is independent of \((s_t,a_t)\). Our main objective is to estimate the causal mechanism \(f(\cdot )\) and generate more data that is unseen but causally valid. However, estimating the global \(f(\cdot )\) is challenging [22]. To overcome this challenge, we draw inspiration from recent advances in local causal models [20, 23] and focus on estimating the local causal mechanism \(f_l(\cdot )\). The local causal model assumes the existence of a local directed acyclic graph (DAG) in a subspace \(\mathcal {L} \subseteq \mathcal {S}\times \mathcal {A}\) that satisfies the following condition:

$$\begin{aligned} s_{t+1}^j \perp \!\!\! \perp V_t^i \mid (s_t,a_t)\in \mathcal {L}, \end{aligned}$$
(4)

where \(\perp \!\!\! \perp \) is used to represent independence. In recommender systems, there is a large subspace of states in which users’ previous interests will not affect the final recommendation, as users’ interests are dynamic. By focusing on the subspace \(\mathcal {L}\), we can formulate a local causal model such that the local DAG contains no edge from \(V_t^i\) to \(s_{t+1}^j\). This implies that the local DAG \(\mathcal {G}^\mathcal {L}\) is strictly sparser than the global DAG \(\mathcal {G}\). With this property, we can use the local causal model to conduct data augmentation to alleviate the sparsity problem in DRL RS.

Consider the counterfactual question “What if user u had been interested in item j instead of item i at timestamp t?” This question can be expressed in causal form as “What if component \(s_t^i\) had the value x instead of y at timestamp t?” It can be answered by applying Pearl’s do-calculus to the causal model \(\mathcal {M}\) to obtain a sub-model,

$$\begin{aligned} \mathcal {M}_{\text {do}(s_t^i=x)}^\mathcal {L} = (V,U,F_x) \text { where } F_x = F \setminus \{f^i\} \cup \{s_t^i = x\}. \end{aligned}$$
(5)

Moreover, the incoming edges to \(s_t^i\) will be removed from \(\mathcal {G}_{\text {do}(s_t^i=x)}\). Now, we utilize the local causal model to generate data that is unseen by the agent but causally valid. In order to achieve this, we augment the data based on the counterfactual modification of a subset of causal factors at timestamp t, keeping the remaining factors unchanged. Such an augmentation process can use the counterfactual model \(\mathcal {M}_{\text {do}(s_t^i=x)}^\mathcal {L}\) to modify the causal factors \(s_t^{i\cdots j}\) and regenerate the corresponding children in the DAG.

However, such a process is computationally expensive in our recommendation scenario, as it requires re-sampling the children in the DAG. Inspired by the idea of collaborative filtering and the state composition form mentioned previously, we can simplify the process by omitting the sampling step. Specifically, the core of the augmentation is to estimate \(\mathcal {M}_{\text {do}(s_t^i=x)}^\mathcal {L}\), which can be obtained easily by assuming that similar users will have similar interests, an idea inherent in collaborative filtering. Under this assumption, we can obtain \(\mathcal {M}_{\text {do}(s_t^i=x)}^\mathcal {L}\) by replacing the causally independent components of \(s_t\) using the local causal model. For example, we can identify those interaction histories that do not affect the current recommendation and are thus causally independent of the current action \(a_t\). The overall algorithm can be found in Algorithm 1.

Algorithm 1 Counterfactual Data Augmentation
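Since Algorithm 1 is provided as a figure, the following is a minimal sketch of the augmentation step under the assumptions stated above: similar users share similar interests, and a predicate identifying locally independent state components is available. The helpers `locally_independent_indices` and `find_similar_state` are hypothetical placeholders rather than functions from the actual implementation.

```python
import copy

def counterfactual_augment(replay_buffer, locally_independent_indices,
                           find_similar_state, threshold):
    """Sketch of counterfactual augmentation over the replay buffer.

    `locally_independent_indices(s, a)` is assumed to return the indices of state
    components with no edge to the next state in the local DAG, and
    `find_similar_state(s)` to return a similar user's state (the
    collaborative-filtering assumption). Both are hypothetical helpers.
    """
    augmented = []
    for (s, a, r, s_next) in list(replay_buffer):
        if r < threshold:                        # only informative trajectories are augmented
            continue
        donor = find_similar_state(s)            # state of a similar user
        cf_state = copy.deepcopy(s)
        for i in locally_independent_indices(s, a):
            cf_state[i] = donor[i]               # do(s_t^i = x): swap independent components
        # The swapped components have no edge to s_{t+1} in the local DAG,
        # so the transition (cf_state, a, r, s_next) remains causally valid.
        augmented.append((cf_state, a, r, s_next))
    replay_buffer.extend(augmented)
    return len(augmented)
```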

3.2 Intrinsically motivated exploration

The second aspect we use to address sparsity is intrinsically motivated exploration strategies. We propose to use empowerment to represent intrinsic motivation, which can boost the agent’s exploration capability, allowing it to reach more states and produce corresponding potentially informative interaction trajectories.

Empowerment is an information-theoretic method in which an agent executes a sequence of k actions \({\textbf {a}}^k \in \mathcal {A}\) while in state \(s \in \mathcal {S}\), according to an exploration policy \(\pi _{\textit{empower}}(s,{\textbf {a}}^k)\) (which we will shorten to \(\pi _e(s,{\textbf {a}}^k)\)). This exploration policy is a conditional probability distribution: \(\pi _e:\mathcal {S}\times \mathcal {A}\rightarrow [0,1]\). The agent’s goal is to identify an optimal policy \(\pi _e\) that maximizes the mutual information \(I[{\textbf {a}}^k, s'\vert s]\) between the action sequence \({\textbf {a}}^k\) and the state \(s'\) to which the environment transitions after executing the sequence \({\textbf {a}}^k\) in the current state s. This can be formulated as follows:

$$\begin{aligned} \overline{\mathbb {E}}(s)&= \max _{\pi _e} I[{\textbf {a}}^k, s'\vert s] \end{aligned}$$
(6)
$$\begin{aligned}&= \max _{\pi _e}\mathbb {E}_{\pi _e(s,{\textbf {a}}^k)\mathcal {P}(s,{\textbf {a}}^k,s')}\log \left [\frac{p({\textbf {a}}^k,s'\vert s)}{\pi _e({\textbf {a}}^k,s)}\right]. \end{aligned}$$
(7)

The aim is to maximize the expectation of the logarithmic ratio between the joint probability distribution of the action sequence \({\textbf {a}}^k\), the state \(s'\), and the state transition probability distribution \(\mathcal {P}(s,{\textbf {a}}^k,s')\), and the exploration policy \(\pi _e({\textbf {a}}^k,s)\). By maximizing this quantity, the agent can boost its exploration capability and reach more states, which can in turn produce potentially informative interaction trajectories.

Here, \(\overline{\mathbb {E}}(s)\) refers to the optimal empowerment value, and \(\mathcal {P}(s,{\textbf {a}}^k,s')\) refers to the probability of transitioning to \(s'\) after executing the action sequence \({\textbf {a}}^k\) in state s, where \(\mathcal {P}:\mathcal {S}\times \mathcal {A}\times \mathcal {S} \rightarrow [0,1]\). Importantly,

$$\begin{aligned} p({\textbf {a}}^k,s'\vert s) = \frac{\mathcal {P}(s,{\textbf {a}}^k,s')\pi _e({\textbf {a}}^k,s)}{\sum _{{\textbf {a}}^{k'}} \mathcal {P}(s,{\textbf {a}}^{k'},s')\pi _e({\textbf {a}}^{k'},s)} \end{aligned}$$
(8)

is the inverse dynamics model of \(\pi _e\). The optimal empowerment values are obtained by the policy \(\pi ^*\) that maximizes \(\mathbb {E}^{\pi ^*}(s)\).
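For intuition, in a discrete toy setting (8) is simply Bayes’ rule; the sketch below computes it from a tabular transition model and exploration policy. This is illustrative only, as the state-action spaces in our setting are high-dimensional and this quantity must instead be approximated.

```python
import numpy as np

def inverse_dynamics(P, pi_e, s, s_next):
    """p(a | s', s) for a discrete toy setting, following (8).

    P    : array [S, A, S] of transition probabilities P(s' | s, a)
    pi_e : array [S, A]    of exploration-policy probabilities pi_e(a | s)
    Returns an array of length A.
    """
    joint = P[s, :, s_next] * pi_e[s, :]   # P(s'|s,a) * pi_e(a|s) for each action a
    return joint / joint.sum()             # normalise over actions

# Toy usage with 2 states and 2 actions
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
pi_e = np.array([[0.5, 0.5], [0.5, 0.5]])
print(inverse_dynamics(P, pi_e, s=0, s_next=1))   # -> [0.111..., 0.888...]
```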

However, the above definition of empowerment is more general than the RL setting since it considers a k-step policy, while RL usually considers a single-step policy. Moreover, estimating k-step empowerment is challenging. Therefore, in this study, we use \(k=1\) to narrow empowerment down to the RL setting, which only considers one step ahead. The Blahut-Arimoto algorithm [24, 25] shows that empowerment can be solved in low-dimensional discrete settings. Additionally, [26] uses parametric function approximators to estimate empowerment in high-dimensional and continuous state-action spaces. This provides theoretical grounding for using empowerment in recommender systems, since their state-action spaces are high-dimensional [1]. There are two possibilities for utilizing empowerment in RL:

  • Find high mutual information between actions and the subsequent state achieved by that action.

  • Train a behavioral policy to take an action in each state such that the expected empowerment value of the next state is highest.

Both approaches are feasible in the standard reinforcement learning setting and encourage the agent to take actions that lead to the maximum number of reachable future states. However, there is a conceptual difference between them. The second approach seeks states with a large number of reachable next states [27, 28], while the first approach aims to find high mutual information between actions and subsequent states, which is not necessarily the same as seeking highly empowered states [26]. The first approach can be achieved by transforming the representations of the state and its subsequent states into a KL divergence and minimizing it [29]. However, this transformation introduces extra complexity and information loss, which may affect performance. The second approach, which uses the behavioral policy to explore highly empowered states, is simpler and more suitable for our setting. The main reason is that we are using a model-free approach to solve the problem. Model-free RL methods maintain two policies: the target policy \(\pi \) and the behavior policy \(\pi _e\). The second approach is more suitable for model-free approaches as it does not require the extra computational cost of traversing all subsequent states and calculating the KL divergence, and it can be easily adopted into existing RL frameworks.

Hence, the goal of the MDP process with the empowerment can be rewritten as,

$$\begin{aligned} \max _{\pi _b} \mathbb {E}_{\pi _b,\mathcal {P}} \left [\sum _{t=0}^\infty \gamma ^t \left (\alpha \cdot R(s_t,a_t) + \beta \cdot \frac{p(a_t\vert s_{t+1},s_t)}{\pi _b(a_t,s_t)}\right )\right ] \end{aligned}$$
(9)

where \(\pi _b\) is the behavior policy, and \(\alpha \) and \(\beta \) are constants used to balance the weight of the instant reward and the empowerment. We include the empowerment term as an additional component of the reward signal \(R(s_t,a_t)\) in order to encourage exploration by the agent.
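A minimal sketch of the shaped reward in (9) is given below, assuming the inverse dynamics model and the behavior policy expose log-probabilities of the executed action; the tensor-based interface is illustrative, not part of the actual implementation.

```python
import torch

def augmented_reward(r_ext, log_p_inverse, log_pi_b, alpha=1.0, beta=0.1):
    """Reward shaped as in (9): alpha * R(s_t, a_t) + beta * empowerment term.

    `log_p_inverse` is log p(a_t | s_{t+1}, s_t) from the inverse dynamics model,
    `log_pi_b` is log pi_b(a_t | s_t) from the behaviour policy, and alpha, beta
    are the balancing constants. All inputs are assumed to be tensors of shape [batch].
    """
    # The ratio p(a_t | s_{t+1}, s_t) / pi_b(a_t | s_t) is computed in log space
    # for numerical stability and then exponentiated.
    empowerment_term = torch.exp(log_p_inverse - log_pi_b)
    return alpha * r_ext + beta * empowerment_term
```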

3.3 Training procedure

From an information-theoretic perspective, optimizing for empowerment is equivalent to optimizing the inverse dynamics [28, 30] based on the distribution \(\pi _e(s,a)\). Therefore, we introduce the inverse dynamics into the objective function to calculate the empowerment. Our method is built on Soft Actor-Critic (SAC) [31] with temperature tuning and deterministic policy. The overall training algorithm can be found in Algorithm 2.

We follow the same training strategy as the standard SAC algorithm. However, since the empowerment is introduced, we modify the objective function to ensure that the empowerment term can be optimized. We use several function approximators to learn different components in the proposed method. The value function V is parameterized by \(\psi \), the Q-function is parameterized by \(\theta \), the target policy is parameterized by \(\phi \), and the inverse dynamics is parameterized by \(\xi \). Since we are using an off-policy algorithm where the transition probability is not learned, we use \(\mathcal {P}\) to represent the state transition probability in the environment. The soft Q-function can be trained by minimizing the following objective function:

$$\begin{aligned} J_Q(\theta ) = \mathbb {E}_{(s_t,a_t)\sim \mathcal {D}}\big [\big (Q_\theta (s_t,a_t) - (r(s_t,a_t) + \gamma V_{\psi }(s_{t+1}))\big )^2\big ]. \end{aligned}$$
(10)

The target function \(V_\psi \) can be optimized by minimizing:

$$\begin{aligned} J_V(\psi ) = \mathbb {E}_{s_t\sim \mathcal {D}} \Big [\big (V_\psi (s_t) -\mathbb {E}_{a_t\sim \pi _\phi }\big [Q_\theta (s_t,a_t)+\underbrace{\beta g(s_t,a_t)}_{\text {policy}}\big ]\big )^2\Big ], \end{aligned}$$
(11)

where \(\beta \) is a constant used to balance the empowerment. Different from the original SAC algorithm, we replace the policy term \(-\log \pi _\phi (s_t,a_t)\) with \(g(s_t,a_t)\) to incorporate the empowerment, where \(g(s_t,a_t)\) is defined as:

$$\begin{aligned} g(s_t,a_t) = \mathbb {E}_{\mathcal {P}(s'\vert s_t,a_t)}\big [\log p_\xi (a_t\vert s',s_t) -\log \pi _\phi (s_t,a_t)\big ]. \end{aligned}$$
(12)

Note that, different from \(s_t\), \(s'\) represents all the possible subsequent states when \(a_t\) is executed in state \(s_t\) at timestamp t.

Algorithm 2 Overall training algorithm

Similarly, the optimization of the policy \(\pi (\phi )\) can be written as:

$$\begin{aligned} J_\pi (\phi ) = -\mathbb {E}_{s_t\sim \mathcal {D}}\Big [\mathbb {E}_{a_t\sim \pi _\phi }\big [\beta g(s_t,a_t)+Q_\theta (s_t,a_t)\big ]\Big ], \end{aligned}$$
(13)

where we apply the same substitution. The inverse dynamics model \(p_\xi \) will be updated based on:

$$\begin{aligned} J_p(\xi ) = -\mathbb {E}_{\pi _\phi }\big [\log {p_\xi }(a_t\vert s',s_t)\big ]. \end{aligned}$$
(14)

Lastly, the temperature parameter will be adjusted automatically by using the following entropy method [32]:

$$\begin{aligned} J(\alpha ) = -\mathbb {E}_{a_t\sim \pi _\phi }\big [\alpha \log \pi _\phi (s_t,a_t) + \alpha \mathcal {H}\big ]. \end{aligned}$$
(15)

Note that SAC uses an exponentially averaged value network \(\psi '\) to stabilize the training process [33]. The update rule can be written as: \(\psi ' \leftarrow \lambda _{\psi '} \psi + (1-\lambda _{\psi '})\psi '\).
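The following condensed sketch shows how the objectives (10)-(14) and the exponential averaging step can be assembled in PyTorch. The network interfaces (`q_net`, `value_net`, `target_value_net`, `policy.sample`, `inverse_dyn.log_prob`) are assumed for illustration and do not correspond to the released code; the temperature objective (15) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def compute_losses(batch, q_net, value_net, target_value_net, policy,
                   inverse_dyn, gamma=0.99, beta=0.1):
    """Return the objectives (10), (11), (13) and (14) for one batch.

    Assumed (illustrative) interfaces:
      q_net(s, a)                        -> Q_theta(s, a)
      value_net(s) / target_value_net(s) -> V_psi(s) / V_psi'(s)
      policy.sample(s)                   -> (a, log pi_phi(a | s))
      inverse_dyn.log_prob(a, s_next, s) -> log p_xi(a | s', s)
    Each loss should be back-propagated into its own network only
    (detaching or zeroing the other networks' gradients before stepping).
    """
    s, a, r, s_next = batch

    # (10): soft Q-function regression towards r + gamma * V_psi'(s_{t+1}).
    q_target = (r + gamma * target_value_net(s_next)).detach()
    q_loss = F.mse_loss(q_net(s, a), q_target)

    # (12): g(s_t, a_t), estimated with the observed next state s_{t+1}.
    a_pi, log_pi = policy.sample(s)
    g = inverse_dyn.log_prob(a_pi, s_next, s) - log_pi

    # (11): the value network regresses towards Q + beta * g under the policy.
    v_target = (q_net(s, a_pi) + beta * g).detach()
    v_loss = F.mse_loss(value_net(s), v_target)

    # (13): the policy maximises Q + beta * g (minimise the negative).
    policy_loss = -(q_net(s, a_pi) + beta * g).mean()

    # (14): inverse dynamics maximises the log-likelihood of the taken action.
    inv_loss = -inverse_dyn.log_prob(a, s_next, s).mean()

    return q_loss, v_loss, policy_loss, inv_loss


def soft_update(value_net, target_value_net, lam=0.005):
    """psi' <- lam * psi + (1 - lam) * psi' (exponential moving average)."""
    with torch.no_grad():
        for p, p_targ in zip(value_net.parameters(), target_value_net.parameters()):
            p_targ.mul_(1 - lam).add_(lam * p)
```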

Moreover, we perform data augmentation on the replay buffer after each interaction to generate causally valid, unseen trajectories. This augmentation can provide more trajectories at the early stage to increase the number of samples. Specifically, most of the model parameters are learned by sampling from the replay buffer \(\mathcal {D}\). The training process can be described as searching for states or state-action pairs in \(\mathcal {D}\) to update the target policy such that the received reward is maximized. As the augmentation introduces more samples into the replay buffer, the gradient update process has a higher chance of achieving a better policy.

It is worth mentioning that we only augment informative trajectories. However, the definition of an informative trajectory highly depends on the learning progress. We believe that every trajectory with a non-zero reward is informative in the early stages but harmful when the final target policy is close to optimal. Hence, we selectively conduct augmentation on the replay buffer so that zero-reward trajectories, which would only increase sparsity, are not augmented. However, as the interaction progresses, the way we determine informative trajectories changes. Some trajectories are informative in the early stages as the agent needs to explore all possibilities. In the later stages, the agent will pursue higher-rewarding trajectories, making low-rewarding trajectories less useful. In such situations, we design an adaptive threshold to evaluate whether a trajectory is worth augmenting. The adaptive threshold is intuitive and can be represented as:

$$\begin{aligned} T = \sigma /\lambda _d \text { if } \sigma /\lambda _d \le T_{\max } \text { else } T_{\max } \end{aligned}$$
(16)

\(\sigma \) is a custom constant used to determine the initial value of the threshold, and the decay rate \(\lambda _d \in (0,1]\) decreases as the number of episodes increases. By setting \((\sigma , \lambda _d)\) to appropriate values, we can achieve a monotonically increasing threshold. Ideally, we start with the initial values \(\sigma = 1\) and \(\lambda _d = 1.1\). \(T_{max}\) is a constant specific to the environment that represents the maximum reward that the agent can achieve at each step.
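A minimal sketch of how the adaptive threshold in (16) gates augmentation is shown below; the decay schedule for \(\lambda _d\) is a placeholder, since the actual schedule varies with the environment.

```python
def adaptive_threshold(sigma, lambda_d, t_max):
    """Adaptive augmentation threshold from (16): T = sigma / lambda_d, capped at T_max."""
    return min(sigma / lambda_d, t_max)

def should_augment(trajectory_reward, sigma, lambda_d, t_max):
    """A trajectory is only worth augmenting if its reward reaches the current threshold."""
    return trajectory_reward >= adaptive_threshold(sigma, lambda_d, t_max)

# Illustrative usage: as lambda_d decays over episodes, the threshold rises towards
# T_max, so progressively fewer low-reward trajectories qualify for augmentation.
sigma, lambda_d, t_max = 1.0, 1.1, 10.0
for episode in range(3):
    print(episode, adaptive_threshold(sigma, lambda_d, t_max))
    lambda_d *= 0.5   # placeholder decay; the actual schedule depends on the environment
```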

4 Experiments

In this section, we conduct experiments to answer three main research questions:

  • RQ1: Does IMRL outperform existing DRL approaches in both offline and online settings?

  • RQ2: Can IMRL help to alleviate the sparsity of interactions problem in DRL RS in online simulation environments?

  • RQ3: How does each component contribute to the final performance in online simulation environments?

In contrast to our conference version, we offer a thorough examination of each aspect of our proposed technique. In order to substantiate the effectiveness of our approach, we perform experiments on two more offline datasets and two online environments.

4.1 Experiment setup

In order to demonstrate the superiority of IMRL, we conducted experiments in both offline and online simulation settings.

4.1.1 Offline datasets

We use six publicly available datasets:

  • MovieLens-20M is a dataset about users’ movie-watching behavior.

  • LibraryThing is a dataset containing book review information.

  • Book-Crossing is a dataset related to book preferences.

  • Netflix Prize is a dataset from the Netflix Prize competition for recommendation.

  • Amazon-CD is an e-commerce dataset that contains users’ purchase behavior.

  • GoodReads is a book dataset.

The statistics of those datasets are summarized in Table 1.

Table 1 Statistics of the datasets used in our offline experiments

Moreover, due to the unique interaction logic in reinforcement learning-based methods, an additional data preparation process is required to ensure that the agent can interact with offline datasets. We adopt the same strategy as in previous work [34] to convert those datasets into reinforcement learning environments so that IMRL can interact with them.

4.1.2 Baselines and offline evaluation metrics

We selected the following baselines, which include both non-reinforcement learning based methods and reinforcement learning based methods:

  • SASRec [35], a well-known baseline for sequential recommendation methods that utilize the self-attention mechanism.

  • CASR [14], a counterfactual data augmentation method for sequential recommendation. As CASR only conducts the augmentation, we selected STAMP [36] to make recommendations, which is described in CASR.

  • CauseRec [15], a counterfactual sequence generation method for sequential recommendation.

  • CoCoRec [37], a category-aware collaborative method for sequential recommendation.

  • CGKR [38], a counterfactual generation method for alleviating spurious correlations.

  • DEERS [36], a reinforcement learning-based recommendation method that considers both positive and negative feedback.

  • KGRL [6], a reinforcement learning-based method that utilizes the capability of GCN to process the knowledge graph information.

  • TPGR [7], a model that uses reinforcement learning and binary trees for the large-scale interactive recommendation.

  • PGPR [39], a knowledge-aware model that employs reinforcement learning for explainable recommendation.

It is worth mentioning that, because of the different training paradigms of these two kinds of methods (i.e., supervised learning and reinforcement learning), we cannot guarantee that the comparison with existing non-reinforcement learning-based state-of-the-art methods is strictly fair. We ran the supervised learning-based methods and the reinforcement learning-based methods under the same setting.

We used the same training and hyperparameter settings as in [15] for the non-DRL-based methods. We used Adam as the main optimizer, with an embedding size of 32 and a batch size of 1024. For IMRL, we used 100,000 episodes and a batch size of 1024. For DRL-based methods, we trained all models with 100,000 episodes on VirtualTB, but only 1,000 episodes on RecoGym and RecSim. All remaining hyperparameters were set to the default values reported in the original papers. For the proposed model, \(T_{max}\) was set to 10 for VirtualTB, and 1 for RecoGym and RecSim. The learning rate was set to 0.001. Recall, Precision, and nDCG are selected as the evaluation metrics and are reported for top-20 recommendation. It should be noted that IMRL achieves lower precision scores than the baselines on the LibraryThing and Book-Crossing datasets. The slight decrease of 0.25% in precision on LibraryThing is not surprising and could be attributed to the randomness of the model, which is still acceptable. However, on Book-Crossing, IMRL shows a significant drop in precision compared to CauseRec. This could be due to the high sparsity of the Book-Crossing dataset, which may require more episodes to collect trajectories or a stronger exploration strategy; we leave this as a future direction. It is also important to consider that supervised learning and reinforcement learning follow different learning paradigms, making it difficult to match the number of training episodes required.

4.1.3 Online simulation

Unlike offline datasets, the online simulation platforms are based on OpenAI Gym, a standard toolkit for reinforcement learning research. We conducted online experiments on three widely used public simulation platforms that mimic online recommendation in real-world applications: VirtualTB [40], RecSim [41], and RecoGym [42].

VirtualTB

is a real-time simulation platform for recommendation, in which the agent recommends items based on users’ dynamic interests. It uses a pre-trained generative adversarial imitation learning (GAIL) model to generate different users who have both static and dynamic interests. The interactions between users and items are also generated by the GAIL model. This allows VirtualTB to provide a large number of users and corresponding interactions to simulate real-world scenarios. After initialization, VirtualTB generates different users each time, and the dynamic interests of each user change after each interaction.

RecSim

is a platform for creating configurable simulation environments that support sequential interactions between users and recommender systems. Unlike VirtualTB, RecSim has fewer users and items but offers a range of simpler tasks. For our experiments, we chose to use the "interest evolution" task, which encourages the agent to explore and satisfy the user’s interests without further exploitation.

RecoGym

is a smaller platform where users do not have long-term goals. Unlike RecSim and VirtualTB, RecoGym is designed for computational advertising. Similar to RecSim, RecoGym uses clicks or non-clicks to represent the reward signal. Additionally, unlike VirtualTB, users in these two environments do not have any dynamic interests.

Table 2 The overall results of our model comparison with several state-of-the-art models in different datasets

4.1.4 Baselines for online simulation

In our online simulation experiments, all the baselines are based on reinforcement learning. Therefore, non-reinforcement learning methods are excluded as they cannot interact with gym-based environments. It is worth mentioning that some methods require additional side information from the environment that is not present in these three platforms. Therefore, we had to remove those components to ensure a fair comparison, where every method received the same state representation. The primary evaluation metric used for online simulation is Click-Through-Rate (CTR), which is determined by the platform.

IMRL is implemented using PyTorch [43]. All experiments are conducted on a server with two Intel Xeon E5-2697 v2 CPUs, 4 NVIDIA TITAN X Pascal GPUs, 2 NVIDIA TITAN RTX GPUs, 2 NVIDIA RTX A5000 GPUs, and 768 GB of memory. We provide details about the model parameters for reproducibility. The number of hidden units for both the actor and critic networks is set to 256. The learning rate, discount factor, and replay buffer size are set to 0.0003, 0.99, and 1e6, respectively. For VirtualTB, the number of training episodes is set to 1e6, and testing is conducted every 10 episodes. For RecoGym and RecSim, the number of training episodes is set to 10,000, and testing is conducted every 10 episodes.

Fig. 2 Overall results for online simulation environments

4.2 Offline experiments (RQ1)

The complete results can be found in Table 2. Our method IMRL generally outperforms all existing state-of-the-art methods, including both non-reinforcement learning-based and reinforcement learning-based methods. It should be noted that IMRL does not outperform CauseRec on two datasets, but it still performs better than all other methods. Although IMRL has lower precision than CauseRec on Book-Crossing, its recall and nDCG are better than CauseRec’s. A similar situation occurs on Netflix, where the nDCG of IMRL is lower than CauseRec’s, but its precision and recall are better.

4.3 Online experiments (RQ2)

We also report the performance of the selected reinforcement learning-based baselines in three online simulation environments. The results can be found in Figure 2. As we can see, IMRL outperforms all the baselines on all three selected simulation platforms. The performance on RecoGym and RecSim is quite close, as those two environments are very small and do not require a complex exploration policy. Hence, the following discussion focuses on VirtualTB, as it is a more complex environment that is closer to the real-world situation.

The simplest way to evaluate the impact of sparsity is to measure how quickly the model converges. In reinforcement learning, we measure sparsity by the number of useful samples that are fed to the agent via the replay buffer or sampled from the environment; denser environments allow the model to converge at an earlier stage. In Figure 2a, we can see that IMRL converges noticeably faster than the other methods on VirtualTB, which shows that it can overcome the sparse environment. On RecoGym and RecSim, IMRL also demonstrates considerable improvement compared with the baselines. The main reason is that RecoGym and RecSim are small environments that contain only a few items and users, so the sparsity is not severe and can be handled by random exploration.

Fig. 3 Ablation study on VirtualTB

4.4 Ablation study (RQ3)

To answer RQ3, we conducted experiments with the two major components of IMRL: empowerment and augmentation. The results of this study can be found in Figure 3, where IMRL-E denotes IMRL without empowerment, and IMRL-A denotes IMRL without augmentation. Additionally, we investigated the effect of the different strategies of empowerment in IMRL, including the KL-divergence approach mentioned in Section 3.2. We use IMRL-KL to represent this method.

We observed that both components play an important role in IMRL and contribute jointly to its final performance. Furthermore, we noticed that IMRL-KL did not perform as well as the other methods. One possible reason for this is that information is lost during the transformation in the calculation of KL-divergence. Hence, we can infer that our approach of using empowerment is better than KL-Divergence. In the next part, we will investigate the effect of the adaptive threshold.

We observed that removing empowerment-based exploration resulted in a drop in the model’s performance. As we mentioned earlier, empowerment-based exploration has a higher probability of reaching or producing informative states, which can enhance the model’s performance. This indicates that random exploration has limitations and may harm the model’s performance in recommendation tasks. However, we did not observe a significant improvement with the augmentation component, only a slight one. One possible reason for this is that VirtualTB is a simulation platform that can simulate real-world situations, but the number of states is limited due to computational resource constraints. A larger simulator may reveal a more pronounced difference in performance between augmented and non-augmented models. However, further study is required, and it is not the goal of this paper.

4.5 Impact of the adaptive augmentation

An important difference between our previous work [17] and this study is the role played by the adaptive augmentation threshold in early-stage discovery and reducing the number of uninformative trajectories. In this section, we investigate how the threshold affects early-stage performance. We start with \(\sigma =10\) and \(\lambda _d=1.1\). Note that the decay function of \(\lambda _d\) varies depending on the environment. We use VirtualTB as the primary evaluation platform and the following decay function:

$$\begin{aligned} \lambda _d \leftarrow \lambda _d - \Big \lceil \frac{ \# \ \text{of episodes}}{100,000} \Big \rceil , \end{aligned}$$

with \(T_{max} = 10\). We report the CTR at different stages of IMRL with the adaptive threshold compared to IMRL without the adaptive threshold (referred to as IMRL-T for short) in Table 3. We repeated the experiments five times with five different random seeds and report the average value. We found that with the adaptive threshold, IMRL reaches peak performance at around 70,000 episodes, whereas without the adaptive augmentation threshold, it takes until 90,000 episodes. However, when the number of episodes reaches 100,000, the performance of both methods is similar, which supports our claim that adaptive augmentation can improve the early-stage performance of the model.

Table 3 The effect of the adaptive threshold in VirtualTB

5 Related work

In this section, we will briefly review two topics related to our work: reinforcement learning-based recommendation and causality in recommender systems.

Reinforcement learning-based recommendation

Reinforcement learning (RL) has been used in recommender systems (RS) to provide personalized recommendations. Zheng et al. [2] introduced deep RL into RS using the Deep Q-Network (DQN) to recommend news articles, where Double DQN was used to build a user’s profile and an activeness score was designed to evaluate whether the user is active. Zhao et al. [36] extended this method by introducing negative feedback. Chen et al. [34] used cascading DQN and a generative user model to handle unknown reward situations. Chen et al. [44] introduced a scalable policy-gradient-based method for recommendation using a policy correction gradient estimator to reduce variance, and [4] designed a Pairwise Policy Gradient method to reduce variance. Chen et al. [7] proposed a tree-based method for large-scale interactive recommendation using the actor-critic algorithm. [6] integrated the knowledge graph into the actor-critic structure and used graph convolutional networks to capture information. Xian et al. [39] designed a knowledge graph-based environment for explainable recommendation. Chen et al. [9] focused on reward function design and used inverse RL to avoid elaborate reward functions in online recommendation.

Causality in recommender systems

Causality has become a popular research topic in recent literature on recommender systems due to its wide usage in debiasing and data augmentation for RS. For example, [45] employed model-agnostic counterfactual reasoning to address popularity bias in RS, while [46] proposed a causal intervention approach. Conversely, [15] separated users’ historical actions into dispensable and indispensable items and generated new user sequences by replacing dispensable items. Causality has shown a strong connection with RL in recent years, as both can affect the input’s status [47]. Zhu et al. [48] employed actor-critic algorithms to discover different Directed Acyclic Graph (DAG) structures for causal discovery. Dasgupta et al. [49] proposed a meta-RL framework to conduct causal reasoning by exploring different causal structures. Causal inference has also been used to determine unobserved confounders to improve the performance of imitation learning [50], and [51] utilized causal inference to build an explainable RL model. Moreover, recent works [52, 53] focus on using causality to enhance interpretability and debiasing.

Our contributions

We propose a novel end-to-end model called Intrinsically Motivated Reinforcement Learning with Counterfactual Augmentation (IMRL), which focuses on two key aspects: augmenting informative trajectories and introducing a new exploration strategy. Our approach incorporates an empowerment-based exploration strategy that motivates the agent to explore informative interaction trajectories in a sparse environment. Additionally, we introduce a new counterfactual data augmentation technique for deep reinforcement learning-based recommender systems (DRL RS), which amplifies the exposure probability of these newly discovered informative trajectories, thereby improving the overall performance of the model.

6 Conclusion

In this paper, we propose IMRL to address the sparse interaction problem in DRL-based RS from two perspectives: quantity and quality. We propose a counterfactual-based method to augment informative interaction trajectories and an empowerment-based exploration to boost the possibility of finding high-quality trajectories. We conducted experiments on both offline datasets and online simulation platforms to demonstrate the superiority of the proposed method.

In the future, we plan to explore the potential of empowerment and develop novel solutions to address the sparse interaction problem in DRL-based RS. Additionally, one of the limitations of IMRL is the sub-optimality problem. Although the proposed adaptive method can enhance performance in the initial stage, it may result in sub-optimal trajectories being augmented. To address this, we aim to design a more refined adaptive strategy, rather than a conventional one, to further improve the performance of the proposed model in the future.