A Survey on Recent Advances and Challenges in Reinforcement Learning Methods for Task-Oriented Dialogue Policy Learning

Dialogue Policy Learning is a key component in a task-oriented dialogue system (TDS) that decides the next action of the system given the dialogue state at each turn. Reinforcement Learning (RL) is commonly chosen to learn the dialogue policy, regarding the user as the environment and the system as the agent. Many benchmark datasets and algorithms have been created to facilitate the development and evaluation of dialogue policy based on RL. In this paper, we survey recent advances and challenges in dialogue policy from the perspective of RL. More specifically, we identify the major problems and summarize corresponding solutions for RL-based dialogue policy learning. In addition, we provide a comprehensive survey of applying RL to dialogue policy learning by categorizing recent methods according to the basic elements of RL. We believe this survey can shed light on future research in dialogue management.


Introduction
TDS aims to assist users in accomplishing tasks ranging from weather inquiry to schedule planning (Chen et al., 2017a). The architecture of a TDS falls into one of two classes. The first is an end-to-end approach that directly maps the user's utterance to the system's natural language response (Lewis et al., 2017; Eric and Manning, 2017; Chi et al., 2017; Wang et al., 2020b). These works often adopt a sequence-to-sequence model and train it in a supervised manner. The second is a pipeline approach that separates the system into four interdependent components: natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning (DPL) and natural language generation (NLG), as shown in Figure 1 (Liu and Lane, 2018; Chen et al., 2019a; Wu et al., 2019; Li et al., 2020a).
* These authors contributed equally to this work.
Under this pipeline approach, the NLU module first recognizes the intents and slots from the input sentence given by a human. Then, the DST module represents them as an internal dialogue state. Next, the DPL module chooses an action to satisfy the user. Finally, the NLG module transforms the action into natural language. The end-to-end approach is more flexible and places fewer requirements on data annotation formats. However, it requires a large amount of data, and its black-box structure offers no interpretability or control (Gao et al., 2019). The pipeline approach is more interpretable and easier to implement. Although the whole system is harder to optimize globally, it is preferred by most commercial dialogue systems (Zhang et al., 2020b). In the pipeline approach, DPL plays a key role in TDS as the juncture between the DST and NLG components.
In recent years, we have witnessed the growing application of RL to DPL. Levin et al. (1997) is the first work that treats DPL as a Markov decision process (MDP) problem. It outlines the complexities of modelling DPL as an MDP problem and justifies the use of RL algorithms to optimize it. Thereafter, a few works extended the RL approach and identified the challenge of approximating the dialogue state (Walker, 2000; Singh et al., 2000, 2002).
At the other end of the spectrum, several researchers explored using supervised learning (SL) techniques in DPL (Gandhe and Traum, 2007; Henderson et al., 2008; DeVault et al., 2011; Vinyals and Le, 2015; Shang et al., 2015). The main idea was to train the model to output the next system action given the dialogue context. However, SL does not consider the future effects of the current decision, which may lead to sub-optimal behaviour (Henderson et al., 2008).
With the breakthroughs in deep learning, deep reinforcement learning (DRL) methods that combine neural networks with RL have recently led to successes in learning policies for a wide range of sequential decision-making problems. These include simulated environments like the Atari games (Mnih et al., 2013), the board game Go (Silver et al., 2016) and various robotic tasks (Ng et al., 2006; Peters and Schaal, 2008). Following that, DRL has received a lot of attention and achieved successful results, mainly in single-domain dialogue scenarios (Su et al., 2016a; Fatemi et al., 2016; Su et al., 2017; Lipton et al., 2017). Neural models can extract high-level dialogue states that encode complicated and long language utterances, which was the biggest challenge that early works faced (Levin et al., 1997; Singh et al., 2000). As the focus of DPL research has slowly gravitated to more complicated multi-domain datasets, many RL algorithms face scalability problems (Cuayáhuitl et al., 2016).
Recently, there has been a flurry of works that focus on ways to adapt and improve RL agents in the multi-domain scenario. Few works attempt to review the vast literature on the recent application of RL in DPL of TDS. Graßl (2019) surveyed the use of RL in four types of dialogue systems, namely social chatbots, infobots, task-oriented bots and personal assistant bots. However, the progress and challenges of using RL in TDS were not discussed. Dai et al. (2020) reviewed the recent progress and challenges of dialogue management, but because of its wide scope the review contains only a limited discussion of RL methods in DPL. While they pointed out the three main shortcomings of dialogue management that recent works have been addressing, they do not provide a taxonomy of the methodologies. A comprehensive survey that summarizes the recent challenges and methodologies of applying RL in DPL of TDS is still lacking, which motivates this survey.
In this survey, we focus our discussion on three main recent challenges of applying RL to DPL of TDS, namely exploration efficiency, the cold start problem and the large state-action space. These are the prominent challenges that the majority of recent DPL literature tries to address. The procedure that we use to shortlist the papers for review is provided in Appendix B. We also give an overview of recent works that tackle those challenges. The remainder of this paper is organized as follows. In Section 2, we first provide the problem definition of DPL and elaborate on the recent challenges of using RL to train a dialogue agent in TDS. Then, we motivate and introduce one of the contributions of this survey, a typology of recent DPL works that tackle the mentioned challenges. The typology is based on the five elements in RL: Environment, Policy, State, Action and Reward, which are discussed separately in Sections 3, 4, 5, 6 and 7, respectively. The typology is motivated by the fact that the key differentiating aspects of recently proposed methods can be boiled down to these five fundamental elements of RL. This allows us to highlight the similarities and differences between the methods.
In Section 8, we present the challenges of applying RL dialogue agents in real-life scenarios and three promising future research directions. Finally, we conclude the survey in Section 9.
To sum up, the contributions are:
2020b; Cao et al., 2020). Formally, an MDP is defined as a five-element tuple $(S, A, P, R, \gamma)$. $S$ refers to the dialogue state space that holds the necessary information for the policy to make a decision. $A$ refers to the set of all system actions. $P(s' \mid s, a)$ refers to the transition model $S \times A \times S \to [0, 1]$ of the environment. $R(s, a)$ is the reward function $S \times A \to \mathbb{R}$ that provides an immediate reward for each turn. $\gamma \in (0, 1]$ is the discount factor. Figure 2 provides an overview of the MDP framework.
A full dialogue can be viewed as a trajectory $(s_1, a_1, r_1, \ldots)$, which is generated by the following process in each step. First, the dialogue agent observes the current environment state $s_t \in S$ and responds with an action $a_t \in A$. Second, the environment receives the action and transitions to a new state $s_{t+1} \in S$ according to the transition model. Third, the environment emits a reward $r_t$ after transitioning to the new state.
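At each step $t$, this process gives us a tuple $(s_t, a_t, r_t, s_{t+1})$, which is called a transition. The goal of the RL agent is to learn an optimal deterministic policy $\pi : S \to A$ that maximizes the value function, i.e., the expected total discounted return over a trajectory. It is formally defined as
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_{t+k} = \pi(s_{t+k})\right],$$
where $T$ denotes the final step of the dialogue.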
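For a single sampled trajectory, the discounted return that the value function averages over can be computed directly from the per-turn rewards. The following is a minimal sketch; the function name and the reward scheme (a small penalty per turn and a bonus on success) are illustrative assumptions, not values taken from any surveyed work.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for one trajectory."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    total = 0.0
    # Work backwards so each step needs one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward scheme: -1 per turn, +20 when the task succeeds at the end.
episode_rewards = [-1.0, -1.0, -1.0, 20.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

Averaging this quantity over many dialogues sampled under a fixed policy gives a Monte Carlo estimate of that policy's value at the start state.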

Recent Challenges in Applying RL
In recent years, DPL research has aimed at tackling three main challenges in using RL to train a dialogue agent in a TDS.
1. Exploration Efficiency. It is arduous to find good data sources for an RL agent to learn from. An RL agent interacts with an environment to collect experience for training. In the dialogue system setting, the agent is required to interact with real users (Su et al., 2016b), which is expensive and time-consuming. In practice, the agent interacts with a rule-based user simulator (Schatzmann et al., 2007; Su et al., 2016a). The exploration efficiency depends on how closely the simulator resembles human behaviour, and a close resemblance is not easy to achieve (Walker et al., 1997; Liu and Lane, 2017). Building a high-quality, specialized user simulator for a dataset is laborious.
2. Cold Start Problem. A poorly initialized policy may lead to low-quality interactions with users in online learning settings (Chen et al., 2017b). When successful experiences are rare, the model learns slowly in the beginning, which discourages real users from interacting with the system (Lu et al., 2018, 2020).
3. Large State-Action Space. DPL for complex dialogue tasks such as multi-domain dialogue involves a large state-action space (Peng et al., 2018a; Gordon-Hall et al., 2020a). The dialogue agent is required to explore this large space and often takes many conversation turns to fulfil a task. The long trajectory results in a delayed and sparse reward, which is usually provided at the end of a conversation (Liu and Lane, 2018).

Typology of Approaches
An RL system is composed of five elements: environment, policy, state, action and reward. All the proposed approaches that improve dialogue RL agents can be boiled down to modifications of these five elements. This motivates us to classify the recent approaches in RL dialogue agents by these five elements. This typology not only enables us to outline similarities and differences between approaches in a concise manner, but also allows us to clearly identify the focal points of recent advances of RL methods in DPL research.

Environment
In a typical scenario of DPL, there are two speaker roles: user and system. Most of the current methods are single-agent methods that only model the system side, regarding the user side as the environment (Su et al., 2015a,b, 2016a; Peng et al., 2017; Su et al., 2017; Gordon-Hall et al., 2020b; Li et al., 2020b). Few methods model both roles in dialogues (Liu and Lane, 2017; Papangelis et al., 2019; Zhang et al., 2020a), and works that consider multi-person (more than two persons) dialogue are rare. In this section, we illustrate (1) different methods to build a user simulator (i.e., the environment), and (2) how to model different agents simultaneously.

Single-Agent / User Simulator
Most previous works build a user simulator first and let the single system agent interact with the simulator to obtain a large number of simulated user experiences for RL algorithms. Building a reliable user simulator, however, is not trivial and often requires much expert knowledge or abundant annotated data (Takanobu et al., 2020a). There are two major methods to build a user simulator.
Agenda-based simulator: With the growing need for dialogue systems to handle more complex tasks, it becomes more challenging and laborious to build a fully rule-based user simulator, which requires extensive domain knowledge and expertise. An agenda-based simulator (Schatzmann et al., 2007; Schatzmann and Young, 2009; Li et al., 2016; Ultes et al., 2017) starts a conversation with a randomly generated user goal that is unknown to the dialogue manager. It keeps a stack data structure (i.e., the user agenda) during the course of the conversation. Each entry in the stack maps to an intention the user aims to achieve, and the order follows the first-in-last-out operations of the agenda stack (Gao et al., 2018). An agenda-based simulator stores all information that the user needs to inform and acquire, and acts according to pre-defined rules.
Data-driven simulator: Another method to build a user simulator is to use a sequence-to-sequence framework that generates the user response (utterance or dialogue actions) given the current dialogue context (Sutskever et al., 2014). The dialogue context consists of the historical dialogue content, dialogue goal, constraint status and request status. Such a simulator can be learned and optimized directly from a large amount of human-human dialogue corpora (Eckert et al., 1997; Levin et al., 2000; Chandramohan et al., 2011; Asri et al., 2016).
Although there are several ways to build a user simulator, the gap between user simulators and humans makes dialogue policy optimization harder (Gao et al., 2018). Besides, it remains challenging to evaluate the quality of a user simulator, as it is unclear how to measure how closely the simulator resembles real user behaviour (Williams, 2008; Ai and Litman, 2008; Pietquin and Hastie, 2013).
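The agenda mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the class name, the `(act_type, slot, value)` tuple format and the two hand-written rules are hypothetical, the user goal is passed in rather than sampled randomly, and no surveyed simulator is reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class AgendaUserSimulator:
    """Toy agenda-based user: a stack of pending user dialogue acts."""
    constraints: dict   # slots the user wants to inform, e.g. {"area": "north"}
    requests: list      # slots the user wants to learn, e.g. ["phone"]
    agenda: list = field(default_factory=list)

    def reset(self) -> None:
        # Push requests first so that the constraints end up on top of the stack.
        self.agenda = [("request", slot, None) for slot in self.requests]
        self.agenda += [("inform", s, v) for s, v in self.constraints.items()]

    def step(self, system_act: tuple) -> tuple:
        act_type, slot, _ = system_act
        if act_type == "request" and slot in self.constraints:
            # Rule: answer a system question first by moving it to the top.
            self.agenda = [a for a in self.agenda if a[:2] != ("inform", slot)]
            self.agenda.append(("inform", slot, self.constraints[slot]))
        elif act_type == "inform":
            # Rule: drop a pending request once the system has answered it.
            self.agenda = [a for a in self.agenda if a[:2] != ("request", slot)]
        if not self.agenda:
            return ("bye", None, None)
        return self.agenda.pop()   # first-in-last-out: take the top entry


user = AgendaUserSimulator({"area": "north", "food": "thai"}, ["phone"])
user.reset()
print(user.step(("greet", None, None)))
print(user.step(("request", "area", None)))
```

Real agenda-based simulators use far richer rule sets for pushing and popping entries; the point here is only that the user side is a deterministic, stack-driven program rather than a learned model.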

Multi-Agents
The goal of RL is to discover the optimal strategy $\pi^{*}(a \mid s)$ of the MDP. It can be extended to the $N$-agent setting, where each agent has its own set of states $S_i$ and actions $A_i$. In Multi-Agent Reinforcement Learning (MARL), the state transition $s = (s_1, \ldots, s_N) \to s' = (s'_1, \ldots, s'_N)$ depends on the actions taken by all agents $(a_1, \ldots, a_N)$ according to each agent's policy $\pi_i(a_i \mid s_i)$, where $s_i \in S_i$ and $a_i \in A_i$. Similar to single-agent RL, each agent aims to maximize its local total discounted return $R_i = \sum_t \gamma^{t} r_{i,t}$.
Instead of employing a user simulator, Georgila et al. (2014) demonstrated that two agents that learn concurrently by interacting with each other, without any need for simulated users, can achieve satisfactory performance in a negotiation scenario. Liu and Lane (2017) made the first attempt to apply MARL to task-oriented dialogue policy, learning the system policy and the user policy concurrently. Their method optimizes the two agents from the corpus by iteratively training the system policy and the user policy with the policy gradient method. Thereafter, Papangelis et al. (2019) applied WoLF-PHC, which is based on Q-learning for mixed policies, within the MARL framework for task-oriented dialogue policy to achieve faster learning. Following this line of research, Takanobu et al. (2020a) scaled it to multi-domain dialogue by using the actor-critic framework instead, in order to deal with the large discrete action space in dialogue. Recent work extends the traditional two-agent setting to three agents, leading to a smaller action space and faster learning (Wang and Wong, 2021). Another work explores the MARL framework from a different perspective (Gašic et al., 2015): it uses MARL in a policy committee framework, where each policy decides an action on its own and the decisions are combined by a gating mechanism.

Policy
In this section, we first divide DPL methods into two categories: model-free reinforcement learning and model-based reinforcement learning. The model-free methods can be further divided into hierarchical reinforcement learning (HRL) (Parr and Russell, 1998; Dietterich, 2000) and feudal reinforcement learning (FRL) (Young et al., 2013). In addition, most of these methods require a warm-up stage before training, which is discussed last.

Model-Free RL -HRL
Solving composite tasks, which consist of several inherent sub-tasks, remains a challenge in the research area of dialogue systems. For instance, a composite dialogue of making a hotel reservation involves several sub-tasks, such as looking for a hotel that meets the user's constraints, booking the room and paying for the room. HRL decomposes complex tasks into several sub-tasks and learns different policies for these sub-tasks from the top level to the low level (Budzianowski et al., 2017; Peng et al., 2017; Kristianto et al., 2018). As shown in Figure 3, the top-level policy decides which option (i.e., sub-task) $w \in \Omega$ should be chosen, and the low-level dialogue policy selects the primitive actions $a \in A$ to complete the sub-task given by the top-level policy. Note that a primitive action is an action lasting for one time step, while an option is an action lasting for several time steps. According to the realm of the top-level policy, HRL can be further divided into sub-domain and sub-goal hierarchical reinforcement learning.
Sub-domain. Peng et al. (2017) and Budzianowski et al. (2017) used the options framework (Sutton et al., 1999) to solve the above problem with different approximators. However, each option (i.e., sub-task) and its properties (e.g., starting and terminating conditions, and valid action set) had to be manually defined in their works. Kristianto et al. (2018) proposed a unified framework that integrates option discovery (Bacon et al., 2016; Machado et al., 2017) and achieved performance comparable to the manually defined options framework.
Sub-goal. Instead of decomposing a task according to the corresponding domain, it is also possible to divide a complex goal-oriented task into a set of simpler sub-goals. Inspired by the sequence segmentation model (Wang et al., 2017), Tang et al. (2018) proposed the Subgoal Discovery Network (SDN), which discovers and exploits the hidden structure of the task to enable efficient policy learning.
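The two-level control flow described above, in which an option persists over several time steps while primitive actions last one step, can be sketched as follows. This is an illustrative rollout loop only: all function names, the slot-naming scheme and the rule-based toy policies are assumptions, and nothing is learned here.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]   # hypothetical state: slot name -> already filled?


def run_hierarchical_episode(
    state: State,
    top_policy: Callable[[State], str],
    low_policies: Dict[str, Callable[[State], str]],
    is_option_done: Callable[[str, State], bool],
    apply_action: Callable[[State, str], State],
    max_turns: int = 20,
) -> List[Tuple[str, str]]:
    """Roll out a two-level policy and return the (option, action) trace."""
    trace = []
    option = None
    for _ in range(max_turns):
        if all(state.values()):
            break
        # The top-level policy is consulted only when the current option ends.
        if option is None or is_option_done(option, state):
            option = top_policy(state)
        action = low_policies[option](state)   # primitive action, one time step
        state = apply_action(state, action)
        trace.append((option, action))
    return trace


def top_policy(state: State) -> str:
    # Pick the first sub-task that still has an unfilled slot.
    return next(s.split(".")[0] for s, filled in state.items() if not filled)


def make_low_policy(option: str) -> Callable[[State], str]:
    def policy(state: State) -> str:
        slot = next(s for s, f in state.items()
                    if s.startswith(option + ".") and not f)
        return "request:" + slot
    return policy


def is_option_done(option: str, state: State) -> bool:
    return all(f for s, f in state.items() if s.startswith(option + "."))


def apply_action(state: State, action: str) -> State:
    return {**state, action.split(":", 1)[1]: True}


start = {"hotel.area": False, "hotel.price": False, "pay.card": False}
low = {name: make_low_policy(name) for name in ("hotel", "pay")}
trace = run_hierarchical_episode(start, top_policy, low, is_option_done, apply_action)
print(trace)
```

In an HRL dialogue agent both levels would be learned policies, and the termination condition of each option would be either hand-defined (sub-domain methods) or discovered (option or sub-goal discovery).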

Model-Free RL -FRL
Feudal Reinforcement Learning (FRL) (Dayan and Hinton, 1992) is another interesting attempt to solve the large state and action space problem. FRL decomposes a task spatially to restrict the action space of each sub-policy, whereas the HRL methods above decompose a task temporally, solving a different sub-task at different time steps (Gao et al., 2018; Dai et al., 2020). Casanueva et al. (2018a) first applied FRL to task-oriented dialogue systems, decomposing the decision into two steps based on its relevance to slots: a master policy first selects a subset of primitive actions, and a primitive action is then chosen from the selected subset. The decisions in the two steps use different parts of the abstracted state. Furthermore, Casanueva et al. (2018b) showed that feature extraction can be learned jointly with the policy model while obtaining similar performance, and even outperforming handcrafted features, in feudal dialogue management.
In contrast to HRL, which decomposes a task into temporally separated sub-tasks, FRL decomposes a complex decision spatially (Gao et al., 2018). Although both HRL and FRL can be used to address the problem of large dimensionality, each has a well-known limitation: the decomposition in HRL often requires expert knowledge, while FRL does not consider the mutual constraints between sub-tasks (Dai et al., 2020).
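The two-step feudal decision can be illustrated with a greedy selection over value estimates. The sketch below is an assumption-laden simplification: the subset names, the action names and the use of plain dictionaries of Q-values are hypothetical, and the state-abstraction part of feudal dialogue management is omitted.

```python
from typing import Dict


def feudal_select(master_q: Dict[str, float],
                  sub_q: Dict[str, Dict[str, float]]) -> str:
    """Two-step greedy choice: pick an action subset, then an action inside it."""
    subset = max(master_q, key=master_q.get)   # step 1: master policy
    actions = sub_q[subset]
    return max(actions, key=actions.get)       # step 2: sub-policy


# Hypothetical value estimates for a single dialogue state.
master_q = {"slot_independent": 0.2, "slot_dependent": 0.7}
sub_q = {
    "slot_independent": {"hello": 0.1, "bye": 0.3},
    "slot_dependent": {"request(area)": 0.9, "confirm(food)": 0.4},
}
print(feudal_select(master_q, sub_q))
```

Each sub-policy only ever ranks the actions in its own subset, which is how the spatial decomposition restricts the action space that any single policy has to explore.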

Model-Based RL
Different from model-free RL methods, model-based RL models the environment to predict state transitions, which enables planning for dialogue policy learning (Zhang et al., 2020b). Deep Dyna-Q (DDQ) (Peng et al., 2018b) is the first deep RL framework that integrates planning for task-completion DPL, and it effectively leverages a small number of real conversations. Specifically, the environment is modelled by a world model that mimics the real user response and generates simulated experience. Recently, more DDQ variants have been proposed to improve the quality of simulated experience through adversarial training (Su et al., 2018), active learning (Wu et al., 2018) and human teaching (Zhang et al., 2019).
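The interplay of real experience, world-model learning and planning can be sketched with classical tabular Dyna-Q. This is a deliberate simplification and not the DDQ implementation: lookup tables replace the neural Q-network and the neural world model, and the toy slot-filling environment, its rewards and all hyperparameters are assumptions made for illustration.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, episodes=200, planning_steps=5,
           alpha=0.5, gamma=0.95, epsilon=0.1, max_turns=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)]
    world_model = {}         # (state, action) -> (reward, next_state, done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_turns):
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: q[(s, b)])
            r, s2, done = env_step(s, a)          # real experience
            update(s, a, r, s2, done)             # direct RL
            world_model[(s, a)] = (r, s2, done)   # world-model learning
            for _ in range(planning_steps):       # planning on simulated experience
                ps, pa = rng.choice(list(world_model))
                update(ps, pa, *world_model[(ps, pa)])
            s = s2
            if done:
                break
    return q


def toy_env_step(filled, action):
    """Hypothetical task: ask for 3 slots, then book. State = slots filled."""
    if action == "ask":
        return -1.0, min(filled + 1, 3), False
    # Booking ends the dialogue: success only if all slots are filled.
    return (20.0 if filled == 3 else -10.0), filled, True


ACTIONS = ["ask", "book"]
q_table = dyna_q(toy_env_step, start_state=0, actions=ACTIONS)
greedy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(4)}
print(greedy)
```

Each real turn is reused several times through the stored model, which is the mechanism that lets Dyna-style methods learn from a small number of real conversations.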

Warm-up by Imitation Learning
Imitation Learning (IL) allows the policy to imitate expert demonstrations directly without exploring the environment, leading to an effective initialization at the warm-up stage (Abbeel and Ng, 2004). With limited warm-up steps based on a few expert demonstrations, the learning of the dialogue RL agent can be accelerated (Su et al., 2016a; Fatemi et al., 2016). However, another line of work points out that IL requires the expert demonstrations and the transition dynamics of the RL environment to follow the same distribution, which is often not the case in DPL. Thus, it is critical to follow up IL with RL methods (Liu and Lane, 2017; Peng et al., 2018b).
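The warm-up idea can be illustrated with the simplest form of imitation, behaviour cloning by majority vote. This is a tabular stand-in chosen for brevity, not the supervised neural pre-training used in the cited works; the state and action labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def behaviour_cloning(
    demonstrations: Iterable[Tuple[Hashable, str]],
) -> Dict[Hashable, str]:
    """Map each demonstrated state to the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Hypothetical expert demonstrations as (state, action) pairs.
demos = [
    ("area_missing", "request_area"),
    ("area_missing", "request_area"),
    ("area_missing", "inform_name"),
    ("all_filled", "book"),
]
warm_start_policy = behaviour_cloning(demos)
print(warm_start_policy)
```

The resulting policy only covers states that appear in the demonstrations, which mirrors the distribution mismatch noted above and is one reason the warm-up is followed by RL.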

State Space
The dialogue state encodes the essential information in the dialogue history for the dialogue policy to generate the next system action. Recent research mainly uses two types of state representation: the multi-hot representation and the distributed representation.
Most works using the multi-hot representation are based on a belief vector that simply concatenates the one-hot vectors of all slots (Takanobu et al., 2019, 2020a; Xu et al., 2020; Jhunjhunwala et al., 2020). These multi-hot representations are often simple to implement but require feature engineering. On the other hand, some works (Liu and Lane, 2017; Wu et al., 2018; Peng et al., 2018b) adopted the approach of Mrkšić et al. (2017), in which the state representations are learned directly from the user's utterances. This approach requires no human intervention and can handle variations in the input (Mrkšić et al., 2017). Saha et al. (2020) extended the state representation with multi-modal information by adding image and sentiment representations to the state.
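The multi-hot belief vector described above can be built by concatenating one block per slot. In this sketch the ontology, the slot names and the extra "not filled" bit at the end of each block are illustrative assumptions; actual systems typically append further hand-engineered features.

```python
from typing import Dict, List, Optional


def multi_hot_state(ontology: Dict[str, List[str]],
                    belief: Dict[str, Optional[str]]) -> List[int]:
    """Concatenate one one-hot block per slot; the last bit means 'not filled'."""
    vector: List[int] = []
    for slot, values in ontology.items():
        block = [0] * (len(values) + 1)
        value = belief.get(slot)
        index = values.index(value) if value in values else len(values)
        block[index] = 1
        vector.extend(block)
    return vector


# Hypothetical two-slot ontology for a restaurant domain.
ontology = {"area": ["north", "south"], "food": ["thai", "italian", "french"]}
print(multi_hot_state(ontology, {"area": "south"}))
```

The vector length grows with the number of slots and values in the ontology, which is one reason this representation needs manual redesign when the domain changes.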

Action Space
Most works treat the action space as the set of dialogue acts. A dialogue act is specified by a dialogue act type, which indicates the type of action the user or agent is performing, and a set of slot-value pairs that specify the imposed constraints (De Mori, 2007). Chen et al. (2019b) pointed out that having a separate set of dialogue acts for each domain is not scalable as we work towards multi-domain, large-scale scenarios. They alleviated this problem by building a multi-layer hierarchical graph to exploit the structure of dialogue acts. While this work prevents the number of dialogue acts from growing exponentially with the number of domains, Zhao et al. (2019) took another approach, treating the action space as a latent variable and using an unsupervised method to induce an appropriate action space from the data.
At the other end of the spectrum, some works represent dialogue acts as sequences and formulate dialogue act prediction as a sequence generation problem (Shu et al., 2019). The advantage of this method is its ability to output multiple actions per turn. In contrast, most existing DPL methods that are formulated as a classification problem can only predict one system action per turn.
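The structure of a dialogue act, and the difference between single-action and multi-action turns, can be made concrete with a small data type. The class name, field names, the use of "?" as a placeholder value for requests and the example values are all illustrative assumptions rather than a format defined by the cited works.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DialogueAct:
    """A dialogue act type plus the slot-value pairs it carries."""
    act_type: str
    slot_values: Tuple[Tuple[str, str], ...] = ()

    def __str__(self) -> str:
        args = ", ".join(f"{slot}={value}" for slot, value in self.slot_values)
        return f"{self.act_type}({args})"


# Classification-style policy: exactly one act per turn.
single_action_turn = DialogueAct("request", (("area", "?"),))

# Sequence-generation-style policy: a list of acts in one turn.
multi_action_turn = [
    DialogueAct("inform", (("name", "Golden House"),)),
    DialogueAct("request", (("area", "?"),)),
]
print(single_action_turn, [str(act) for act in multi_action_turn])
```

Enumerating every (type, slot) combination per domain is what makes the flat action space grow quickly in multi-domain settings.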

Reward
Below, we present two streams of work that make use of the provided expert demonstrations to learn a denser reward and thereby encourage faster learning in RL: inverse reinforcement learning (IRL) based methods and reward shaping. Figure 5 shows an overview of the pipeline of IRL methods and reward shaping.

Inverse Reinforcement Learning Method
IRL is a fundamental technique to learn a reward function that underlies the expert demonstrations (Russell, 1998). Boularias et al. (2010) were the first to explore this idea in DPL, learning a reward function from a human expert in a Wizard-of-Oz setting. They proposed a reward function that is a linear combination of feature vectors with unknown weights. The weights are first learned from the expert demonstrations, and the learned reward function is then used in RL. The learned reward function can provide meaningful feedback to the policy, which helps it to learn effectively, especially in the early stage.
IRL is often expensive to run, which hinders scaling it to more complex dialogue scenarios (Ho and Ermon, 2016). In the RL community, Adversarial IRL (AL-IRL) was proposed to make learning the reward from expert demonstrations more efficient (Ho and Ermon, 2016). Liu and Lane (2018) explored AL-IRL in DPL and used the discriminator to differentiate successful dialogues from unsuccessful ones. Extending this line of research, Takanobu et al. (2019) further combined AL with maximum entropy IRL to learn the policy and the reward estimator alternately.

Reward Shaping
Reward shaping aims to incorporate domain knowledge into RL by introducing an extra reward in addition to the reward provided by the environment (Ng, 1999). Ferreira and Lefèvre (2013) learned an extra reward from the social cues of the user. In this work, they mainly consider sentiment cues from the user that are defined manually, including the type of dialogue acts, the number of slots filled and the agenda size. While this method does not need extra annotated data, the manually defined features do not scale to other domains. Wang et al. (2020a) took advantage of human demonstrations and used a multi-variate Gaussian to pick the most similar state-action pair to complement the main reward. On the whole, these papers highlight the benefit of using a dense reward in DPL. An important difference between the inverse reinforcement learning method and reward shaping is that the former learns one single reward function, while the latter adds a reward function on top of the main reward provided by the environment.
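One well-known way to add such an extra reward is potential-based shaping, in which the bonus is the discounted change of a potential function over states. The sketch below uses this form as an assumption for illustration; the potential (number of filled slots) and the reward values are hypothetical, and the surveyed dialogue works learn their extra rewards differently.

```python
from typing import Callable, Dict, Optional

State = Dict[str, Optional[str]]   # hypothetical state: slot -> value or None


def shaped_reward(env_reward: float, state: State, next_state: State,
                  potential: Callable[[State], float],
                  gamma: float = 0.99) -> float:
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return env_reward + gamma * potential(next_state) - potential(state)


def filled_slots(state: State) -> float:
    # Hypothetical potential: progress measured as the number of filled slots.
    return float(sum(value is not None for value in state.values()))


before = {"area": None, "food": None}
after = {"area": "north", "food": None}
# A turn that costs -1 from the environment but fills one slot.
print(shaped_reward(-1.0, before, after, filled_slots, gamma=1.0))
```

The sparse end-of-dialogue reward is left untouched; the shaping term only adds per-turn feedback on intermediate progress.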

Future Direction
As the objective of a TDS is to help users achieve their goals, future research should aim toward applying TDS in real-world scenarios. There are two main obstacles in our way: the data scarcity problem, which can be addressed by either domain adaptation or meta policy learning, and the lack of robustness in evaluation.
Data Scarcity. There are many different types of real-world dialogue scenarios, such as restaurant booking, weather query and flight booking. It is extremely costly to obtain a large amount of annotated data for different domains. However, most recent methods presented in this survey often require a lot of expert demonstrations. As a result, for a TDS to be applicable, we should develop techniques and methods to learn a dialogue policy efficiently and effectively in domains that have scarce data. Domain Adaptation and Meta Policy Learning are two effective and promising solutions to tackle this problem.
Evaluation Robustness. It is very important to evaluate the performance of a dialogue policy in assisting humans to complete tasks. Currently, the most widely used way to evaluate a dialogue policy is to let a user simulator interact with the dialogue agent and compute metrics over the interactions. This evaluation method does not correctly reflect how well a dialogue policy can assist a human in completing their task.
Below we outline two promising future directions for tackling the data scarcity problem and our insight on a better evaluation method.

Data Scarcity Problem
Domain Adaptation. Domain adaptation, or policy transfer, allows us to build a dialogue policy in a target domain that has scarce data, given a large amount of data in a source domain. Chen et al. (2018) proposed a multi-agent dialogue policy (MADP) that consists of slot-dependent agents with shared parameters for every slot. Those shared parameters can be transferred to a new domain for the common slots. In a similar fashion, Ilievski et al. (2018) matched the state space and action space between the source domain and the target domain, even if those actions or slots are never used in the source domain. The parameters of the common slots and actions are used to initialize the target domain. However, different domains do not necessarily have common actions or consistent dialogue act naming. Mo et al. (2018) proposed the PROMISE model, which learns the similarity between the slots and actions of different domains. While this research focuses on domain adaptation between two domains, more work needs to be done on adapting from multiple source domains.
Meta Policy Learning. To further extend the usage of DPL to real-world scenarios, we should consider situations with even scarcer data resources. Above, we discussed the direction that leverages the abundant data in a source domain. Here, we consider the meta-learning paradigm, which tackles the situation in which all domains have scarce data. Recently, Mi et al. (2019) adopted meta-learning in the NLG module of the SDS pipeline. Inspired by this work, Xu et al. (2020) proposed the Deep Transferable Q-Network (DTQN), which leverages shareable features across domains. They further combine DTQN with Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) and a dual-replay mechanism to support effective off-policy learning, which helps models adapt to an unseen domain quickly. Zhang et al. (2019) extended DDQ by incorporating Budget-Conscious Scheduling to learn from a fixed, small number of interactions. It uses a decayed Poisson process to model the number of interactions allocated to each epoch, where the total number of epochs is predefined. More work is needed to explore efficient learning methods in TDS under the meta-learning paradigm.

Evaluation
In DPL research, Walker et al. (1997) were the first to present a general framework to evaluate the performance of a dialogue agent. They evaluate a dialogue from two aspects. One is the dialogue cost, which measures the cost induced by the dialogue (e.g., the number of turns), and the other is task success, which evaluates whether the dialogue agent successfully accomplishes the user's task by comparing the outcome with the user goal. In practice, the dialogue policy is often evaluated by having conversations with a simulated user, using metrics such as inform F1, success rate and BLEU score (Takanobu et al., 2020b). The problem is that the simulator does not resemble human conversation behaviour well, as discussed in Section 3. Therefore, there is still a gap between human evaluation and simulated evaluation (Takanobu et al., 2020b). We believe that much work is needed to provide a universal evaluation framework that can be used for any general TDS. Instead of using metrics that compare the dialogue acts with the simulated goal, a universal evaluation framework should emphasize the overall satisfaction of a human user. Such a framework should include, but not be limited to, ways to measure how natural or helpful the response of the dialogue agent is to the user.
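The simulator-based metrics mentioned above can be computed from logged dialogues as in the following sketch. The record format and the per-dialogue averaging are assumptions made for illustration; benchmark toolkits may define inform F1 and success with additional conditions.

```python
from typing import Dict, List, Set


def inform_f1(informed: Set[str], requested: Set[str]) -> float:
    """F1 between the slots the system informed and the slots the user requested."""
    true_positives = len(informed & requested)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(informed)
    recall = true_positives / len(requested)
    return 2 * precision * recall / (precision + recall)


def evaluate(dialogues: List[Dict]) -> Dict[str, float]:
    """Average task success, dialogue cost (turns) and inform F1 over dialogues."""
    n = len(dialogues)
    return {
        "success_rate": sum(d["success"] for d in dialogues) / n,
        "avg_turns": sum(d["turns"] for d in dialogues) / n,
        "inform_f1": sum(inform_f1(d["informed"], d["requested"])
                         for d in dialogues) / n,
    }


# Hypothetical logs from two simulated dialogues.
logs = [
    {"success": True, "turns": 6,
     "informed": {"phone", "address"}, "requested": {"phone"}},
    {"success": False, "turns": 10,
     "informed": {"phone"}, "requested": {"phone"}},
]
print(evaluate(logs))
```

All three numbers are computed against the simulated user goal, so none of them captures how natural or helpful the responses would feel to a human user.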

Conclusion
In this survey, we introduce the recent advances in RL approaches applied to DPL of TDS, which focus on tackling three main challenges. Given the vast amount of work in this area in recent years, a typology of approaches is needed to identify the main research directions in applying RL to DPL. We contribute such a typology, which is based on which of the five RL elements the approaches adapt. As we move towards applying TDS in real-world scenarios, the scarce data of various dialogue scenarios and the lack of robust evaluation of dialogue agents will be the most prominent obstacles. To this end, three fruitful research directions are suggested to tackle them.

A Procedure for Shortlisting Papers
We use a two-step procedure to shortlist relevant papers for review. In the first step, we use two tools to search for relevant papers. The two tools are 1) AMiner, which can provide literature dating back to 1922 given a topic keyword, and 2) Connected Papers, which provides a graph of strongly connected papers given a seed paper. We use AMiner with the keyword "dialogue policy" to search for papers from the past ten years. We then use each paper in the returned list as a seed paper for Connected Papers and select further related papers from the provided graph. After that, we go through the papers manually and select those that apply RL methods in DPL of TDS as the preliminary papers. In the second step, we go through the references of the preliminary papers and pick the relevant ones.

Figure 1: An overview of a task-oriented dialogue system. All blue parts represent the four components of a pipeline dialogue system.

Figure 2: The framework of the Markov Decision Process in DPL. At time t, the system takes an action a_t, receives a reward r_t and a termination signal t, and then transitions to a new state s_{t+1}.

Figure 3: The overview of two levels of policies in hierarchical reinforcement learning (Peng et al., 2017).

Figure 4: The RL architecture of using imitation learning.

Figure 5: Two strategies to learn a denser reward.