Survey on evaluation methods for dialogue systems

In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation, in and of itself, is a crucial part during the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods which allow a reduction in involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for the dialogue systems and then present the evaluation methods regarding that class.


Introduction
As the amount of digital data continuously grows, users demand technologies that offer quick access to such data. In fact, users rely on systems that support information search interactions, such as Siri, Google Assistant, Amazon Alexa or Microsoft XiaoIce (Zhou et al. 2018). These technologies, called Dialogue Systems (DS), allow the user to converse with a computer system using natural language. Dialogue Systems are applied to a variety of tasks, e.g.:
• Virtual Assistants aid users in everyday tasks, such as scheduling appointments. They usually operate on predefined actions which can be triggered by voice command.
• Information-seeking systems provide users with information about a question (e.g. the most suitable hotel in town). These questions include factual questions as well as more complex questions.
• E-learning dialogue systems train students for various situations. For instance, they train the interaction with medical patients or train military personnel in questioning a witness.
One crucial step in the development of DS is evaluation, that is, measuring how well the DS is performing. However, evaluating a dialogue system can prove to be problematic because there are two important factors to be considered. Firstly, the definition of what constitutes a high-quality dialogue is not always clear and often depends on the application. Even if a definition is assumed, it is not always clear how to measure it. For instance, if we assume that a high-quality dialogue system is defined by its ability to respond with an appropriate utterance, it is not clear how to measure appropriateness or what appropriateness means for a particular system. Moreover, one might ask the users if the responses were appropriate, but as we will discuss below, user feedback might not always be reliable for a variety of reasons. The second factor is that the evaluation of dialogue systems is very cost- and time-intensive. This is especially true when the evaluation is carried out by a user study, which requires careful preparation as well as inviting and compensating users for their participation.
Over the past decades, many different evaluation methods have been proposed. The evaluation methods are closely tied to the characteristics of the dialogue system which they are aimed at evaluating. Thus, quality is defined in the context of the function which the dialogue system is meant to fulfil. For instance, a system designed to answer questions will be evaluated on the basis of correctness, which is not necessarily a suitable metric for evaluating a conversational agent.
Most methods are aimed at automating the evaluation, or at least automating certain aspects of the evaluation. The goal of an evaluation method is to obtain automated and repeatable evaluation procedures that allow efficient comparisons in the quality of different dialogue strategies.
This survey is structured as follows: in the next section we give a general overview of the different classes of dialogue systems and their characteristics. We then introduce the evaluation task in greater detail, with an emphasis on the goals of an evaluation and the requirements on an evaluation metric. In Sects. 3, 4, and 5, we introduce each dialogue system class (i.e. task-oriented systems, conversational agents and question answering dialogue systems). Thereafter, we give an overview of the characteristics, dialogue behaviour, and concepts behind the implementation methods of the various dialogue systems. Finally, we present the evaluation methods and the ideas behind them. Here, we place an emphasis on the relationship between these methods and the dialogue system classes, including which aspects of the evaluation are automated. In Sect. 6, we give a short overview of the relevant datasets and evaluation campaigns in the domain of dialogue systems. In Sect. 7, we discuss the issues and challenges in devising automated evaluation methods and discuss the level of automation achieved.

Dialogue systems
Dialogue Systems (DS) usually structure dialogues in turns, where each turn is defined by one or more utterances from one speaker. Two consecutive turns by two different speakers are called an exchange. Multiple exchanges constitute a dialogue. A different but related view is to interpret each turn or each utterance as an action (more on this later).
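The turn, exchange and dialogue terminology can be made concrete with a small data structure. The following is only a minimal sketch: the class names, the field names and the choice to pair turns into non-overlapping exchanges are illustrative assumptions and do not come from any particular dialogue system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str           # e.g. "user" or "system" (assumed labels)
    utterances: List[str]  # one or more utterances from the same speaker


@dataclass
class Exchange:
    first: Turn
    second: Turn


def to_exchanges(turns: List[Turn]) -> List[Exchange]:
    """Pair consecutive turns by different speakers into exchanges.

    Assumption: exchanges do not overlap, so a turn belongs to at most one.
    """
    exchanges = []
    i = 0
    while i + 1 < len(turns):
        if turns[i].speaker != turns[i + 1].speaker:
            exchanges.append(Exchange(turns[i], turns[i + 1]))
            i += 2
        else:
            i += 1
    return exchanges


# A dialogue is then simply a list of turns (or of exchanges).
dialogue = [
    Turn("user", ["Hi.", "I am looking for a restaurant."]),
    Turn("system", ["What kind of food would you like?"]),
    Turn("user", ["Italian."]),
    Turn("system", ["Roma is a nice Italian restaurant."]),
]
exchanges = to_exchanges(dialogue)
```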
The main component of a dialogue system is the dialogue manager, which defines the content of the next utterance and thus the behaviour of the dialogue system. There are many different approaches to designing a dialogue manager, which are partly dictated by the application of the dialogue system. However, there are three broad classes of dialogue systems that we encounter in the literature: task-oriented systems, conversational agents and interactive question answering systems. We identified the following characteristic features that help differentiate between the three classes: whether the system is developed to solve a task, whether the dialogue follows a structure, whether the domain is restricted or open, whether the dialogue spans multiple turns, whether the dialogues are long or rather efficient, who takes the initiative, and what interface is used (text, speech, multi-modal). Table 1 depicts the characteristics for each of the dialogue system classes. In this table, we can see the following main features for each class:
• Task-oriented systems are developed to help the user solve a specific task as efficiently as possible. The dialogues are characterized by following a clearly defined structure that is derived from the domain. The dialogues follow mixed initiative; both the user and the system can take the lead. Usually, the systems found in the literature are built for speech input and output. However, task-oriented systems in the domain of assisting users are built on multi-modal input and output.
• Conversational agents display a more unstructured conversation, as their purpose is to have open-domain dialogues with no specific task to solve. Most of these systems are built to emulate social interactions, and thus longer dialogues are desired.
• Question Answering (QA) systems are built for the specific task of answering questions. The dialogues are not defined by a structure as with task-oriented systems; however, they mostly follow a question-and-answer pattern. QA systems may be built for a specific domain, but may also be tilted towards more open-domain questions. Usually, the domain is dictated by the underlying data, e.g. knowledge bases or text snippets from forums. Traditional QA systems work on a single-turn interaction; however, there are systems that allow multiple turns to cover follow-up questions. The initiative is mostly taken by the user, who asks questions.

Evaluation
Evaluating dialogue systems is a challenging task and the subject of much research. We define the goal of an evaluation method as having an automated, repeatable evaluation procedure with high correlation to human judgments, which is able to differentiate between various dialogue strategies and is able to explain which features of the dialogue systems are important. Thus, the following requirements can be stated:
• Automatic: in order to reduce the dependency on human labour, which is time- and cost-intensive as well as not necessarily repeatable, the evaluation method should be automated, or at least partially automated.
• Repeatable: the evaluation method should yield the same result if applied multiple times to the same dialogue system under the same circumstances.
• Correlated to human judgments: the procedure should yield ratings that correlate to human judgments.
• Differentiate between different dialogue systems: the evaluation procedure should be able to differentiate between different strategies. For instance, if one wants to test the effect of a barge-in feature (i.e. allowing the user to interrupt the dialogue system), the evaluation procedure should be able to highlight the effects.
• Explainable: the method should give insights into which features of the dialogue system impact the quality of the dialogue and in which manner they do so. For instance, the methods should reveal that the automatic speech recognition system's word-error rate has a high influence on the quality of the natural language understanding component, which in turn impacts the intent classification.
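The requirement of correlation to human judgments is usually checked by computing a correlation coefficient between the scores of an automatic metric and the human ratings of the same dialogues. The sketch below computes Pearson and Spearman correlations in plain Python; the ratings and metric scores are made-up illustrative values, not results from any study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: human ratings (1-5) and automatic metric scores
# for the same five dialogues.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.4, 0.35, 0.8, 0.9]
rho = spearman(human_ratings, metric_scores)
```

Rank correlation is a common choice here because human ratings are typically ordinal, so only the ordering of dialogues is compared.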
In this survey, we focus on the efforts of automating the evaluation process. This is a very difficult, but crucial task, as human evaluations are cost- and time-intensive. Although much progress has been made in automating the evaluations of dialogue systems, the reliance on human evaluation is still present. Here, we give a condensed overview of the human-based evaluations used in the literature.
Human evaluation There are various approaches to a human evaluation. The test subjects can take on two main roles: interacting with the system, rating a dialogue or utterance, or both. In the following, we differentiate among different types of user populations. Within each of the populations, the subjects can take on either of the two roles.
• Lab experiments: Before crowdsourcing was popular, dialogue systems were evaluated in a lab environment. Users were invited to participate in the lab, where they interacted with a dialogue system and subsequently filled in a questionnaire. For instance, Young et al. (2010) recruited 36 subjects, who were given instructions and presented with various scenarios. The subjects were asked to solve a task using a spoken dialogue system. Furthermore, a supervisor was present to guide the users. The lab environment is very controlled, which is not necessarily comparable to the real world (Black et al. 2011; Schmitt and Ultes 2015).
• In-field experiments: Here, the evaluation is performed by collecting feedback from real users of the dialogue systems (Lamel et al. 2000). For instance, for the Spoken Dialogue Challenge (Black et al. 2011), the systems were developed to provide bus schedule information in Pittsburgh. The evaluation was performed by redirecting the evening calls to the dialogue systems and getting the user feedback at the end of the conversation. The Alexa Prize also followed the same strategy, i.e. it let real users interact with operational systems and gathered user feedback over a span of several months.
• Crowdsourcing: Recently, human evaluation has shifted from a lab environment to using crowdsourcing platforms such as Amazon Mechanical Turk (AMT). These platforms provide large amounts of recruited users. Jurcícek et al. (2011) evaluate the validity of using crowdsourcing for evaluating dialogue systems, and their experiments suggest that, with enough crowdsourced users, the quality of the evaluation is comparable to the lab conditions. Current research relies on crowdsourcing for human evaluation (Wen et al. 2017). Conversational dialogue systems in particular are evaluated via crowdsourcing, where there are two main evaluation procedures: crowdworkers either talk to the system and rate the interaction, or they are presented with a context from the test set and a response by the system, which they need to rate. In both settings, the crowdworkers are asked to rate the system based on quality, fluency or appropriateness. Recently, Adiwardana et al. (2020) introduced the Sensibleness and Specificity Average (SSA), where humans rate the sensibleness and specificity of a response. These capture two aspects of human behaviour: making sense and being specific. A dialogue system can be sensible by responding with vague answers (e.g. "I don't know"), whereas it is only specific if it takes the context into account.
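As a rough illustration of how SSA aggregates the two judgements, the sketch below averages the fraction of responses rated sensible and the fraction rated specific. The input format is a hypothetical one, and the rule that a response which is not sensible is never counted as specific is stated here as an assumption about the labelling protocol.

```python
def ssa(labels):
    """Sensibleness and Specificity Average over rated responses.

    labels: list of (sensible, specific) boolean pairs, one per response.
    Assumption: a response that is not sensible is not counted as specific.
    """
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2


# Hypothetical ratings for four system responses.
ratings = [(True, True), (True, False), (False, False), (True, True)]
score = ssa(ratings)  # sensibleness 0.75, specificity 0.5
```

A vague but plausible response such as "I don't know" raises only the sensibleness term, which is why a system cannot reach a high SSA with generic answers alone.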
Human-based evaluation is difficult to set up and to carry out. Much care has to be taken in setting up the experiments; the users need to be properly instructed and the tasks need to be prepared so that the experiment reflects real-world conditions as closely as possible. Furthermore, one needs to take into account the high variability of user behaviour, which is present especially in crowdsourced environments.
Automated evaluation procedures A procedure which satisfies the aforementioned requirements has not yet been developed. Most evaluation procedures either require a degree of human involvement in order to be somewhat correlated to human judgement, or they require significant engineering effort. The evaluation methods which we cover in this survey can be categorized as follows: model the human judges, model the user behaviour, or use fine-grained methods, which evaluate a specific aspect of the dialogue system (e.g. its ability to adhere to a topic). Methods that model human judges rely on human judgements being collected beforehand so as to fit a model which predicts the human rating. User behaviour models involve a significant engineering step in order to build a model which emulates the human behaviour. The finer-grained methods also need a certain degree of engineering, which depends on the feature being evaluated. The common trait of the evaluation methods covered in this survey is that they are coupled to the characteristics of the dialogue system being considered. That is, a task-oriented dialogue system is evaluated differently from a conversational dialogue system.

Modular structure of this article
Different evaluation procedures have been proposed based on the characteristics of the dialogue system class. For instance, the evaluation of task-oriented systems exploits the highly structured dialogues. The goal can be precisely defined and measured to compute the task-success rate. On the other hand, conversational agents generate dialogues that are more unstructured, which can be evaluated on the basis of appropriateness of the responses; this has been shown to be difficult to automate. We introduce each type of dialogue system to highlight the respective characteristics and methods used to implement the dialogue system. With this knowledge, we introduce the most important concepts and methods developed to evaluate the respective class of dialogue system. In the following survey, we discuss each of the three classes of dialogue systems separately. Thus, Sect. 3: Task Oriented Dialogue Systems, Sect. 4: Conversational Agents, and Sect. 5: Interactive Question Answering can be read independently from each other.

Characteristics
As the name suggests, a task-oriented dialogue system is developed to perform a clearly defined task. These dialogue systems are usually characterized by a clearly defined and measurable goal, a structured dialogue behaviour, a closed domain to work on and a focus on efficiency. Usually, the task involves finding information within a database and returning it to the user, performing an action, or retrieving information from its users. For instance, a restaurant information dialogue system helps the user to find a restaurant which satisfies the user's constraints. Furthermore, task-oriented dialogue systems also serve as interfaces to program APIs, which is often the case in the Smart Home setting (Möller et al. 2004). For example, an in-car entertainment dialogue system can be ordered via voice commands to start playing music or to query the agenda (see Fig. 1 for an example).
The commonality is that the dialogue system infers the task constraints through the dialogue and retrieves the information requested by the user. For a ticket reservation system, the dialogue system needs to know the origin station, the destination, and the departure date and time. In most cases, the dialogue system is designed for a specific domain, such as restaurant information. The nature of these dialogue systems makes the dialogues both very structured and tailored. The ideal dialogue satisfies the user goal with as few interactions as possible. The dialogues are characterized by mixed initiative: the user states their goal, but the dialogue system proactively asks questions to retrieve the required constraints.

Fig. 1 Example dialogue (Eric et al. 2017). The dialogue system guides the driver through the various options

Dialogue structure
The dialogue structure for task-oriented systems is defined by two aspects: the content of the conversation and the strategy used within the conversation. Content The content of the conversation is derived from the domain ontology. The domain ontology is usually defined as a list of slot-value pairs. For instance, Table 2 shows the domain ontology for the restaurant domain (Novikova et al. 2017). Each slot has a type and a list of values, which the slot can be filled with.
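A domain ontology of this kind can be represented as a mapping from slot names to a type and a list of admissible values. The sketch below is only an illustration: the slots and values are hypothetical and are not the contents of Table 2.

```python
# Hypothetical restaurant ontology: each slot has a type and, for
# categorical slots, the list of values it can be filled with.
ONTOLOGY = {
    "food": {"type": "categorical", "values": ["Italian", "French"]},
    "area": {"type": "categorical", "values": ["riverside", "centre"]},
    "pricerange": {"type": "categorical", "values": ["cheap", "moderate", "expensive"]},
    "phone": {"type": "free-text", "values": None},  # any string is accepted
}


def is_valid(slot, value):
    """Check whether a slot-value pair is admissible under the ontology."""
    spec = ONTOLOGY.get(slot)
    if spec is None:
        return False  # unknown slot
    return spec["values"] is None or value in spec["values"]
```

Such a check is typically applied to the output of the language understanding component, so that only slot-value pairs known to the domain enter the dialogue state.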
Strategy While the domain ontology defines the content of the dialogue, the strategy to fill the required slots during the conversation is modelled as a sequence of actions (Austin 1962). These actions are so-called dialogue acts. A dialogue act is defined by its type (e.g. inform, query, confirm, and housekeeping) and by the list of arguments it can take. Each utterance corresponds to an action performed by an interlocutor. Table 3 shows the dialogue acts proposed by Young et al. (2010). For instance, the inform act is used to inform the user about its arguments, i.e. inform(food = "French", area = "riverside") informs the user that there is a French restaurant at the riverside area. On the other hand, the request act is used to request a value for a given list of slot-value pairs. Table 4 shows an example dialogue with the corresponding dialogue acts. Each user utterance is translated into a dialogue act, and each dialogue act of the dialogue system is translated into an utterance in natural language. For instance, the utterance "Hi, I am looking for somewhere to eat" corresponds to the act of "hello". The parameters describe the task that the user intends to solve, i.e. find a restaurant. For a formal description of dialogue acts, refer to Traum (1999) and Young (2007).

Table 3 Dialogue acts proposed by Young et al. (2010) (excerpt)
Dialogue act | Description
… | Implicitly confirm a = x, … and request value of d
select(a = x, a = y) | Select either a = x or a = y
affirm(a = x, b = y) | Affirm and give further info a = x, b = y, …
negate(a = x) | Negate and give corrected value a = x
deny(a = x) | Deny that a = x
bye() | Close a dialogue

Table 4 Example dialogue with the corresponding dialogue acts
Utterance | Dialogue act
U: … | inform(food = Italian, near = museum)
S: Roma is a nice Italian restaurant near the museum. | inform(name = "Roma", type = restaurant, food = Italian, near = museum)
U: Is it reasonably priced? | confirm(pricerange = moderate)
S: Yes, Roma is in the moderate price range. | affirm(name = "Roma", pricerange = moderate)
U: What is the phone number? | request(phone)
S: The number of Roma is 385456. | inform(name = "Roma", phone = "385456")
U: Ok, thank you goodbye. | bye()
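A dialogue act, as described above, consists of a type and a list of arguments. The following minimal sketch represents acts as a small data class and prints them in the notation used in the text; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueAct:
    act_type: str  # e.g. "inform", "request", "confirm", "bye"
    # Slot name -> value; the value is None for slots that are only named,
    # as in request(phone).
    args: Dict[str, Optional[str]] = field(default_factory=dict)

    def __str__(self):
        parts = [
            slot if value is None else f"{slot} = {value}"
            for slot, value in self.args.items()
        ]
        return f"{self.act_type}({', '.join(parts)})"


inform = DialogueAct("inform", {"food": "French", "area": "riverside"})
request = DialogueAct("request", {"phone": None})
bye = DialogueAct("bye")
```

Here `str(inform)` yields `inform(food = French, area = riverside)`, and a whole dialogue can be stored as a list of such acts alongside the utterances.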

Technologies
We have just seen that content and strategy are the two main aspects driving the structure of a dialogue, but their influence reaches down to the different functionalities making up a classic dialogue system architecture. This architecture is composed of several parts which are built around the idea of modelling the dialogue as a sequence of actions.
The central component is the so-called dialogue manager. It defines the dialogue policy, which consists in deciding which action to take at each dialogue turn. The input to the dialogue manager is the current state of the conversation. The output of the dialogue manager is a dialogue act, which represents the system's action. Other components convert the user's input into a dialogue act and the dialogue manager's output into a natural language utterance.
Usually, the user's input is processed by a natural language understanding (NLU) unit, which extracts the slots and their values from the utterance and identifies the corresponding dialogue act. This information is passed to the dialogue state tracker (DST), which infers the current state of the dialogue. Finally, the output of the dialogue manager is passed to a natural language generation (NLG) component.
Traditionally, these components were assembled into a pipelined architecture, but recent approaches based on trainable end-to-end neural networks offer a promising alternative. In the following, we briefly introduce the modules of the pipelined architecture and the deep neural network based approach.

Pipelined systems
Usually, these four components are put into a pipelined architecture, where the output of one component is fed as the input into the next component (see Fig. 2). The input to the dialogue system comes either from a chat interface or from an automatic speech recognition (ASR) system. Accordingly, the input to the NLU unit is the user's utterance in text format or, in the case of ASR, a list of the N-best transcriptions of the last user utterance.
Natural language understanding The goal of the natural language understanding (NLU) unit is to detect the slot-value pairs expressed in the current user utterance. Since the early 2000s, the natural language understanding task has often been seen as a set of subtasks (Tur and Mori 2011) as follows: (i) identification of the domain (if there are multiple domains), (ii) identification of intents (that is, the question type, the dialogue act, etc.) and (iii) identification of the slots, or concept detection.
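To illustrate how the output of one component feeds the next, the following toy pipeline chains NLU, state tracking, policy and NLG for text input. It is a deliberately simplistic sketch: the keyword vocabulary, the two slots and the response templates are assumptions made for illustration, and real systems use statistical or neural models for each step.

```python
def nlu(utterance):
    """Toy NLU: keyword spotting over an assumed vocabulary."""
    text = utterance.lower()
    slots = {}
    for food in ("italian", "french"):
        if food in text:
            slots["food"] = food
    for area in ("riverside", "centre"):
        if area in text:
            slots["area"] = area
    return {"act": "inform", "slots": slots}


def dst(state, user_act):
    """Toy state tracker: keep the most recent value seen for each slot."""
    new_state = dict(state)
    new_state.update(user_act["slots"])
    return new_state


def policy(state):
    """Toy dialogue manager: request the first missing slot, else inform."""
    for slot in ("food", "area"):
        if slot not in state:
            return ("request", slot)
    return ("inform", dict(state))


def nlg(system_act):
    """Toy template-based NLG."""
    act_type, content = system_act
    if act_type == "request":
        return f"What {content} would you like?"
    return (f"I found a restaurant serving {content['food']} food "
            f"in the {content['area']} area.")


def respond(state, utterance):
    """One pass through the pipeline: NLU -> DST -> policy -> NLG."""
    state = dst(state, nlu(utterance))
    return state, nlg(policy(state))


state, reply_1 = respond({}, "I want Italian food")
state, reply_2 = respond(state, "Near the riverside, please")
```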
In an utterance such as "I want to book a hotel room for Monday, 8th", the domain is hotel, the intent is hotel booking and the slot-value pair is date(Monday, 8th). The first two tasks are formalized as classification tasks, and any classification method may be used. For concept detection, one makes use of sequence labelling methods such as Conditional Random Fields (CRF) (Hahn et al. 2010) or recurrent neural networks, typically a bi-LSTM with a CRF layer (Yao et al. 2014; Mesnil et al. 2015). Recent methods propose to jointly learn the tasks of intent identification and concept detection (Guo et al. 2014; Zhang and Wang 2016). Usually, NLU is restricted to classifying the intents that lie within the domain for which the dialogue system is developed. Larson et al. (2019) introduce an out-of-scope intent classification task, where the NLU system is trained to detect if a user intent does not lie within the scope of the dialogue system's capabilities.
Dialogue state tracking The Dialogue State Tracker (DST) infers the current belief state of the conversation, given the dialogue history up to the current point t. The current belief state encodes the user's goal (e.g. which price range the user prefers) and the relevant dialogue history, i.e. it is an internal representation of the state of the conversation. It is important to take the previous belief states into account in order to handle misunderstandings. For instance, in Fig. 3, the confidence that the user wants an Italian restaurant is low. In the successive turn, the ASR system still gives low confidence to the Italian restaurant. However, since the state tracker takes into account that the Italian restaurant could have been mentioned in the previous turn, it assigns a higher overall probability to it.
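The effect described above, where two low-confidence observations of the same value add up to a higher overall belief, can be reproduced with a very simple update rule. The rule below, b_t(v) = 1 - (1 - b_{t-1}(v)) (1 - p_t(v)), is a hand-written heuristic chosen for illustration; it is an assumption and not the tracker behind Fig. 3, and it produces independent per-value beliefs rather than a normalized distribution.

```python
def update_belief(belief, confidences):
    """Accumulate evidence for slot values across turns.

    belief:      value -> belief after the previous turn
    confidences: value -> confidence assigned by ASR/NLU in the current turn
    Heuristic (assumption): treat turns as independent pieces of evidence,
    so the belief is the probability that at least one observation was right.
    """
    updated = dict(belief)
    for value, confidence in confidences.items():
        prior = belief.get(value, 0.0)
        updated[value] = 1.0 - (1.0 - prior) * (1.0 - confidence)
    return updated


# Two consecutive turns in which "italian" is recognised with low confidence.
belief = update_belief({}, {"italian": 0.4})
belief = update_belief(belief, {"italian": 0.4})
```

After the second turn the belief in "italian" is 0.64, higher than either single observation, which mirrors the behaviour described for the state tracker.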
The main challenge for the DST module is to handle the uncertainty which stems from the errors made by the ASR module and the NLU unit. Typically, the output of the DST unit is represented as a probability distribution over multiple possible dialogue states b(s), which provides a representation of the uncertainty. Generative methods have been widely used to manage this task, for example dynamic Bayesian networks (DBN) along with a beam search. Those methods have some limitations, which are widely discussed in Metallinou et al. (2013), the most important being that all the correlations in the input features have to be modelled (even the unseen cases).
Discriminative models were then proposed to overcome these limitations. Metallinou et al. (2013) proposed to use a linear classifier with the dialogue history present in the input features, whereas Henderson et al. (2013b) proposed to map the ASR hypotheses directly onto a dialogue state by means of recurrent neural networks. This way, both NLU and DST were integrated into a single function. Nowadays, neural approaches are becoming more and more popular.
Strategy The strategy is learned by the dialogue manager (DM). The input is the current belief state b(s) computed by the DST module. The DM generates the next action of the system, which is represented as a dialogue act. In other words, based on the current turn values and on the value history, the system performs an action (e.g. retrieve data from a database, ask for missing information, etc.). Deciding which action to take is part of the dialogue control.
In earlier systems, the dialogue control was based on a finite-state automaton in which the nodes represent the questions of the system and the transitions represent the user's possible answers. This method, while being rigid, is efficient when the domain and the task are simple. It has been widely used to design dialogue systems and many toolkits are available, such as the one from the Center for Spoken Language Understanding (Cole 1999) or VoiceXML. The main issue is the rigid dialogue structure as well as the tendency to be error-prone. In fact, such a system does not model discourse phenomena like ellipsis (a part of the sentence structure that can be inferred from the context is omitted) or anaphoric references (which can be resolved only in a given context).
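A finite-state dialogue controller of this kind can be written down as a transition table. The sketch below uses hypothetical states and prompts for a ticket-booking scenario; the wildcard "*" transition, which accepts any answer, is an illustrative convention.

```python
# Nodes are system questions; transitions are keyed by the user's answer.
FSA = {
    "ask_destination": {
        "prompt": "Where do you want to go?",
        "transitions": {"*": "confirm"},  # any answer moves on
    },
    "confirm": {
        "prompt": "Shall I book the ticket?",
        "transitions": {"yes": "book", "no": "ask_destination"},
    },
}
FINAL_STATES = {"book"}


def run(answers, start="ask_destination"):
    """Follow the automaton for a sequence of user answers.

    Returns the list of visited states. An answer with no matching
    transition leaves the automaton in the same state.
    """
    state = start
    visited = [state]
    for answer in answers:
        if state in FINAL_STATES:
            break
        transitions = FSA[state]["transitions"]
        state = transitions.get(answer.lower(), transitions.get("*", state))
        visited.append(state)
    return visited


path = run(["Milano", "no", "Torino", "yes"])
stuck = run(["Milano", "maybe"])
```

The second call shows the rigidity mentioned above: an answer the designer did not anticipate ("maybe") leaves the system stuck in the confirmation state.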
To overcome these shortcomings, a dialogue manager is designed to keep track of the interaction history and to control the dialogue strategy. This is called frame-based dialogue control and management. Frame-based techniques rely on schemas specifying what the system has to solve instead of representing what the system has to do and when. This makes the dialogue more flexible and makes it possible to handle errors (McTear et al. 2005; van Schooten et al. 2007).
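In contrast to a fixed sequence of questions, a frame-based controller only specifies which slots must be filled and derives the next action from whatever is still missing. The following is a minimal sketch for a hypothetical ticket-reservation frame; the slot names and the action format are assumptions.

```python
# The frame (schema) states what has to be solved, not in which order.
REQUIRED_SLOTS = ("origin", "destination", "date")


def next_action(frame):
    """Request the first unfilled slot, or book once the frame is complete."""
    for slot in REQUIRED_SLOTS:
        if frame.get(slot) is None:
            return ("request", slot)
    return ("book", dict(frame))


frame = {}
# The user may volunteer several slots at once and in any order.
frame.update({"destination": "Milano", "date": "Monday"})
first = next_action(frame)
frame.update({"origin": "Torino"})
second = next_action(frame)
```

Because the user's answers are simply merged into the frame, over-informative or out-of-order answers do not break the dialogue flow.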
Initially, dialogue managers were implemented using rule-based approaches. Once data had become available in sufficient amounts, data-driven methods were proposed for learning dialogue strategies from data. The dialogue is represented as a Markov decision process (MDP), following the intuition that a dialogue can be represented as a sequence of actions (Levin et al. 1998; Singh et al. 2000). These actions are referred to as speech acts or dialogue acts (Austin 1962; Searle 1969, 1975). However, MDPs cannot handle the uncertainty coming from speech recognition errors. Thus, partially observable MDPs (POMDPs) were adopted, as they introduce the belief state, which models the uncertainty of the current state (Paek 2006; Lemon and Pietquin 2012; Young et al. 2013). Although this alleviated the problem of hand-crafting the dialogue policy, the domain ontology still needs to be manually created. Furthermore, these dialogue systems are trained on a static and well-defined domain; once trained, the policy works only on this domain. Finally, the dialogue systems need large amounts of data to be trained efficiently, mostly using user simulation for training (Schatzmann et al. 2006). Beyond user simulations, Gašić et al. (2011) showed that online policy learning based on crowdsourcing is a valid alternative.
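To give an intuition for the MDP view, the sketch below defines a tiny, fully observable slot-filling MDP and solves it with value iteration. Everything in it is an illustrative assumption: the state is just the number of filled slots, asking succeeds with a fixed probability, and the reward values are made up. Value iteration is used here only because the model is fully known; the cited works learn policies with reinforcement learning, and the POMDP setting additionally replaces the state by a belief state.

```python
GAMMA = 0.95      # discount factor (assumed)
N_SLOTS = 2       # the task is solved once both slots are filled
P_SUCCESS = 0.8   # probability that asking fills the next slot (assumed)
STATES = range(N_SLOTS + 1)   # state = number of filled slots
ACTIONS = ("ask", "close")


def q_value(state, action, values):
    """Expected return of taking `action` in `state` under `values`."""
    if action == "close":
        # Closing ends the dialogue: reward for success, penalty otherwise.
        return 20.0 if state == N_SLOTS else -10.0
    # Asking costs one turn and fills the next slot with P_SUCCESS.
    next_state = min(state + 1, N_SLOTS)
    expected = P_SUCCESS * values[next_state] + (1 - P_SUCCESS) * values[state]
    return -1.0 + GAMMA * expected


def value_iteration(n_iterations=200):
    """Return state values and the greedy policy of the toy MDP."""
    values = {s: 0.0 for s in STATES}
    for _ in range(n_iterations):
        values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}
    return values, policy


values, policy = value_iteration()
```

The resulting policy asks until all slots are filled and then closes the dialogue; the per-turn cost is what pushes the policy towards short dialogues.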
To mitigate the issues arising from the lack of data, Gašić et al. (2011) applied Gaussian processes for POMDP-based optimization (Engel et al. 2005), which exploits the correlation between different belief states and speeds up the learning process. The authors showed that a reasonable policy can be learned with online user feedback after a few hundred dialogues. Gašić et al. (2013, 2014) showed that it is possible to adapt the policy if the domain is extended dynamically. Note also the work of Wang et al. (2015), which aims at enabling domain-transfer by introducing a domain-independent ontology parametrisation framework.
Natural language generation The natural language generation (NLG) module translates the dialogue act represented in a semantic frame into an utterance in natural language (Rambow et al. 2001). The task of NLG is usually divided into separate subtasks such as content selection, sentence planning, and surface realization (Stent et al. 2004). Traditionally, the task has been solved by relying on rule-based methods and canned texts. Statistical methods were also proposed and used, such as phrase-based NLG with statistical language models or CRF based on semantic trees (Dethlefs et al. 2013). Recently, deep learning techniques have become more prominent for NLG. With these techniques, there now exists a large variety of different network architectures, each addressing a different aspect of NLG: Wen et al. (2015) propose an extension to the vanilla LSTM (Hochreiter and Schmidhuber 1997) to control the semantic properties of an utterance, whereas Hu et al. (2017) use a variational autoencoder (VAE) and generative adversarial networks to control the generation of texts by manipulating the latent space; Mei et al. (2016) employ an encoder-decoder architecture extended by a coarse-to-fine aligner to solve the problem of content selection; Wen et al. (2016) apply data counterfeiting to generate out-of-domain training data for pretraining a model where there is little in-domain data available; Semeniuta et al. (2017) and Bowman et al. (2016) use a VAE trained in an unsupervised fashion on large amounts of data to sample texts from the latent space; and Dušek and Jurcicek (2016) use a sequence-to-sequence model with attention to generate natural language strings as well as deep syntax dependency trees from dialogue acts.

End-to-end trainable systems
Traditionally, task-oriented dialogue systems were designed along the pipelined architecture, where each module has to be designed, trained, and evaluated separately. There are several drawbacks to this approach. As the architecture is modular, each component needs to be designed separately, which involves lots of hand-crafting, the costly generation of annotated data for each module, and training each component. Furthermore, the pipelined architecture leads to the propagation and amplification of errors through the pipeline, as each module depends on the output of the previous module (Li et al. 2017b; Liu et al. 2018).
Related to the architecture, there is a credit assignment problem: as the dialogue system is evaluated as a whole, it is hard to determine which module is responsible for which reward. Furthermore, this architecture leads to interdependence among the modules, i.e. when one module is changed, all the subsequent modules need to be adapted as well (Zhao and Eskenazi 2016).
Finally, the slot-filling architecture, which is often used, makes these systems inherently hard to scale to new domains, since there is a need to hand-craft the representation of the state and action space.
To overcome these limitations, current research focuses on end-to-end trainable architectures where the dialogue system is trained as a single module. Wen et al. (2017) model the dialogue as a sequence-to-sequence mapping, where the traditional pipeline elements are modelled as interacting neural networks. The policy network takes as input the results from the intent network, the belief tracker network and the database operator, and selects the next action. Based on the selected action, the generation network produces the output utterance. Bordes et al. (2017) propose a set of synthetic tasks to evaluate the feasibility of end-to-end models in the task-oriented setting, for which they use a memory network to model the conversation. These approaches learn the dialogue policy in a supervised fashion from the data. In contrast, Li et al. (2017b) and Zhao and Eskenazi (2016) train the system using reinforcement learning. Note that all these approaches rely on huge amounts of training data.

Evaluation
The evaluation of task-oriented dialogue systems is built around the structured nature of the interaction. Two main aspects are evaluated, which have been shown to define the quality of the dialogue: task-success and dialogue efficiency. Two main families of evaluation methods have been proposed:
• User satisfaction modelling Here, the assumption is that the usability of the system can be approximated by the satisfaction of its users, which can be measured by questionnaires. These approaches aim to model the human judgements, i.e. creating models which give the same ratings as the human judges. First, a human evaluation is performed where subjects interact with the dialogue system. Afterwards, the dialogue system is rated via questionnaires. Finally, the ratings are used as target labels to fit a model based on objectively measurable features (e.g. task success rate, word error rate of the ASR system).
• User simulation Here, the idea is to simulate the behaviour of the users. There are two applications of user simulation: firstly, to evaluate a functioning system with the goal of finding weaknesses, and secondly, to serve as an environment to train a reinforcement-learning based system. The evaluation in the latter is based on the reward achieved by the dialogue manager under the user simulation.
Both these approaches rely on measuring task-success rate and dialogue efficiency. Before we introduce the methods themselves, we will go over the ways to measure performance along these two dimensions.
Task-success rate The goal or the task of the dialogue can be split into two parts (see Fig. 4) as follows:
• Set of Constraints, which define the target information to be retrieved. For instance, the specifications of the venue (e.g. a bar in the central area, which serves beer) or the travel route (e.g. ticket from Torino to Milano at 8pm).
• Set of Requests, which define what information the user wants. For instance the name, address and the phone number of the venue.
Fig. 4 ... 2007) and Walker et al. (1997). C_0 denotes the information constraints, i.e. which information is to be retrieved (a bar that serves beer in the city center). R_0 denotes the set of requests, i.e. the information the user wants (name, address, and phone number)
Table 5 Confusion matrix from Walker et al. (1997). For each key (e.g. depart-city) a confusion matrix is created, which denotes the expected values (rows) and the values produced by the dialogue system (columns). The maximum value of each column is represented in bold. For instance, if it was expected that the dialogue system returns the train schedule from Torino to Milano but it confused the depart-city with Verona, then this is counted as an error
The task-success rate measures how well the dialogue system fulfills the information requirements dictated by the user's goals. For instance, this includes whether the correct type of venue has been found by the dialogue system and whether the dialogue system returned all the requested information. One possibility to measure this is via a confusion matrix (see Table 5), which represents the errors made over several dialogues. Based on this representation, the Kappa coefficient (Carletta 1996) can be applied to measure the success (see Powers (2012) for Kappa shortcomings).
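To make the step from a confusion matrix to a Kappa value concrete, the following minimal sketch computes the standard Cohen's Kappa, i.e. the observed agreement corrected by the agreement expected by chance. The function name and the example matrix are illustrative assumptions, and the chance-agreement estimate used in the cited work may differ from the standard form shown here.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    Rows hold the expected values, columns the values produced by the
    dialogue system (same orientation as described for Table 5).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: share of cases on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement from the row and column marginals.
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(row_tot[i] * col_tot[i] for i in range(n)) / total ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts for a two-valued attribute.
example = [[20, 5],
           [10, 15]]
print(kappa(example))
```

A value of 1 indicates perfect agreement between expected and produced values, whereas a value of 0 indicates agreement at chance level.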
Dialogue efficiency Dialogue efficiency or dialogue costs are measures which are related to the length of the dialogue (Walker et al. 1997). For instance, the number of turns or the elapsed time are such measures. More intricate measures could include the number of inappropriate repair utterances or the number of turns required for a subdialogue to fill a single slot.
In the following, we introduce the most important research for both of the aforementioned evaluation procedures. Finally, we briefly cover the evaluation methods employed on the subsystems of the pipeline. However, the main focus of this review is the evaluation of the dialogue system's behaviour.

User satisfaction modelling
User satisfaction modelling is based on the idea that the usability of a system can be approximated by the satisfaction of its users. The research in this area is concerned with three goals: measure the impact of the properties of the dialogue system on the user satisfaction (explainability requirement), automate the evaluation process based on these properties (automation requirement), and use the models to evaluate different dialogue strategies (differentiability requirement). Usually, a predictive model is fit, which takes the properties as input and uses the human judgements as target variable, thus modelling the user satisfaction as either a regression or a classification task. There are different approaches to measure the user satisfaction, which are based on two questions: who evaluates the dialogue, and at which granularity is the dialogue evaluated? The first question allows for two groups: either the dialogue is evaluated by the users themselves or by objective judges. The second question allows for different points on a spectrum. On one end, the evaluation takes place at the dialogue level; on the other end, the evaluation takes place at the exchange level. The question of who evaluates the dialogue is often at the centre of discussion. Here, we briefly summarize the main points.
User or expert ratings There are three main criticisms regarding the judgments made by users:
• Reliability Evanini et al. (2008) state as a main argument that users tend to interpret the questions on the questionnaires differently, thus making the evaluation unreliable. Gašić et al. (2011) noted that even in the lab setting, where users are given a predefined goal, users tend to forget the task requirements and thus incorrectly assess the task success. Furthermore, in the in-field setting, where the feedback is given optionally, the judgements are likely to be skewed towards the positive interactions.
• Cognitive demand Schmitt and Ultes (2015) note that rating the dialogue puts additional cognitive demand on users. This is especially true if the evaluation has to be done at the exchange level, which would distort the judgments about the interaction.
• Impracticability Ultes et al. (2013) note the impracticability of having a user rate the live dialogue, as they would have to press a button on the phone or have a special installation to give feedback.
Ultes et al. (2013) analyzed the relation between the user ratings and ratings given by objective judges (called experts). In particular, they investigated whether the ratings from the experts could be used to predict the ratings of the users. Their results showed that the user ratings and the expert ratings are highly correlated, with a Spearman's ρ = 0.66 (p < 0.01). Thus, expert ratings can be used as replacement for user judgments. Furthermore, they trained classifiers using the expert ratings as targets and evaluated them against the user ratings. The best performing classifier achieved an unweighted average recall (UAR) of 0.34, compared to the best classifier trained on user satisfaction, which achieved UAR = 0.5. These results indicate that it is not possible to precisely predict the user satisfaction. However, the correlation scores show that the predicted scores of both models correlate equally with the user satisfaction (ρ = 0.6). Although the models cannot be used to exactly predict the user satisfaction, the authors showed that the expert ratings are strongly related to user ratings.
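Since the unweighted average recall (UAR) is less common than accuracy, a short sketch may help: UAR is the mean of the per-class recalls, so every rating class contributes equally regardless of how often it occurs. The function name and the toy ratings below are illustrative assumptions and are not taken from the cited study.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls over all classes present in y_true."""
    support = defaultdict(int)   # number of gold items per class
    correct = defaultdict(int)   # number of correctly predicted items per class
    for gold, pred in zip(y_true, y_pred):
        support[gold] += 1
        if gold == pred:
            correct[gold] += 1
    return sum(correct[c] / support[c] for c in support) / len(support)


# Hypothetical ratings on a 3-point scale; class 1 dominates the data.
gold = [1, 1, 1, 1, 2, 3]
pred = [1, 1, 1, 1, 3, 3]
print(unweighted_average_recall(gold, pred))
```

In this toy example five of six predictions are correct, yet the UAR is only 2/3 because the single item of class 2 is missed, which illustrates why UAR is preferred for skewed rating distributions.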
In the following, we present different approaches to user satisfaction modelling. We cover the most important research for each of the various categories.
PARADISE Framework PARADISE (PARAdigm for DIalog System Evaluation) (Walker et al. 1997) is the best-known evaluation framework proposed for task-oriented systems. It is a general framework, which can be applied to any task-oriented system, since it is domain-independent. It belongs to the evaluation methods which are based on user ratings at the dialogue level, although it allows for evaluations of sub-dialogues. Originally, the motivation was to produce an evaluation procedure which can distinguish between different dialogue strategies. At that time, the most widely used automatic approach was based on the comparison of utterances with a reference answer (Hirschman et al. 1990). Methods based on comparisons to reference answers suffer from various drawbacks: they cannot discriminate between different strategies, they are not capable of attributing the performance to system-specific properties, and the approach is not generalizable to other tasks.
The main idea of PARADISE is to combine different measures of performance into a single metric, and in turn assess the contribution of each of these measures to the final user satisfaction. PARADISE originally uses two objective measures for performance: task-success and measures that define the dialogue cost (as explained above).
An overview of the PARADISE framework is depicted in Fig. 5. The user interacts with the dialogue system and completes a questionnaire after the dialogue ends. From the questionnaire, a user satisfaction score is computed, which is used as the target variable. The input variables to the linear regression models are extracted from the logged conversation data. The extraction can be done automatically (e.g. for task-success as discussed above) or manually by an expert (e.g. for inappropriate repair utterances). Finally, a linear regression model is fitted to predict the user satisfaction for a given set of input variables.
Thus, PARADISE models the (subjective) performance of the system with a linear combination of objective measures (task-success and dialogue costs). Applying multiple linear regressions showed that only the task-success measure and the number of repetitions are significant. In a follow-up study, the authors further investigated PARADISE's ability to generalize to other systems and user populations and its predictive power. For this, they applied PARADISE on three different dialogue systems: ELVIS (a dialogue system for accessing emails), ANNIE (a dialogue system for voice dialing and messaging), and TOOT (a dialogue system for accessing train schedules). In a large-scale user study, they collected 544 dialogues over 42 h of speech. For these experiments, the authors worked with an extended number of quality measures: e.g. number of barge-ins (i.e. sudden interruption by the user), number of cancel operations, number of help requests. A survey at the end of the dialogue was used to measure the user satisfaction. The survey asked about various aspects: e.g. speech recognition performance, ease of the task, if the user would use the system again. Based on the survey, the user satisfaction score is computed and used as the target variable to train the PARADISE framework as described above. Table 6 shows the generalization scores of PARADISE for different scenarios.
According to these scores, we obtain the following observations:
• A linear regression model is fitted on 90% of the data and evaluated on the remaining 10%. The results show that the model is able to explain 50% of the variance (R² = 0.5), which is considered to be a good predictor by the authors.
• Training the regression model on the data for one system and evaluating the model on the data for another dialogue system (e.g. train on the ELVIS data and evaluate on the TOOT data) shows high variability. The evaluation on the TOOT system data yields much higher scores than evaluating on the ANNIE data. These results show that the model is able to generalize to data of other dialogue systems to a certain degree.
• The evaluation of the generalizability of the model across different populations of users yields a negative result. When trained on dialogue data from conversations by novice users (NOVICES), the linear model is not capable of predicting the scores by experienced users (ANNIE EXPERTS) of the dialogue system.
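The regression step of PARADISE and the R² score reported above can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the helper names (`zscore`, `fit_linear`, `r_squared`), the ordinary-least-squares solver, and the logged values are all hypothetical, and the model is scored on its own training data for brevity rather than on a held-out split.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalise a measure to zero mean and unit variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_linear(rows, target):
    """Ordinary least squares; returns [intercept, w_1, ..., w_k]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


def r_squared(target, predicted):
    """Share of the variance in target explained by the predictions."""
    m = mean(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, predicted))
    ss_tot = sum((t - m) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical logs: task success (kappa), number of turns, questionnaire score.
kappa_scores = [0.9, 0.4, 0.7, 1.0, 0.2, 0.6]
turns = [10, 25, 14, 8, 30, 18]
satisfaction = [4.5, 2.0, 3.5, 5.0, 1.5, 3.0]

rows = list(zip(zscore(kappa_scores), zscore(turns)))
weights = fit_linear(rows, satisfaction)
fitted = [predict(weights, r) for r in rows]
print(weights, r_squared(satisfaction, fitted))
```

Normalising the inputs makes the fitted weights comparable, so that their magnitudes indicate the relative contribution of each measure to the predicted user satisfaction.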
The PARADISE framework is not only able to find the factors which have the most impact on the rating, it is also capable of predicting the ratings. However, the experiments also revealed that the framework is not capable of distinguishing between different user groups. This result was confirmed by Engelbrecht et al. (2008), who tested the predictive power of PARADISE for individual users.
User satisfaction at the exchange level In contrast to rating the dialogue as a whole, in some cases it is important to know the rating at each point in time. This is especially useful for online dialogue breakdown detection. There are two approaches to modelling the user satisfaction at the exchange level: annotate dialogues at the exchange level either by users (Engelbrecht et al. 2009a) or by experts (Higashinaka et al. 2010; Schmitt and Ultes 2015). Different models can be fitted to the sequential data: Hidden Markov Models (HMM), Conditional Random Fields (CRF) or Recurrent Neural Networks are the most obvious choices, but SVM-based approaches are also possible. Engelbrecht et al. (2009a) model user satisfaction as a continuous process evolving over time, where the current judgment depends on the current dialogue events and the previous judgments. Users interacted with the dialogue system and judged the dialogue after each turn on a 5-point scale using a number pad. An HMM was trained based on these target values and annotated dialogue features. Some input features were manually annotated, which is not a reasonable setting for online breakdown detection. Higashinaka et al. (2010) modelled the evaluation similarly to Engelbrecht et al. (2009a). In their study, they evaluated different models (HMM and CRF) and different measures to evaluate the trained model, and addressed the question of subjectivity of the annotators. The input features to the model were the dialogue acts, and the target variables were the annotations by experts, who listened to the dialogue. The low inter-rater agreement and the fact of only using dialogue acts as inputs made the model perform only marginally better than the random baseline.
A different approach was taken by Hara (2010), who relied on dialogue-level ratings, but trained the model on n-grams of dialogue acts. More precisely, they used n consecutive dialogue acts as input features and the dialogue-level rating as target variable (on a 5-point scale and an extra class to denote an unsuccessful task). The model achieved an accuracy of only 34.4% using a 3-gram model. Further testing showed that the model is able to predict the task-success with an accuracy of 94.7%.
These approaches suffer from the following problems: they either rely on manual feature extraction, which is not useful for online breakdown detection or they used only dialogue acts as input features, which does not cover the whole dialogue complexity. Furthermore, the approaches had issues with data annotation, either having low inter-rater agreement or using dialogue-level annotation. Schmitt and Ultes (2015) addressed these issues by proposing Interaction Quality (see next paragraph) as approximation to user ratings at the exchange level.
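To illustrate what n-grams of dialogue acts look like as input features, the following minimal sketch counts all sequences of n consecutive dialogue acts in a dialogue. The function name and the dialogue-act labels are hypothetical, and the sketch covers only the feature extraction, not the rating model trained on top of it.

```python
from collections import Counter


def dialogue_act_ngrams(acts, n=3):
    """Count every run of n consecutive dialogue acts in one dialogue."""
    return Counter(tuple(acts[i:i + n]) for i in range(len(acts) - n + 1))


# Hypothetical dialogue-act sequence of a single dialogue.
acts = ["greet", "request", "inform", "request", "inform"]
features = dialogue_act_ngrams(acts, n=2)
print(features)
```

The resulting counts can be used as a sparse feature vector, with the dialogue-level rating as the target of any standard classifier.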
Interaction quality Interaction Quality is a metric proposed by Schmitt and Ultes (2015) with the goal of allowing the automatic detection of problematic dialogue situations. The approach is based on letting experts rate the quality of the dialogue at each point in time; the median rating of several expert ratings at the exchange level is called Interaction Quality. The experiments in this study were conducted using the Let's Go bus information system (Black and Eskenazi 2009). Figure 6 shows the overview of the Interaction Quality procedure. The user interacts with the dialogue system and the conversation's relevant data is logged. From the logs, the input variables are automatically extracted. The exchanges are manually rated by experts, and the target variable is derived from these ratings. Based on the input and target variables, a support vector machine (SVM) is fitted.
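The derivation of the target variable described above, i.e. the median of several expert ratings per exchange, can be sketched as follows. The function name and the ratings are illustrative assumptions.

```python
from statistics import median


def interaction_quality(ratings_per_expert):
    """Median expert rating for each exchange of one dialogue.

    ratings_per_expert holds one list per expert, each containing one
    rating per exchange (all lists have the same length).
    """
    return [median(exchange) for exchange in zip(*ratings_per_expert)]


# Hypothetical ratings by three experts for a dialogue with four exchanges.
ratings = [
    [5, 4, 3, 1],
    [5, 5, 2, 2],
    [4, 4, 3, 2],
]
print(interaction_quality(ratings))
```

With an odd number of experts the median is always one of the given ratings. With an even number, `statistics.median` averages the two middle values, so `statistics.median_low` may be preferable if the label must stay on the original scale.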
Interaction Quality is meant to approximate user satisfaction. In this study, the authors showed that Interaction Quality is an objective and valid approximation to user satisfaction, which is easier to obtain. This is especially important for in-field evaluations of dialogue systems, which are practically infeasible to be rated by users at the exchange level. Thus, it is important that in-field dialogues can be rated by experts at the exchange level. The challenge is to make sure that the ratings are objective, i.e. to eliminate the subjectivity of the experts as much as possible.
Since there is no possibility to gather user satisfaction scores at the exchange level from in-field conditions, the authors relied on user satisfaction scores from lab experiments and Interaction Quality scores over dialogues from both in-field and lab conditions. For the lab experiments, users interacted with the Let's Go bus information system (Black and Eskenazi 2009) and used a special device to rate the dialogue after each turn. These scores are referred to as user satisfaction. The dialogues were then rated by experts on the exchange level. These ratings are referred to as Interaction Quality. The authors found a strong correlation (Spearman's ρ = 0.66) between Interaction Quality and user satisfaction in the lab environment, which means that Interaction Quality is a valid substitute for user satisfaction. In order to assess if Interaction Quality is a valid measure for rating in-field conversations, experts rated 200 dialogues from the Let's Go Field Corpus (Schmitt et al. 2012) and measured the agreement among the experts. The experts achieved a strong correlation (Spearman's ρ = 0.72).
Fig. 6 Overview of the interaction quality procedure (Schmitt and Ultes 2015)
Based on these Interaction Quality scores, a predictive model is trained to automatically judge the dialogue at any point in time. In order to automatically predict Interaction Quality, the input variables need to be automatically extractable from the dialogue system. From each subsystem of a task-oriented dialogue system (Fig. 2), various values are extracted (AUTO features). Additionally, the authors experimented with hand-annotated features such as emotions (EMO) and user-specific features (USER), such as age or gender, as well as semi-automatically annotated data such as the dialogue acts (similar to Higashinaka et al. 2010). Based on these input variables, the authors trained various SVMs, one for each target variable, namely Interaction Quality for both the in-field and the lab data as well as the user satisfaction label for the lab data. Table 7 shows the scores achieved for the various target variables and input feature groups.
The in-field Interaction Quality model (IQ_field) achieves a correlation of ρ = 0.776 to the human judges based on the automatically extracted features; with the ASR features alone, the correlation score lies at ρ = 0.753. The addition of the emotional and user-specific features does not increase the scores significantly. A similar behaviour is measured for the lab Interaction Quality model (IQ_lab), which achieves high scores with ASR features alone (ρ = 0.856) and profits only marginally from the inclusion of the emotional features. However, the model improves when including user-specific features (ρ = 0.894). The lab-based user satisfaction model (US_lab) achieves lower scores, with ρ = 0.668 for the automatic features. Table 8 shows the cross-model evaluation. The IQ_field model can be used to predict IQ_lab labels and vice versa (ρ ≈ 0.66). Furthermore, the IQ_lab model is able to predict the US_lab variable. These results show that Interaction Quality is a good substitute for user satisfaction and that the models based on Interaction Quality yield high predictive performance when trained on the automatically extracted features. This allows evaluating an ongoing dialogue in real time at the exchange level and ensures high correlation to the actual user satisfaction.
Table 7 Model performance (in terms of ρ) on the test set (Schmitt and Ultes 2015). ASR denotes the features by the automatic speech recognition system. AUTO denotes automatically extracted features from the dialogue system pipeline (e.g. dialogue acts). EMO denotes features that capture the user's emotions (e.g. anger). USER denotes user-specific features (e.g. age, gender)

User simulation
User Simulators (US) are tools that are designed to simulate the user's behaviour. There are two main applications for US: (1) for training the dialogue manager in an offline environment, and (2) to evaluate the dialogue policy.
Training environment User Simulations are used as a learning environment to train reinforcement-learning based dialogue managers. They mitigate the problem of recruiting humans to interact with the systems, which is both time- and cost-intensive. There is a vast amount of literature on designing User Simulations as training environments; for a comprehensive survey, refer to Schatzmann et al. (2006). There are several considerations to be made when building a User Simulation.
• Interaction level Does the interaction take place at the semantic level (i.e. on the level of dialogue acts) or at the surface level (i.e. using natural language understanding and generation)?
• User goal Does the user goal stay fixed throughout the interaction, or can it change? It is more realistic to model these changes as well.
• Error model Whether and how to realistically model the errors made by the components of the dialogue system.
• Evaluation of the user simulation For a discussion on this topic refer to Pietquin and Hastie (2013). There are two main evaluation strategies: direct and indirect evaluation. The direct evaluation of the simulation is based on metrics (e.g. precision and recall on dialogue acts, perplexity). The indirect evaluation measures the utility of the user simulation (e.g. by evaluating the trained dialogue manager).
The most popular approach to user simulation is the agenda-based user simulation (ABUS). The simulation takes place at the semantic level, the user goal stays fixed throughout the interaction, and the user behaviour is represented as a priority-ordered stack of necessary user actions. The ABUS was evaluated using indirect methods, by performing a human study on a dialogue system trained with the ABUS. The results show that the DS achieved an average task success rate of 90.6% based on 160 dialogues. The ABUS system works by randomly generating a hidden user goal (i.e. the goal is unknown to the dialogue system), which consists of constraints and request slots. From this goal, the ABUS system generates a stack of dialogue acts in order to reach the goal, which is the agenda. During the interaction with the dialogue system, the ABUS adapts the stack after each turn, e.g. if the dialogue system misunderstood something, the ABUS system pushes a negation act onto the stack.
Similar to other aspects of dialogue systems, more recent work is based on neural networks. Kreyssig et al. (2018) propose the Neural User Simulator (NUS), an end-to-end trainable architecture based on neural networks. The system performs the interaction on the surface level instead of the semantic level, it considers variable user goals during training, and the evaluation is performed indirectly. The indirect evaluation is performed from two different perspectives. First, the dialogue system trained with the NUS is compared to a dialogue system trained with the ABUS in the context of a human evaluation. Here, the authors report the average reward and the success rate. In both cases the NUS-trained system performs significantly better. The second evaluation is performed in a cross-model evaluation (Schatzmann et al. 2005), i.e. the NUS-trained dialogue system is evaluated using the ABUS system and vice versa. Here, the NUS system performed significantly better as well. This indicates that the NUS system is diverse and realistic.
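The agenda mechanism of the ABUS described above can be illustrated with a small sketch: a goal made of constraints and requests is turned into a stack of dialogue acts, and the stack is adapted after each system turn. This is a deliberately simplified illustration and not the published ABUS; the class name, the act representation as (type, slot, value) tuples, the update rules, and the fixed example goal (instead of a randomly sampled one) are all assumptions.

```python
class AgendaUser:
    """Toy agenda-based user; the last list element is the top of the stack."""

    def __init__(self, constraints, requests):
        self.constraints = dict(constraints)
        self.requests = list(requests)
        # Build the agenda bottom-up: closing act, requests, then constraints.
        self.agenda = [("bye", None, None)]
        self.agenda += [("request", slot, None) for slot in reversed(self.requests)]
        self.agenda += [
            ("inform", slot, value)
            for slot, value in reversed(list(self.constraints.items()))
        ]

    def next_act(self):
        """Pop the next user act; keep saying bye once the agenda is empty."""
        return self.agenda.pop() if self.agenda else ("bye", None, None)

    def observe(self, system_act):
        """Adapt the agenda after a system act."""
        kind, slot, value = system_act
        if kind == "confirm" and slot in self.constraints \
                and self.constraints[slot] != value:
            # The system misunderstood: correct the value, but negate first.
            self.agenda.append(("inform", slot, self.constraints[slot]))
            self.agenda.append(("negate", slot, value))
        elif kind == "inform" and slot in self.requests:
            # The request was answered, so it is removed from the agenda.
            self.agenda = [a for a in self.agenda if a != ("request", slot, None)]


# Hypothetical goal: a place serving beer in the centre; ask for name and phone.
user = AgendaUser({"food": "beer", "area": "centre"}, ["name", "phone"])
print(user.next_act())
user.observe(("confirm", "food", "wine"))
print(user.next_act())
```

In a reinforcement-learning setting, such a simulated user would be wrapped in an environment loop that also assigns a reward, typically based on task success and the number of turns.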
Model based evaluation The idea of model based evaluation is to model the user behaviour but to put more emphasis on modelling a large variety of behavioural aspects. Here, the focus does not lie in the shaping of rewards for reinforcement learning; rather, the focus lies on understanding the effects of different types of behaviour on the quality of the interaction. Furthermore, the goal is to gain insights on the effects of adapting a dialogue strategy, i.e. evaluate the changes made to the dialogue system. Engelbrecht et al. (2009b) introduced the MeMo workbench, which allows the modelling of user simulations. The main focus is to model different types of users and typical errors the users make. Möller et al. (2006) introduced various types of conceptual errors which users tend to make. These errors arise from the discrepancy between how the user expects the system to behave and the actual system behaviour. For instance:
• State errors arise when the user input cannot be interpreted in the current state, but might be interpretable in a different state.
• Capability errors arise when the system cannot execute the user's commands due to missing capability.
• Modelling errors arise due to discrepancies in how the user and the system model the world. For instance, when the system presents a list of options and allows the elements in the list to be addressed by their positions, but the user addresses them by their names.
On the other hand, the workbench allows the definition of various user groups based on different characteristics of a user. The characteristics used in Engelbrecht et al. (2009b) include: affinity to technology, anxiety, problem solving strategy, domain expertise, age and deficits (e.g. hearing impairment). Behavioural rules are associated with each of the characteristics. For instance, a user with high domain expertise might use a more specific vocabulary. The rules are manually curated and are engineered to influence the probabilities of user actions. During the interaction, the user model selects a task to solve, similar to the aforementioned approaches for reinforcement-learning environments. In order to evaluate the user simulation, the authors compared the results of an experiment conducted with real users to the experiments conducted with the MeMo workbench. This evaluation procedure is aimed at finding whether the simulation yields the same insights as a user study. For this, they invited users from two user groups, namely older and younger users. The participants interacted with two versions of a smart-home device control system; the versions differed in the way they provide help to the users. The comparison between the user simulation and the user study results was done at various levels:
• High-level features, such as concept error rates or the average number of semantic concepts per user turn (#AVP). Here, the results show that although the simulation was not always able to recreate the absolute values, it was able to replicate the relative results. This is helpful, as it would lead to the same conclusions for the same questions.
• User judgment prediction, which is based on a predictive model trained using the PARADISE framework. Here, the authors compared the real user judgments to the predicted judgments (where the linear model predicted the judgments of the simulated dialogue). Again, the results show that the user model would yield the same conclusions as the user study, namely that young users rated the system higher than the older users and that old users judged the dynamic help system worse than the other.
• Precision and Recall of predicted actions. Here, the simulation is used to predict the next user action for a given context from a dialogue corpus. The predicted user action is compared to the real user action, and based on this, precision and recall are computed. The results show that precision and recall are relatively low.
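The precision and recall of predicted user actions mentioned in the last point can be computed as in the following sketch, which treats the actions of each turn as a set and pools the counts over all turns. The function name, the pooling over turns, and the action labels are assumptions made for illustration; the cited study may compute the scores differently.

```python
def precision_recall(predicted_turns, actual_turns):
    """Pooled precision and recall of predicted user actions over all turns."""
    tp = fp = fn = 0
    for predicted, actual in zip(predicted_turns, actual_turns):
        predicted, actual = set(predicted), set(actual)
        tp += len(predicted & actual)   # predicted and observed
        fp += len(predicted - actual)   # predicted but not observed
        fn += len(actual - predicted)   # observed but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical simulated versus real user actions for three turns.
simulated = [{"inform(food)"}, {"request(name)", "request(phone)"}, {"bye"}]
real = [{"inform(food)", "inform(area)"}, {"request(name)"}, {"negate"}]
print(precision_recall(simulated, real))
```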
The model-based user simulations are designed with the idea of allowing the evaluation of a dialogue system early in the development stage. Furthermore, they emphasize the need of interpretability, i.e. being able to understand how a certain change in the dialogue system influences the quality of the dialogue. This lies in contrast to the user simulations for reinforcement learning, which are aimed at training a dialogue system and use the reward as a measure of quality. However, the reward is often only based on the task success and the number of turns.

Subsystems evaluation
This section briefly outlines the different evaluation metrics employed on each subsystem of a pipelined Dialogue System, namely Natural Language Understanding, Dialogue State Tracker and Natural Language Generation systems.
Natural language understanding (NLU) Since NLU is often cast as a classification task, NLU systems are often evaluated in the literature with regard to classification-based metrics. There are three widely used metrics (Tur and De Mori 2011): Sentence Level Semantic Accuracy (SLSA), Slot Error Rate (SER) (also called Concept Error Rate (CER)), and F-measures. The SLSA measures the rate of sentences where the intents are correctly classified. The SER metric measures the rate of inserted, deleted or substituted concepts with respect to the annotated concepts as a reference. Finally, the F-measures are computed from the precision and recall of the detected slots. In early systems, the distance between hypothesized sentences and reference ones is calculated with a Levenshtein distance (Levenshtein 1966) or using the Word Error Rate (Chotimongkol and Rudnicky 2001), which fail to capture the semantic similarities of utterances.
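As a minimal sketch of two of the NLU metrics above, the slot error rate can be obtained from an edit distance over concept sequences (the minimal number of insertions, deletions and substitutions, divided by the number of reference concepts), and the sentence-level accuracy from a direct comparison of intents. The function names and the example annotations are assumptions; published evaluation scripts may align concepts differently.

```python
def edit_distance(reference, hypothesis):
    """Minimal number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_item != hyp_item),   # substitution or match
            ))
        previous = current
    return previous[-1]


def slot_error_rate(reference_slots, hypothesis_slots):
    """Edit operations on concepts relative to the number of reference concepts."""
    return edit_distance(reference_slots, hypothesis_slots) / len(reference_slots)


def sentence_level_accuracy(reference_intents, hypothesis_intents):
    """Rate of sentences whose intent is classified correctly."""
    hits = sum(r == h for r, h in zip(reference_intents, hypothesis_intents))
    return hits / len(reference_intents)


# Hypothetical annotation: one substituted and one deleted concept.
reference = ["depart-city=Torino", "arrival-city=Milano", "time=8pm"]
hypothesis = ["depart-city=Verona", "arrival-city=Milano"]
print(slot_error_rate(reference, hypothesis))
```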
Dialogue state trackers (DST) DST usually report a probability distribution over the possible next states. In order to measure the performance of such systems, accuracy and L2 metrics are widely used (Metallinou et al. 2013; Henderson et al. 2014; Mrkšić et al. 2017). Accuracy measures whether the state hypothesis with the highest probability is the correct one. Having a high accuracy is crucial because DST systems must commit to a single interpretation of the user's needs. The L2 metric captures how well calibrated the output probabilities are, which is important when multiple dialogue states are considered in action selection.
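Both DST metrics can be sketched in a few lines, assuming that the tracker output for each turn is a dictionary mapping state hypotheses to probabilities. Here, the L2 metric is taken to be the Euclidean distance between the output distribution and a one-hot vector for the correct state; the function names and the example distributions are illustrative assumptions.

```python
import math


def top1_accuracy(distributions, gold_states):
    """Share of turns where the most probable hypothesis is the correct state."""
    hits = sum(
        max(dist, key=dist.get) == gold
        for dist, gold in zip(distributions, gold_states)
    )
    return hits / len(gold_states)


def l2_distance(distribution, gold_state):
    """Euclidean distance between the output and the one-hot gold vector."""
    states = set(distribution) | {gold_state}
    return math.sqrt(sum(
        (distribution.get(s, 0.0) - (1.0 if s == gold_state else 0.0)) ** 2
        for s in states
    ))


# Hypothetical tracker output for two turns over the value of one slot.
outputs = [{"beer": 0.7, "wine": 0.3}, {"beer": 0.4, "wine": 0.6}]
gold = ["beer", "beer"]
print(top1_accuracy(outputs, gold))
print([l2_distance(d, g) for d, g in zip(outputs, gold)])
```

A tracker can reach the same accuracy with very different L2 values: a confident correct hypothesis yields a distance close to 0, whereas a barely winning one yields a much larger distance.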
Natural language generation (NLG) NLG systems translate the dialogue act into natural language. The dialogue act is composed of slot-value pairs, which the NLG system renders. The evaluation focuses on two aspects: the correctness of the content and the quality of the surface realization. For the correctness, the F1 score is used (Mei et al. 2016), as well as the slot error rate (i.e. the ratio of slots which have been rendered incorrectly). For the quality of the surface realization, word overlap metrics are used (e.g. BLEU (Papineni et al. 2002) or ROUGE (Lin 2004)). However, since the automated metrics do not necessarily capture all aspects of the output's quality, a human evaluation is usually performed, which asks about the naturalness and quality of the generated utterance (Dušek et al. 2020).

Characteristics
Conversational dialogue systems (also referred to as chatbots and social bots) are usually developed for unstructured, open-domain conversations with their users. They are often not developed with a specific goal in mind, other than to maintain an engaging conversation with the user (Zhou et al. 2018). These systems are usually built with the intention to mimic human behaviour, which is traditionally assessed by the Turing Test (more on this later). However, conversational dialogue systems might also be developed for practical applications. "Virtual Humans", for instance, are a class of conversational agents developed for training or entertainment purposes. They mimic certain human behaviours for specific situations. For instance, a Virtual Patient mimics the behaviour of a patient, which is then used to train medical students (Kenny et al. 2009; Mazza et al. 2018). Early versions of conversational agents stem from the psychology community with ELIZA (Weizenbaum 1966) and PARRY (Colby 1981). ELIZA was developed to mimic a Rogerian psychologist, whereas PARRY was developed to mimic a paranoid mind.
Modelling approaches Generally, there are two main approaches for modelling a Conversational dialogue system: rule-based systems and corpus-based systems.
Early systems, such as ELIZA (Weizenbaum 1966) and PARRY (Colby 1981) are based on a set of rules which determine their behaviour. ELIZA works on pattern recognition and transformation rules, which take the user's input and apply transformations to it in order to generate responses.
Recently, conversational dialogue systems have gained renewed attention in the research community, as shown by the recent effort to generate and collect data for the (RE-)WOCHAT workshops. 7 This renewed attention is motivated by the opportunity of exploiting large amounts of dialogue data (see Serban et al. (2018) for an extensive study as well as Sect. 6) to automatically author a dialogue strategy that can be used in conversational systems such as chatbots (Banchs and Li 2012; Charras et al. 2016). Most recent approaches train conversational agents in an end-to-end fashion using deep neural networks, which mostly rely on the sequence-to-sequence architecture (Sutskever et al. 2014).
In the following, we focus on the corpus-based approaches used to model conversational agents. First, we describe the general concepts, and then the technologies used to implement conversational agents. Finally, we cover the various evaluation methods which have been developed in the research community.

Modelling conversational dialogue systems
Generally, there are two different strategies to exploit large amounts of data:
• Utterance selection Here, the dialogue is modelled as an information retrieval task. A set of candidate utterances is ranked by relevance. The dialogue structure is thus defined by the utterances in a dialogue database (Lee et al. 2009). The idea is to retrieve the most relevant answer to a given utterance, thus learning to map multiple semantically equivalent user-utterances to an appropriate answer.
• Generative models Here, the dialogue systems are based on deep neural networks, which are trained to generate the most likely response to a given conversation history. Usually, the dialogue structure is learned from a large corpus of dialogues. Thus, the corpus defines the dialogue behaviour of the conversational agent.
7 See http://workshop.colips.org/re-wochat/ and http://workshop.colips.org/wochat/.
Utterance selection methods can be interpreted as an approximation to generative methods. This approach is often used for modelling the dialogue system of Virtual Humans. Usually, the dialogue database is manually curated and the dialogue system is trained to map different utterances of the same meaning to the same response utterance. Another application of utterance selection is the integration of different systems (Zhou et al. 2018). Here, the utterance selection system selects from a candidate list, which is composed of the outputs of different subsystems. Thus, given a set of dialogue systems, the utterance selection module is trained to select the most suitable output for the given context from the various dialogue systems. This approach is especially interesting for dialogue systems that work on a large number of domains and incorporate a large number of skills (e.g. set alarm clock, report the news, return the current weather forecast). Here, we present the technologies for corpus-based approaches, namely the neural generative models and the utterance selection models.

Neural generative models
The architectures are inspired by the machine translation literature (Ritter et al. 2011), especially neural machine translation. Neural machine translation models are based on the Sequence to Sequence (seq2seq) architecture (Sutskever et al. 2014), which is composed of an encoder and a decoder, both usually based on a Recurrent Neural Network (RNN). The encoder maps the input into a latent representation on which the decoder is conditioned. Usually, the latent representation of the encoder is used as the initial state of the recurrent cell in the decoder. The earliest approaches were proposed by Shang et al. (2015) and Vinyals and Le (2015), who trained a seq2seq model on a large amount of dialogue data (in the order of 10^6 exchanges). There are two fundamental weaknesses with these neural conversational agents. Firstly, they do not take into account the context of the conversation. Since the encoder only reads the current user input, all previous states are ignored. This leads to dialogues in which the dialogue system does not refer to previous information, which might result in nonsensical dialogues. Secondly, the models tend to generate generic answers that follow the most common pattern in the corpus. This renders the dialogue monotonous and in the worst case leads to repeating the same answer, regardless of the current input. We briefly discuss these two aspects in the following section.
Context The context of the conversation is usually defined as the previous turns in the conversations. It is important to take these into account as they contain information relevant to the current conversation. Sordoni et al. (2015) proposed to model the context by adding the dialogue history as a bag-of-words representation. The decoder is then conditioned on the encoded user utterance and the context representation. An alternative approach was proposed by Serban et al. (2016), who proposed the hierarchical-encoder decoder architecture (HRED), shown in Fig. 7, which works in three steps: 1. A turn-encoder (usually a recurrent neural network) encodes each of the previous utterances in the dialogue history, including the last user utterance. Thus, for each of the preceding turns a latent representation is created. 2. A context-encoder (a recurrent neural network) takes the latent turn representations as input and generates a context representation. 3. The decoder is conditioned on the latent context representation and generates the final output.
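The three steps above can be sketched structurally as follows. This is a toy illustration under strong assumptions and not the HRED model itself: the recurrent cell uses fixed, untrained weights, the function `embed` is a hypothetical deterministic stand-in for learned word embeddings, and the decoder is omitted. The sketch only shows how a turn-level encoder and a context-level encoder are stacked.

```python
import math
from typing import List

Vector = List[float]
DIM = 4  # size of all latent representations (arbitrary choice)


def embed(token: str) -> Vector:
    """Hypothetical deterministic stand-in for a learned word embedding."""
    seed = sum(ord(ch) for ch in token)
    return [math.sin(seed * (i + 1)) for i in range(DIM)]


def rnn_encode(inputs: List[Vector]) -> Vector:
    """Toy recurrent encoder with fixed weights: h_t = tanh(0.5 * h_{t-1} + 0.5 * x_t)."""
    state = [0.0] * DIM
    for x in inputs:
        state = [math.tanh(0.5 * h + 0.5 * v) for h, v in zip(state, x)]
    return state


def hred_context(history: List[str]) -> Vector:
    # Step 1: the turn encoder produces one latent vector per previous utterance.
    turn_vectors = [rnn_encode([embed(tok) for tok in turn.split()]) for turn in history]
    # Step 2: the context encoder runs over the sequence of turn vectors.
    context = rnn_encode(turn_vectors)
    # Step 3: a decoder (not shown) would be conditioned on this context vector.
    return context


print(hred_context(["hi there", "how are you"]))
```

Because the context encoder consumes all turn vectors, changing an earlier turn changes the representation on which the decoder is conditioned, which is the property missing from a plain seq2seq model that only reads the last user input.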
The HRED architecture is used as a basis for more complex neural architectures for dialogue systems, such as the multi-resolution recurrent neural network (MrRNN), which extends the HRED architecture by adding encoders that capture different levels of granularity (e.g. entity level, word level, or action level). Furthermore, the HRED encoder is used to generate the representation for the context in the utterance selection models (see Sect. 4.2.2).
Variability There are two main approaches to dealing with the issue of repetitive and universal responses:
• Adapt the loss functions. The main idea is to adapt the loss function in order to penalize generic responses and promote more diverse responses. Li et al. (2016a) propose two loss functions based on maximum mutual information: one is based on an anti-language model, which penalizes high-frequency words; the other is based on the probability of the source given the target. Li et al. (2016b) propose to train the neural conversational agent using the reinforcement-learning framework. This allows the agent to learn a policy that can plan in advance and generate more meaningful responses. The major focus is the reward function, which encapsulates various aspects: ease of answering (reduce the likelihood of producing a dull response), information flow (penalize answers that are semantically similar to a previous answer given), and semantic coherence (based on the mutual information).
• Condition the decoder. The seq2seq models perform a shallow generation process. This means that each sampled word is only conditioned on the previously sampled words. There are two methods for conditioning the generation process: condition on stochastic latent variables or on topics. Serban et al. (2017c) enhance the HRED model with stochastic latent variables at the utterance level and on the word level. At the decoding stage, first the latent variable is sampled from a multivariate normal distribution and then the output sequence is generated. Xing et al. (2017) add a topic-attention mechanism in their generation architecture, which takes as inputs topic words which are extracted using the Twitter LDA model (Zhao et al. 2011). The work by Ghazvininejad et al. (2018) extends the seq2seq model with a Facts Encoder. The "facts" are represented as a large collection of raw texts (Wikipedia, Amazon reviews, etc.), which are indexed by named entities.
Fig. 7 Overview of the HRED architecture. There are two levels of encoding: (i) the utterance encoder, which encodes a single utterance and (ii) the context encoder, which encodes the sequence of utterance encodings. The decoder is conditioned on the context encoding

Utterance selection methods
Utterance selection methods generally try to devise a similarity measure that measures the similarity between the dialogue history and the candidate utterances. There are roughly three different types of such measures:
• Surface form similarity. This measures the similarity at the token level. This includes measures such as Levenshtein distance, METEOR (Lavie and Denkowski 2009), or TF-IDF retrieval models (Charras et al. 2016; Dubuisson Duplessis et al. 2016). For instance, Dubuisson Duplessis et al. (2017) propose an approach that exploits recurrent surface text patterns to represent dialogue utterances.
• Multi-class classification task. These methods model the selection task as a multi-class classification problem, where each candidate response is a single class. For instance, Gandhe and Traum (2013) model each utterance as a separate class, and the training data consists of utterance-context pairs on which features are extracted. Then a perceptron model is trained to select the most appropriate response utterance. This approach is suitable for applications with a small number (∼100) of candidate answers.
• Neural network based approaches. Neural network architectures were introduced to leverage large amounts of training data. Usually, they are based on a siamese architecture, where both the current utterance and a candidate response are encoded. Based on this representation a binary classifier is trained to distinguish between relevant and irrelevant responses. One well-known example is the dual encoder architecture proposed by Lowe et al. (2017b). Dual Encoders transform the user input and a candidate response into a distributed representation. Based on the two representations a logistic regression layer is trained to classify the pair of utterance and candidate response as either relevant or not. The softmax score of the relevant class is used to sort the candidate responses. The authors experimented with different neural network architectures for modelling the encoder, such as recurrent neural networks or long short-term memory networks (LSTM) (Hochreiter and Schmidhuber 1997).
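As an illustration of the surface-form family of measures, the following minimal sketch selects a response by TF-IDF cosine similarity between the user utterance and the utterances stored in a small dialogue database. The database format (pairs of stored utterance and response), the smoothed IDF formula, the whitespace tokenisation, and all function names are assumptions made for this example; they are not taken from any of the cited systems.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_index(utterances: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build TF-IDF vectors for the stored utterances and return them with the IDF table."""
    tokenised = [u.lower().split() for u in utterances]
    n_docs = len(tokenised)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenised]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def select_response(user_utterance: str, database: List[Tuple[str, str]]) -> str:
    """Return the response paired with the stored utterance most similar to the input."""
    vectors, idf = tfidf_index([stored for stored, _ in database])
    counts = Counter(user_utterance.lower().split())
    query = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}  # unseen words get weight 0
    scores = [cosine(query, vec) for vec in vectors]
    best = max(range(len(database)), key=scores.__getitem__)
    return database[best][1]


database = [("what is your name", "I am a bot."),
            ("how is the weather today", "It is sunny.")]
print(select_response("tell me the weather", database))
```

The neural approaches described above replace the hand-crafted TF-IDF vectors with learned encoders, but the ranking step over candidate responses has the same shape.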

Evaluation methods
Automatically evaluating conversational dialogue systems is an open problem. The difficulty in automating this step can be attributed to the characteristics of the conversational dialogue system. Without a clearly defined goal or task to solve, and a lack of structure in the dialogues, it is not clear which attributes of the conversation are relevant to measure the system's quality. Two common approaches to assess the quality of a conversational dialogue system are to measure the appropriateness of its responses, or to measure the human likeness thereof. Both these approaches are very coarse-grained and might not reveal the complete picture. Nevertheless, most approaches in evaluation follow these principles. Depending on the characteristics of a specific dialogue system, more fine-grained approaches to evaluation can be applied, which measure the capability of the specific characteristic. For instance, a system built to increase the variability of its answers might be evaluated based on lexical complexity measures (such as type-token ratio or lexical density; for a more in-depth discussion, please refer to Lu (2012)). In the following, we introduce the automated approaches for evaluating conversational dialogue systems. In the first part, we discuss the general metrics that can be applied to both the generative models as well as the selection-based models. We then survey the approaches specifically designed for the utterance selection approaches, as they can exploit various metrics from information retrieval.
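As an example of such a lexical complexity measure, the type-token ratio mentioned above can be computed over all responses of a system as sketched below. Whitespace tokenisation and lower-casing are simplifying assumptions, and the function name is hypothetical.

```python
from typing import List


def type_token_ratio(responses: List[str]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    tokens = [tok for response in responses for tok in response.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# A system that repeats the same dull answer scores lower than a varied one.
print(type_token_ratio(["i do not know", "i do not know"]))
print(type_token_ratio(["i like jazz", "the weather is nice"]))
```

Note that the ratio depends on the amount of text, so systems should be compared on samples of similar size.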

General metrics for conversational dialogue systems
There are generally two levels at which a conversational dialogue system can be evaluated: coarse-grained and fine-grained. Coarse-grained evaluations are based on two concepts: the adequacy (or appropriateness) of the responses generated or selected by the dialogue system, and the human likeness thereof. Fine-grained evaluations, on the other hand, focus on specific behaviours that a dialogue system should manifest. Here, we focus on the methods devised for coherence and the ability to maintain the topic of a conversation. In the following, we give an overview of the methods that have been designed to automatically evaluate the above dimensions.
Appropriateness This is a coarse-grained concept to evaluate a dialogue, as it encapsulates many finer-grained concepts, e.g. coherence, relevance, or correctness, among others. There are two main approaches in the literature: word-overlap based metrics and methods based on predictive models inspired by the PARADISE framework (see Sect. 3.4.1).
• Word-overlap metrics These metrics were originally proposed by the machine translation and the summarization community. They were initially a popular choice of metrics for evaluating dialogue systems, as they are easily applicable. Popular metrics such as the BLEU score (Papineni et al. 2002) and ROUGE (Lin 2004) were used as an approximation for the appropriateness of an utterance. However, Liu et al. (2016) showed that neither of these word-overlap based scores correlates with human judgments.
Based on the criticism of the word-overlap metrics, several new metrics have been proposed. One proposal is to include human judgments in the BLEU score, which yields a metric called ΔBLEU. The human judges rated the reference responses of the test set according to the relevance to the context. The ratings are used to weight the BLEU score to reward high-rated responses and penalize low-rated responses. The correlation to human judgments was measured by means of Spearman's ρ. ΔBLEU has a correlation of ρ = 0.484, which is significantly higher than the correlation of the BLEU score, which lies at ρ = 0.318. Although this increases the correlation of the metric to the human judgments, this procedure requires human judges to label the reference sentences.
• Trained metrics Lowe et al. (2017a) present an automatic dialogue evaluation model (ADEM), a recurrent neural network trained to predict appropriateness ratings by human judges. The human ratings were collected via Amazon Mechanical Turk, where the judges were presented with a dialogue context and a candidate response, which they rated on appropriateness on a scale from 1 to 5. Based on the ratings, a recurrent neural network was trained to score the model response, given the context and the reference response. The Pearson's correlation between ADEM and the human judgments is computed on two levels: the utterance level and at the system level, where the system level rating is computed as the average score at the utterance-level achieved by the system. The Pearson's correlation for ADEM lies at 0.41 on the utterance level and at 0.954 on the system level. For comparison, the correlation to human judgments for the ROUGE score only lies at 0.062 on the utterance level and at 0.268 at the system level.
While ADEM relies on human-labelled data, Tao et al. (2018) present a method which has no need of human judges. Their model is based on two observations. Firstly, a response that is close to the ground truth is likely to be good. Secondly, a response that is related to the last utterance or the context of the conversation is good. They propose two submodels to capture these insights. The first model computes a representation of both the ground truth and the generated response based on min- and max-pooling of word embeddings. Then the cosine similarity is computed to measure the relatedness of the ground truth and the generated response. The second model rates the relatedness between the conversational context and the generated response. In order to train this model, [...]
Although trained metrics have a significantly higher correlation to human judgements, they have been shown not to be robust (Sai et al. 2019). In fact, simple manipulations of the response under consideration can lead to significant changes in the score of ADEM. For instance, in 48.66% of cases the predicted score increased when the generated response was reversed. In 86.93% of cases the predicted score increased when the generated response was replaced with a dull dummy response. Thus, creating reliable trained metrics is still an open problem.
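The first submodel of Tao et al. (2018), which compares the generated response to the ground truth via pooled word embeddings, can be sketched as follows. This is a simplified illustration: the embedding table is a hypothetical toy dictionary rather than pre-trained embeddings, out-of-vocabulary words are skipped, and the function names are not taken from the original work.

```python
import math
from typing import Dict, List, Optional


def pooled_vector(sentence: str, embeddings: Dict[str, List[float]]) -> Optional[List[float]]:
    """Concatenate the dimension-wise max and min over the word embeddings of a sentence."""
    vectors = [embeddings[tok] for tok in sentence.lower().split() if tok in embeddings]
    if not vectors:
        return None
    columns = list(zip(*vectors))
    return [max(col) for col in columns] + [min(col) for col in columns]


def referenced_score(reference: str, response: str, embeddings: Dict[str, List[float]]) -> float:
    """Cosine similarity between the pooled vectors of the ground truth and the response."""
    a = pooled_vector(reference, embeddings)
    b = pooled_vector(response, embeddings)
    if a is None or b is None:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical two-dimensional embeddings, for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "great": [0.9, 0.1], "bad": [-1.0, 0.0]}
print(referenced_score("good", "great", toy_embeddings))
print(referenced_score("good", "bad", toy_embeddings))
```

Unlike word-overlap metrics, this score can reward a response that uses different words from the ground truth as long as their embeddings are close.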

Human likeness
The classic approach to measure the quality of a conversational agent is the Turing Test devised by Turing (1950). The idea is to measure if the conversational dialogue system is capable of fooling a human into thinking that it is a human as well. Thus, according to this test, the main measure is the ability to imitate human behaviour.
Inspired by this idea, adversarial learning (Goodfellow et al. 2014) can be applied to evaluate a dialogue system. The framework of a generative adversarial model is composed of two parts: the generator, which generates data, and the discriminator, which tries to distinguish whether the data is real or artificially generated. The two components are trained in an adversarial manner: the generator tries to fool the discriminator, and the discriminator learns at the same time to identify if the data is real or artificial. Adversarial evaluation of dialogue systems was first studied by Kannan and Vinyals (2016), who trained a generative adversarial network (GAN) on dialogue data and used the performance of the discriminator as an indicator for the quality of the dialogue. The discriminator achieved an accuracy of 62.5%, which indicates a weak generator. However, the authors did not evaluate whether the discriminator score is a viable metric for evaluating a dialogue system.
A study on the viability of adversarial evaluation was conducted by Bruni and Fernandez (2017). For this, they compared the performance of discriminators to the performance of humans on the task of discriminating between real and artificially generated dialogue excerpts. Three different domains were used, namely: MovieTriples (46k dialogue passages), SubTle (3.2M dialogue passages) (Banchs 2012) and Switchboard (77k dialogue passages) (Godfrey et al. 1992). The GAN was trained on the concatenation of the three datasets. The evaluation was conducted on 900 dialogue passages, 300 per dataset, which were rated by humans as real or artificially generated. The results show that the annotator agreement among humans was low, with a Fleiss' κ (Fleiss 1971) of κ = 0.3, which shows that the task is difficult. The agreement between the discriminator and the humans is on par with the agreement among the humans, except for the Switchboard corpus, where κ = 0.07. Human annotators achieve an accuracy score with respect to the ground-truth of 64-67.7% depending on the domain. The discriminator achieves lower accuracy scores on the Switchboard dataset but higher scores than humans on the other two datasets.
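For reference, the agreement statistic used in this study, Fleiss' κ, can be computed as in the following sketch. The input format (one row per rated item, one column per category, each cell holding the number of raters who chose that category) and the toy ratings are assumptions made for this example.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] is the number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    # Observed agreement: average over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement: sum of squared overall category proportions.
    totals = [sum(col) for col in zip(*counts)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall into one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings: 4 dialogue passages, 3 raters, categories (real, generated).
ratings = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(fleiss_kappa(ratings))
```

A value of 1 indicates perfect agreement, whereas values close to 0 indicate agreement no better than chance.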
In order to evaluate the ability of the discriminators on different models, a seq2seq model was trained on the OpenSubtitles dataset (Tiedemann 2009) (80M dialogue passages). The discriminator and the human performance on the dialogues generated by the seq2seq model were evaluated. The results show that the discriminator performs better than the humans, which the authors attribute to the fact that the discriminators may pick up on patterns that are not apparent to humans. The agreement between humans and the discriminator is very low.
Fine-grained metrics The above methods for evaluating conversational dialogue systems work on a coarse-grained level. The dialogue is evaluated on the basis of producing adequate responses or its ability to emulate human behaviour. These concepts encompass finer-grained concepts. In this section, we look at topic-based evaluation.
Topic-based evaluation This measures the ability of a conversational agent to talk about different topics in a cohesive manner. Guo et al. (2018) propose two dimensions of topic-based evaluation: topic breadth (can the system talk about a large variety of topics?) and topic depth (can the system sustain a long and cohesive conversation about one topic?). For topic classification, a Deep Averaging Network (DAN) was trained on a large amount of question data. The DAN performs topic classification and detects topic-specific keywords. The conversational data used to evaluate the topic-based metrics stems from the Alexa-Prize challenge, 8 which consists of millions of dialogues and hundreds of thousands of live user ratings (on a scale from 1 to 5). Using the DAN, the authors classified the dialogue utterances according to the topics.
Conversational topic depth is measured by the average length of a sub-conversation on a specific topic, i.e. multiple consecutive turns where the utterances are classified as being the same topic. The conversational breadth is measured on a coarse- and fine-grained level. Coarse-grained topic breadth is measured as the average number of topics a bot converses about during a conversation. Fine-grained topic breadth, on the other hand, is measured as the total number of distinct topic keywords across all conversations.
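Given per-turn topic labels and keywords (which in the original work come from the DAN classifier), the three quantities described above can be sketched as follows. The input format and function names are hypothetical, and the sketch counts every maximal run of identical topic labels as a sub-conversation, including runs of length one, which is an assumption.

```python
from itertools import groupby
from typing import List


def topic_depth(conversations: List[List[str]]) -> float:
    """Average length of maximal runs of consecutive turns with the same topic label."""
    runs = [len(list(group)) for conv in conversations for _, group in groupby(conv)]
    return sum(runs) / len(runs)


def coarse_topic_breadth(conversations: List[List[str]]) -> float:
    """Average number of distinct topics per conversation."""
    return sum(len(set(conv)) for conv in conversations) / len(conversations)


def fine_topic_breadth(keywords_per_conversation: List[List[str]]) -> int:
    """Total number of distinct topic keywords across all conversations."""
    return len({kw for keywords in keywords_per_conversation for kw in keywords})


# Hypothetical topic labels (one per turn) and keywords for two conversations.
topics = [["music", "music", "sports"], ["movies", "movies", "movies"]]
keywords = [["jazz", "guitar"], ["jazz", "oscars"]]
print(topic_depth(topics), coarse_topic_breadth(topics), fine_topic_breadth(keywords))
```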
To measure the validity of the proposed metrics, correlations between the metric and the human judgments are computed. The conversational topic depth metric has a correlation of 0.707 with the human judgments. The topic breadth metric has a correlation of 0.512 with the human judgments. The lower correlation of the topic breadth is attributed to the fact that the users may not have noticed a bot repeating itself as they only conversed with a bot a few times.

Utterance selection metrics
The evaluation of dialogue systems based on utterance selection differs from the evaluation of generation-based dialogue systems. Here, the evaluation is based on metrics used in information retrieval, especially Recall@k (R@k). R@k measures the percentage of relevant utterances among the top-k selected utterances. One major drawback of this approach is that potentially correct utterances among the candidates could be regarded as incorrect.
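A minimal sketch of R@k for the common setting with exactly one correct response per context is given below: it reports the fraction of contexts for which the correct response appears among the k highest-ranked candidates. The single-correct-response setting, the input format, and the function name are assumptions made for this example.

```python
from typing import List


def recall_at_k(rankings: List[List[str]], correct: List[str], k: int) -> float:
    """Fraction of contexts whose correct response is among the top-k ranked candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, correct) if truth in ranked[:k])
    return hits / len(correct)


# Hypothetical rankings of three candidates for three contexts; "a" is always correct.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
correct = ["a", "a", "a"]
print(recall_at_k(rankings, correct, 1), recall_at_k(rankings, correct, 2))
```

The drawback mentioned above is visible in the sketch: a candidate such as "b" may be a perfectly acceptable response, but it is counted as a miss because only one response per context is marked as correct.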
Next Utterance selection Lowe et al. (2016) evaluate the impact of this limitation and evaluate whether the Next Utterance Classification (NUC) task is suitable to evaluate dialogue systems. For this, they invited 145 participants from Amazon Mechanical Turk (AMT) and 8 experts from their lab. The task was to select the correct response given a dialogue context (of at most six turns) and five candidate utterances, of which exactly one is correct. Note that the other four utterances could also be relevant, but are regarded as incorrect in this experiment. The study was performed on dialogues of three different domains: the SubTle Corpus (Banchs 2012) consisting of movie dialogues, the Twitter Corpus (Ritter et al. 2010) consisting of user dialogues, and the Ubuntu Dialogue Corpus (Lowe et al. 2015), which consists of conversations about Ubuntu related topics.
The human performance was compared to the performance of an artificial neural network (ANN), which is trained to solve the same task. The performance was measured by means of the R@1 score. The results show that for all domains, the human performance was significantly above random, which indicates that the task is feasible. Furthermore, the results show that the human performance varies depending on the domain and the expertise level. In fact, the lab participants performed significantly better on the Ubuntu domain, which is regarded as harder as it requires expert knowledge. This shows that there is a range of performance that can be achieved. Finally, the results showed that the ANN achieved similar performance to the human non-experts and performed worse than the experts. This shows that this task is not trivial and by far not solved. However, the authors did not take into account the fact that multiple candidate responses could be regarded as correct. This is possible since the candidate responses are selected by sampling at random from the corpus. On the other hand, it is not clear if their evaluation suffered from this potential limitation, as their results showed the feasibility and relevance of the NUC task.
DeVault et al. (2011) and Gandhe and Traum (2016) tackle the problem of having multiple relevant candidate utterances and propose metrics which take this into account. Their metrics are both dependent on human judges and measure the appropriateness of an utterance.
Weak agreement DeVault et al. (2011) propose the weak agreement metric. This metric is based on the observation that human judges only agree in about 50% of the cases on the same utterance for a given context. The authors attribute this to the fact that multiple utterances could be regarded as acceptable choices. Thus, the weak agreement metric regards an utterance as appropriate if at least one annotator chose this utterance to be appropriate.
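The idea can be sketched as follows: a system response counts as appropriate if at least one annotator selected the same utterance for that context, and the score is the fraction of contexts for which this holds. The input format (one list of annotator choices per context) and the function name are assumptions made for this example.

```python
from typing import List


def weak_agreement(system_choices: List[str], annotator_choices: List[List[str]]) -> float:
    """Fraction of contexts where the system's utterance was chosen by at least one annotator."""
    hits = sum(
        1 for chosen, annotations in zip(system_choices, annotator_choices)
        if chosen in set(annotations)
    )
    return hits / len(system_choices)


# Hypothetical data: three contexts, two annotators each.
system = ["u1", "u2", "u3"]
annotations = [["u1", "u4"], ["u5", "u6"], ["u3", "u3"]]
print(weak_agreement(system, annotations))
```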
The authors apply the weak agreement metric on the evaluation of a Virtual Human which simulates a witness in a war-zone and is designed to train military personnel in Tactical Questioning (Gandhe et al. 2009). They gathered 19 dialogues and 296 utterances in a Wizard-of-Oz experiment. To allow for more diversity, they let human experts write paraphrases of the commander role to ensure that the virtual character understands a larger variety of inputs. Furthermore, the experts expanded the set of possible answers by the virtual character by annotating other candidate utterances as appropriate.
The weak agreement metric was able to measure the improvement of the system when the extended dataset was applied: the simple system based on the raw Wizard-of-Oz data achieved a weak agreement of 43%; augmented with the paraphrases, the system achieved a score of 56%; and, finally, adding the manual annotation increases the score to 67%. Thus, the metric is able to measure the improvements made by the variety in the data.
Voted appropriateness One major drawback of the weak agreement is that it depends on human annotations and is not applicable to large amounts of data. Gandhe and Traum (2016) improve upon the idea of weak agreement by introducing the Voted Appropriateness metric. Voted Appropriateness takes into account the number of judges who selected an utterance for a given context. In contrast to weak agreement, which regards each adequate utterance equally, Voted Appropriateness weights each utterance.
Similarly to the PARADISE approach, the authors of Voted Appropriateness fit a linear regression model on the pairs of utterances and contexts labelled with the number of judges that selected the utterance. The fitted model only explains 23.8% of the variance. The authors compared the correlation of the Voted Appropriateness and the weak agreement metric to human judgments. The correlation was computed on the individual utterance level and the system level. For the system level, the authors used data from seven different dialogue systems and averaged the ratings over all dialogues of one system. On the interaction level, Voted Appropriateness achieved a correlation score of 0.479 (p < 0.001, n = 397), and weak agreement achieved 0.485 (p < 0.001, n = 397). On the system level, Voted Appropriateness achieved 0.893 (p < 0.01, n = 7) and weak agreement achieved 0.803 (p < 0.001, n = 397). Thus, on the system level Voted Appropriateness correlates more strongly with human judgments. Both metrics rely heavily on human annotations, which makes them hardly suitable for large-scale data-driven approaches.
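The difference between the two levels of analysis can be sketched as follows: at the utterance level the correlation is computed over individual scores, whereas at the system level the scores are first averaged per system. The use of Pearson's correlation, the input format, and the function names are assumptions made for this illustration and are not taken from the cited work.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


def system_level_correlation(metric: Dict[str, List[float]],
                             human: Dict[str, List[float]]) -> float:
    """Average the scores per system first, then correlate the system means."""
    systems = sorted(metric)
    metric_means = [sum(metric[s]) / len(metric[s]) for s in systems]
    human_means = [sum(human[s]) / len(human[s]) for s in systems]
    return pearson(metric_means, human_means)


# Hypothetical metric scores and human ratings for three systems.
metric = {"A": [1.0, 3.0], "B": [2.0, 4.0], "C": [5.0, 5.0]}
human = {"A": [3.0, 1.0], "B": [4.0, 2.0], "C": [4.0, 6.0]}
all_metric = [s for name in sorted(metric) for s in metric[name]]
all_human = [s for name in sorted(human) for s in human[name]]
print(pearson(all_metric, all_human), system_level_correlation(metric, human))
```

Averaging removes per-utterance noise, which is one reason why system-level correlations are typically higher than utterance-level correlations, although they are computed on far fewer data points.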

Question answering dialogue systems
A different form of task-oriented systems are Question Answering (QA) systems. Here, the task is defined as finding the correct answer to a question. This setting differs from the aforementioned task-oriented systems in the following ways:
• Task-oriented systems are developed for a multitude of tasks (e.g. restaurant reservation, travel information system, virtual assistant, etc.), whereas QA systems are developed to find answers to specific questions.
• Task-oriented systems are usually domain-specific, i.e. the domain is defined in advance through an ontology and remains fixed. In contrast, QA systems usually work on broader domains (e.g. factoid QA can be done over different domains at once), although there are also some QA systems focused only on a specific domain (Sarrouti and Ouatik El Alaoui 2017; Do et al. 2017).
• The dialogue aspect for QA systems is not tailored to sound human-like; rather, the focus is set on the completion of the task, that is, to provide a correct answer to the input question.

Characteristics
Generally, QA systems allow the users to search for information using a natural language interface, and return short answers to the user's question (Voorhees 2006). QA systems can be broadly categorized into three categories (Bernardi and Kirschner 2010): single-turn QA, context QA, and Interactive QA.
Single-turn QA Single-turn QA is the most common type of system. Here, the system is developed to return a single answer to the users' question without any further interaction. These systems work very well for factoid questions (Voorhees 2006). However, they have difficulties handling complex questions, which require several inference steps (Iyyer et al. 2017a) or situations where systems need additional information from the user (Li et al. 2017a).
Single-turn QA can be approached from two main perspectives (Rogers et al. 2020a):
• Open QA, where systems collect evidence and answers across several sources such as Web pages and knowledge bases (Fader et al. 2013).
• Reading Comprehension (RC), where the answer is gathered from a single document. This is the most common approach.
RC systems can be oriented to:
• Extractive RC, where systems extract spans of text with the answer. This approach has received a lot of attention fostered by the availability of popular benchmarks such as SQuAD (Rajpurkar et al. 2018), NewsQA (Trischler et al. 2017) or TriviaQA (Joshi et al. 2017). Each of these datasets contains thousands of examples, which permits training Deep Learning systems and obtaining good results.
• Multiple-choice RC, where systems must select an answer from a set of candidates. Multiple-Choice (MC) is a common way to measure reading comprehension in humans. This is why some researchers have pointed to MC as a better format to test the language understanding of automatic systems (Rogers et al. 2020a). There exist several MC collections, mostly in English. In some cases, creating them involves paying crowd-workers to gather documents and/or pose questions regarding those documents. For MCTest (Richardson et al. 2013), for example, the workers were asked to invent short, children-friendly, fictional stories and four questions with four answers each, including deliberately wrong answers. As a way to encourage a deeper understanding of texts, the QuAIL dataset includes unanswerable questions (Rogers et al. 2020b). Other datasets were created from real-world exams. This is the case of the well-known MC dataset RACE (Lai et al. 2017), or the multilingual Entrance Exams (Rodrigo et al. 2018).
• Generative QA, where systems create a text that answers the question. The exact text is not necessarily contained in any document, which makes this a challenging task. This kind of system has received less attention given that it is difficult to perform an exact evaluation and there are few datasets available (Kočiský et al. 2018).
There is a large amount of research in the area of single-turn QA and there are several surveys, we refer the reader to: Kolomiyets and Moens (2011);Diefenbach et al. (2018); Mishra and Jain (2016). In this survey, we focus on the evaluation of multi-turn QA systems, which is a much less researched area.
Context QA Context QA refers to systems which allow for follow-up questions to resolve ambiguities or to keep track of a sequence of inference steps (Peñas et al. 2012). The questions can be highly context-dependent and elliptical, with references to previous questions and answers, so that the exchange can be seen as a dialogue. In fact, it is common for questions to include pronouns instead of entities. That is, systems must rely not only on the source document and the last question, but also on the context given by previous questions and answers.
Context QA systems are also named multi-turn QA (Choi et al. 2018) or sequential QA (Saha et al. 2018). The most common approach is to develop these systems for extractive RC. In some cases, context QA systems are used for answering complex questions. These systems assume that answering some complex questions directly is unrealistic, but that such questions can be decomposed into simpler inter-related questions (Iyyer et al. 2017b). The system then answers the simpler questions and obtains an answer to the initial complex question (Talmor and Berant 2018).
Interactive QA Interactive QA (IQA) systems combine context QA systems and task-oriented dialogue systems. The main purpose of the conversation module is to handle under- or over-constrained questions (Qu and Green 2002). For example, if a question does not yield any results, the system might propose to relax some constraints. In contrast, if a question yields too many results, the interaction can be used to introduce new constraints to filter the list of results (Rieser and Lemon 2009). For a more in-depth discussion of IQA systems, refer to Konstantinova and Orasan (2013).

Technologies
Current QA technologies for single-turn QA are based on pre-trained transformer models such as BERT (Devlin et al. 2019), XLNet or ALBERT (Lan et al. 2019). These models have been pre-trained on unlabeled text with self-supervised objectives such as Masked Language Modeling and Next Sentence Prediction. Afterwards, each model can be fine-tuned on specific tasks such as those in GLUE (Wang et al. 2018) or QA.
Fine-tuning for QA systems is done by modelling span detection as the prediction of the start and end tokens of the answer in the paragraph. The input to the system is a question-paragraph pair. Thus, the trained system outputs the span with the highest probability of being an answer to the question. These systems achieve the best results for extractive QA, as can be seen in the corresponding leaderboards of the most popular collections 9.
In the case of multi-turn QA, systems must be aware of the dialogue history. One approach is to reuse single-turn systems, augmenting the input with previous questions and answers. In some cases, the system may focus on modelling information gain and include pre-trained models such as BERT (Yeh and Chen 2019).
Given the importance of dealing with answer history, other researchers have proposed representing the answer history using embeddings from pre-trained models (Qu et al. 2019). The system then also includes a history attention mechanism to help select items from the history of the dialogue.
Other models apply Adversarial Training and Knowledge Distillation to RoBERTa in order to fine-tune the pre-trained model more effectively (Ju et al. 2019). While Adversarial Training improves the robustness of the system against data perturbations, Knowledge Distillation transfers knowledge from one model to another to improve the results of the second model (Furlanello et al. 2018).
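The span selection and history augmentation described above can be illustrated with a minimal sketch. It assumes that a fine-tuned model has already produced one start score and one end score per paragraph token; the function names, the `max_len` limit and the literal " [SEP] " separator string are hypothetical choices made for illustration and are not taken from any of the cited systems.

```python
from typing import List, Sequence, Tuple


def best_span(start_scores: Sequence[float],
              end_scores: Sequence[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices maximising start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


def with_history(question: str,
                 history: List[Tuple[str, str]],
                 max_turns: int = 2) -> str:
    """Prepend the most recent (question, answer) turns to the current question."""
    parts = [f"{q} {a}" for q, a in history[-max_turns:]]
    parts.append(question)
    return " [SEP] ".join(parts)


tokens = ["Paris", "is", "the", "capital", "of", "France"]
start = [0.1, 0.0, 0.2, 2.5, 0.1, 0.3]
end = [0.2, 0.1, 0.0, 0.4, 0.1, 3.0]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "capital of France"
```

In a real system the scores would come from the model's output layer, and the augmented question would be tokenized together with the paragraph before being passed to the model.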

Evaluation of QA dialogue systems
The evaluation of QA systems has two aspects: the correctness of the answer and the flow of the conversation. Currently, most QA systems are evaluated based on the correctness of their answers. Even for multi-turn QA systems, the dialogue flow is often ignored during evaluation (Reddy et al. 2018; Choi et al. 2018).
Correctness metrics The evaluation of QA systems depends on the output of the system. For open QA, where the output is a ranking of sentences with potential answers, the evaluation is mostly based on ranking measures such as Mean Average Precision (MAP) or Mean Reciprocal Rank (MRR), but there are also evaluations based on precision, recall and F1 (Yang et al. 2015).
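As an illustration of the ranking measures mentioned above, the following sketch computes MRR and MAP from binary relevance judgements. Each question is represented by a list in which position k holds 1 if the k-th ranked sentence contains a correct answer and 0 otherwise. The function names are hypothetical, and Average Precision is normalised by the number of relevant items that appear in the ranking, which assumes that all relevant items were retrieved.

```python
from typing import List, Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@k taken at every rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_reciprocal_rank(rankings: List[Sequence[int]]) -> float:
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def mean_average_precision(rankings: List[Sequence[int]]) -> float:
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two questions: the first has its answer at rank 2,
# the second has answers at ranks 1 and 3.
rankings = [[0, 1, 0], [1, 0, 1]]
mrr = mean_reciprocal_rank(rankings)      # (1/2 + 1) / 2 = 0.75
map_ = mean_average_precision(rankings)   # (1/2 + 5/6) / 2 = 2/3
```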
For multiple-choice RC, the task is evaluated using accuracy (Clark and Etzioni 2016), that is, the proportion of questions for which the system selected the correct answer.
For extractive QA, which is the most common approach, the output is a span of text. The retrieved span is compared with the ground truth answers and two kinds of evaluations are given (Rajpurkar et al. 2016): • Exact matching, which measures the percentage of candidate answers that match any one of the ground truth answers exactly. • Approximate matching based on F1, which measures the macro-average overlap between the bag of words of candidates and ground truth answers.
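The two span-based measures can be sketched as follows. This is a simplified illustration in the spirit of SQuAD-style scoring, not the official evaluation script: the normalisation steps (lower-casing, removing punctuation and English articles) and all function names are assumptions made for this example.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> float:
    """1.0 if the prediction equals any reference after normalisation."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def token_f1(prediction: str, reference: str) -> float:
    """F1 over the bags of words of prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: List[str]) -> float:
    """Score against the closest of several ground-truth answers."""
    return max(token_f1(prediction, ref) for ref in references)
```

The per-question scores are then averaged over the whole test set, which yields the percentage for exact matching and the macro-average for F1.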
Dialogue evaluation The nature of multi-turn QA systems makes it quite hard to design accurate evaluation frameworks that go beyond the correctness measures, which do not take into consideration the dialogue aspect of the interaction. In fact, a proper evaluation of multi-turn QA systems requires humans to interact with the systems. The first evaluation framework designed specifically for IQA systems is based on a series of questionnaires that capture different aspects of the system (Kelly et al. 2009). The authors argue that metrics based on the relevance of the answers are not sufficient to evaluate an IQA system (e.g. they do not take user feedback into account). Thus, they evaluate the use of different questionnaires to assess the different systems. The questionnaires they propose are: • NASA TLX (cognitive workload questionnaire): Used to measure the cognitive workload as subjects complete different scenarios. • Task questionnaire: Filled out after each task; it focuses on the experience of using a system for a specific task. • System questionnaire: Filled out after using a system for multiple tasks; it measures the overall experience of the subjects.
Their evaluation showed that the Task Questionnaire is the most effective at distinguishing among different systems. The evaluation of dialogue QA systems requires one to simulate some interactions with users and evaluate them. These interactions can, on the one hand, be created by real users, which is associated with high costs and makes it hard to reproduce the experiments and reuse the data. For example, Li et al. (2017c) developed DailyDialog, a multi-turn dataset with 13k dialogues created by humans that also includes emotion information.
On the other hand, the interactions can be automatically produced. However, it is challenging to create users' responses automatically. One approach for creating simulations is to provide some feedback based on the supplementary questions. For example, if an additional question asks for a location, the simulator can return a location contained in the dialogue's history (or related to it) (Li et al. 2017a). Nevertheless, the simulation can generate several errors. Moreover, the simulation might only reward the generation of questions similar to a given template (Li et al. 2016b), which constrains the diversity of questions.
There is usually a weak correlation between automatic evaluations and human judgements in multi-turn QA. This is because most of the current QA dialogue systems are trained and tested using data where there is only a single response for each context (Serban et al. 2017c). Moreover, this data contains only one possible path to reach the correct answer, while the same answer could be reached with a different dialogue. In fact, there are many features involved in deciding the next response in a dialogue. This has been defined as the one-to-many problem of dialogues (Zhao et al. 2017).
Automatic evaluations based on multiple-reference responses have been proposed to alleviate the one-to-many problem. Multi-reference evaluations include several correct responses for a given context. Thus, these evaluations promote diversity better than single-response approaches. Sordoni et al. (2015) created a synthetic multiple-reference dialogue corpus based on Twitter. Additional responses to the initial response were retrieved using Information Retrieval and rated by crowd workers. The authors kept only responses with a high rating. A later dataset was created from Twitter following the work of Sordoni et al. (2015); however, it included all the synthetic responses (regardless of the rating given by crowd workers), and the data were used for testing a new metric called Discriminative BLEU. Sugiyama et al. (2019) performed another evaluation based on multiple-reference responses. They measured the correlation, using a regression-based approach, between systems' responses and a large set of both positive and negative human references. Gupta et al. (2019) extended the test split of DailyDialog (1k dialogues) with multiple references. They compared the results of using single-reference versus multiple-reference data. Both works showed a higher correlation of automatic evaluations with human judgments when using multiple-reference dialogues instead of single-reference data.
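The effect of multiple references can be illustrated with a small sketch. It scores a system response against every available reference and keeps the best score, so that a valid response is not penalised merely for differing from one particular reference. The unigram F1 used here is only a stand-in for word-overlap metrics in general; it is not the metric used in any of the works cited above, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def unigram_f1(response: str, reference: str) -> float:
    """Word-overlap F1 between a response and a single reference."""
    resp = response.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(response: str, references: List[str]) -> float:
    """Best overlap score over all references for the same context."""
    return max(unigram_f1(response, ref) for ref in references)


# Context: "How are you today?"
response = "i am fine thanks"
single = multi_reference_score(response, ["not bad"])
multi = multi_reference_score(response, ["not bad", "i am fine thanks"])
```

With the single reference the response receives a score of zero although it is a perfectly acceptable reply; with the second reference added, it receives the maximum score.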

Evaluation datasets and challenges
Datasets play an important role for the evaluation of dialogue systems, together with challenges open to public participation. A large number of datasets have been used and made publicly available for the evaluation of dialogue systems in the last decades, but the coverage across dialogue components and evaluation methods (e.g. Sects. 3 and 4) is uneven.
Also note that datasets are not restricted to specific evaluation methods, as they can be used to feed more than one evaluation method or metric interchangeably. In this section, we cover the most relevant datasets and challenges, starting with select datasets. For further references, refer to a broad survey of publicly available datasets that have already been used to build and evaluate dialogue systems carried out by Serban et al. (2018). The dialogue datasets selected for this section are listed in Tables 9, 10 and 11, where properties such as the topics covered and number of dialogues are indicated.

Datasets for task-oriented dialogue systems
Datasets are usually designed to evaluate specific dialogue components, and very few public datasets are able to evaluate an entire task-oriented dialogue system (e.g. Sect. 3). The evaluation of these kinds of systems is highly system-specific, and it is therefore difficult to reuse the dataset with other systems. Their evaluation also requires considerable human effort, as the involvement of individual users or external evaluators is usually needed. For example, the evaluation of policies in Gasic et al. (2013), a partially observable Markov decision process (POMDP)-based dialogue system for the restaurant domain mentioned in Sect. 3.3.1, is done by crowd workers via the Amazon Mechanical Turk service. Mechanical Turk users were first asked to find some specific restaurants, and after each dialogue was finished, they had to fill in a feedback form to indicate whether the dialogue had been successful or not. Similarly, for the end-to-end dialogue system by Wen et al. (2017) (cf. Sect. 3.3.2), also for the restaurant domain, human evaluation was conducted by users recruited via Amazon Mechanical Turk. Each evaluator had to follow a given task and rate the system's performance. More specifically, they had to grade the subjective success rate, the perceived comprehension ability and the naturalness of the responses.
Most of the task-oriented datasets are designed to evaluate components of dialogue systems. For example, several datasets have been released through different editions of the Dialog State Tracking Challenge, 10 focused on the development and evaluation of the dialogue state tracker component. However, even though these datasets were designed to test state tracking, Bordes et al. (2017) used them to build and evaluate a whole dialogue system, re-adjusting the dataset by ignoring the state annotation and reusing only the transcripts of the dialogues. The Schema Guided Dialogue (SGD) dataset released for the 8th edition of DSTC was designed to test not only state tracking, but also intent prediction, slot filling and language generation for large-scale virtual assistants. SGD consists of almost 23K annotated multi-domain (bank, media, calendar, travel, weather, ...), task-oriented dialogues between a human and a virtual assistant.
The MultiWOZ (Multi-Domain Wizard-of-Oz) dataset represented a significant breakthrough in addressing the scarcity of dialogue data, as it contains around 10K dialogues, which is at least one order of magnitude larger than any structured corpus available before. It is annotated with dialogue belief states and dialogue actions, so it can be used for the development of individual components of a dialogue system. Moreover, its considerable size makes it very appropriate for training end-to-end dialogue systems. The main topic of the dialogues is tourism, covering seven domains: attractions, hospitals, police, hotels, restaurants, taxis and trains. Each dialogue can contain more than one of these domains.
Similar in size and content to MultiWOZ is the Taskmaster-1 task-based dialogue dataset (Byrne et al. 2019). It includes around 13K dialogues in six domains: ordering pizza, setting auto repair appointments, arranging taxi services, ordering movie tickets, ordering coffee drinks and making restaurant reservations. What makes it different from MultiWOZ is that more than half of the dialogues were created following a self-dialogue methodology, in which a crowd worker writes the full dialogue themselves. The authors claim that these self-dialogues have richer and more diverse language than, for example, MultiWOZ, as they are not restricted to a small knowledge base.
The largest human-generated, multi-domain dialogue dataset that is available to the public is MultiDoGo (Peskov et al. 2019), which comprises over 81K dialogues. These dialogues were created following the Wizard-of-Oz approach between a crowd worker and a trained annotator. The participants were guided to introduce specific biases like intent or slot change, multi-intent, multiple slot values, slot overfilling and slot deletion in conversations. Additionally, over 54K of the dialogues are annotated at the turn level for intent classes and slot labels. The dialogues come from six different domains: airline, fast food, finance, insurance, media and software support.
We conclude this section by discussing two related tools, rather than dialogue datasets. The first tool, called PyDial, 11 partially addresses the shortage of evaluation datasets for task-oriented systems. This is because it offers a reinforcement-learning-based dialogue management environment for benchmarking purposes. Thus, it makes it possible to evaluate and compare different task-oriented dialogue systems under the same conditions. This toolkit not only provides domain-independent implementations of different modules in a dialogue system, but also simulates users (see Sect. 3.4.2). It uses two metrics for the evaluation: (1) the average success rate and (2) the average reward for each evaluated policy model of the reinforcement-learning algorithms. The success rate is defined as the percentage of dialogues that are completed successfully. Thus, it is closely related to the task-completion metric used by the PARADISE framework (see Sect. 3.4.1).
The second tool is a dialogue annotation tool called LIDA (Collins et al. 2019). The authors argue that the quality of a dataset has a significant effect on the quality of a dialogue system; hence, a good dialogue annotation tool is essential for creating the best annotated dialogue dataset. LIDA is the first annotation tool that handles the entire dialogue annotation pipeline, from turn and dialogue segmentation through to labelling structured conversation data. Moreover, it also includes an interface for resolving inter-annotator disagreements.
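The two evaluation metrics described above can be sketched as follows, given the outcomes of a set of dialogues run with one policy. The `DialogueResult` record and the `evaluate_policy` function are hypothetical names introduced for illustration; they are not part of the PyDial API.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class DialogueResult:
    success: bool   # whether the dialogue completed the task
    reward: float   # cumulative reward obtained in the dialogue


def evaluate_policy(results: Iterable[DialogueResult]) -> Tuple[float, float]:
    """Return (success rate in percent, average reward) over all dialogues."""
    results = list(results)
    if not results:
        raise ValueError("at least one dialogue is required")
    n = len(results)
    success_rate = 100.0 * sum(r.success for r in results) / n
    average_reward = sum(r.reward for r in results) / n
    return success_rate, average_reward


outcomes = [
    DialogueResult(True, 15.0),
    DialogueResult(False, -8.0),
    DialogueResult(True, 12.0),
    DialogueResult(True, 9.0),
]
success_rate, average_reward = evaluate_policy(outcomes)  # 75.0, 7.0
```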

Datasets for conversational dialogue systems
Regarding the evaluation of the conversational dialogue systems presented in Sect. 4, datasets derived from conversations on micro-blogging or social media websites (e.g. Twitter or Reddit) are good candidates, as they contain general-purpose or non-task-oriented conversations and are orders of magnitude larger than other dialogue datasets used before. For instance, Switchboard (Godfrey et al. 1992) (telephone conversations on pre-specified topics), British National Corpus (Leech 1993) (British dialogues in many contexts, from formal business or government meetings to radio shows and phone-ins) and SubTle Corpus (Ameixa and Coheur 2013) (aligned interaction-response pairs from movie subtitles) are three datasets released earlier that have 2400, 854 and 3.35 M dialogues and 3 M, 10 M and 20 M words, respectively. These sizes are relatively small compared to the huge Reddit Corpus 12 which contains over 1.7 billion comments, 13 or the Twitter Corpus described below.
Because of the limit on the number of characters permitted in each message on Twitter, the utterances are quite short, very colloquial and chat-like. Moreover, as the conversations happen almost in real time, the conversations on this micro-blogging website are very similar to spoken dialogues between humans. There are two publicly available large corpora extracted from Twitter. The first is the Twitter Corpus presented in Ritter et al. (2010), which contains roughly 1.3 million conversations and 125M words drawn from Twitter. The second is a collection of 4232 three-step (context-message-response) conversational snippets extracted from Twitter logs. 14 It is labeled by crowdsourced annotators, who rate the quality of a response in a given context.
Alternatively, Lowe et al. (2015) hypothesized that chat-room style messaging is more closely correlated to human-to-human dialogues than micro-blogging websites like Twitter, or forum-based sites such as Reddit. Thus, they presented the above-mentioned Ubuntu Dialogue Corpus. This large-scale corpus targets a specific domain. It could therefore be used as a task-oriented dataset for the research and evaluation of dialogue state trackers. However, it also has the unstructured nature of interactions from microblog services, which makes it appropriate for the evaluation of non-task-oriented dialogue systems.
These two large datasets are adequate for the three types of evaluation metrics for non-task-oriented dialogue systems: unsupervised, trained and utterance selection metrics. Note that, additionally, human judgments may be needed in some cases, such as in Lowe et al. (2017a) for the ADEM system (see Sect. 4.3.1). There, human judgments collected via Amazon Mechanical Turk are used in addition to the evaluation on the Twitter dataset.
Apart from the aforementioned two datasets, the five datasets generated recently for the bAbI tasks are appropriate for evaluation using the next utterance classification method (see Sect. 4.3.2). These tasks were designed for testing end-to-end dialogue systems in the restaurant domain, but they check whether the systems can predict the appropriate utterances among a fixed set of candidates, and are not useful for systems that generate the utterance directly. The ibAbI dataset mentioned in the next section has been created based on bAbI to cover several representative multi-turn QA tasks.
Another interesting resource is the ParlAI framework 15 for dialogue research, as it contains many popular datasets available all in one place with the goal of sharing, training and evaluating dialogue models across many tasks (Miller et al. 2017).
[...] (2015), which contains 21,133 dialogues.
14 https://www.microsoft.com/en-us/download/details.aspx?id=52375.

Datasets for question answering dialogue systems
With respect to QA dialogue systems, two datasets have been created based on human interactions from technical chats or forums. The first one is the Ubuntu Dialogue Corpus, containing almost one million multi-turn dialogues extracted from the Ubuntu chat logs, which were used to receive technical support for various Ubuntu-related problems (Lowe et al. 2015). Similarly, MSDialog contains dialogues from a forum dedicated to Microsoft products. MSDialog also contains the user intent of each interaction (Qu et al. 2018).
ibAbI represents another approach for creating multi-turn QA datasets (Li et al. 2017a). ibAbI adds interactivity to the bAbI dataset presented previously (see Sect. 6.2) by adding sentences and ambiguous questions together with the corresponding disambiguation question, which should be asked by an automatic system. The authors evaluate their system in terms of successfully completed tasks. However, it is unclear how to evaluate a system if it produces a modified version of the disambiguation question.
Recently, several datasets that are very relevant for the context of QA dialogue systems have been released. The CoQA (Conversational Question Answering) dataset contains 8K dialogues and 127K conversation turns (Reddy et al. 2018). The answers from CoQA are free-form text with their corresponding evidence highlighted in the passage. It is a multi-domain dataset, as the passages are selected from several sources, covering seven different domains: children's stories, literature, middle and high school English exams, news, articles from Wikipedia, science and discussions from Reddit. QuAC (Question Answering in Context) consists of 14K information-seeking QA dialogues (100K total QA pairs) over sections from Wikipedia articles about people (Choi et al. 2018). What distinguishes it from earlier datasets is that some of the questions are unanswerable and that context is needed in order to answer some of the questions. Another similar dataset with unanswerable and context-dependent questions is DoQA, a dataset for accessing domain-specific Frequently Asked Question sites via conversational QA (Campos et al. 2019). It contains 1,637 information-seeking dialogues in the cooking domain (7,329 questions in total). An analysis carried out by the authors showed that this dataset contains fewer factoid questions than the others, as DoQA focuses on open-ended questions about specific topics. Amazon Mechanical Turk was used to collect the dialogues for the three datasets.

Evaluation challenges
We conclude this section by summarizing some of the recent evaluation challenges that are popular for benchmarking state-of-the-art dialogue systems. They have an important role in the evaluation of dialogue systems, not only because they offer a good benchmark scenario to test and compare the systems on a common platform, but also because they often release the dialogue datasets for later evaluation.
Perhaps one of the most popular challenges is the Dialog State Tracking Challenge (DSTC), 18 which was previously mentioned in this section. DSTC was started in 2013 in order to provide a common testbed for the task of dialogue state tracking. It continued on a yearly basis with remarkable success. For its sixth edition, it was renamed Dialog System Technology Challenges due to the interest of the research community in a wider variety of dialogue-related problems. Various well-known datasets have been produced and released for every edition: DSTC1 has human-computer dialogues in the bus timetable domain; DSTC2 and DSTC3 used human-computer dialogues in the restaurant information domain; DSTC4 dialogues were human-human and in the tourist information domain; DSTC5 is also from the tourist information domain, but training dialogues are provided in one language and test dialogues are in a different language. Finally, as the DSTC6 edition consisted of 3 parallel tracks, different datasets were released for each track, such as a transaction dialogue dataset for the restaurant domain, two datasets that are part of the OpenSubtitles and Twitter datasets, and different chat-oriented dialogue datasets with dialogue breakdown annotations in Japanese and English.
A more recent challenge is the Conversational Intelligence Challenge (ConvAI), 19 which started in 2017 and continued in 2018 with its second edition. This challenge, conducted under the scope of NIPS, aims to unify the community around the task of building systems capable of intelligent conversations. In its first edition, teams were expected to submit dialogue systems able to carry out intelligent and natural conversations with humans about specific news articles. The aim of the task of the second edition was to model normal conversation when two interlocutors meet for the first time and get to know each other. The dataset of this task consists of 10,981 dialogues with 164,356 utterances, and it is available in the ParlAI framework mentioned above.
Finally, the Alexa Prize 20 has attracted mass media and research attention alike. This annual competition for university teams is dedicated to accelerating the field of conversational AI in the framework of the Alexa technology. The participants have to create socialbots that can converse coherently and engagingly with humans on news events and popular topics such as entertainment, sports, politics, technology and fashion. Unfortunately, no datasets have been released.

Challenges and future trends
In the introduction, we stated that the goal of dialogue evaluation is to find methods that are automated, repeatable, correlated to human judgements, capable of differentiating among various dialogue strategies, and able to explain which features of the dialogue system contribute to its quality. The main motivation behind this is the need to reduce the human evaluation effort as much as possible, since human involvement creates high costs and is highly time-consuming. In this survey, we presented the main concepts regarding the evaluation of dialogue systems and showcased the most important methods. However, the evaluation of dialogue systems is still an area of open research. In this section, we summarize the current challenges and future trends that we deem most important.
Automation The evaluation methods covered in this survey all achieve a certain degree of automation. However, the automation is achieved with significant engineering effort, or at the cost of lower correlation with human judgements. Word-overlap metrics (see Sect. 4.3.1), which are borrowed from the machine translation and summarization community, are fully automated. However, they do not correlate with human judgements on the turn level. On the other hand, BLEU becomes more competitive when applied on the corpus level or system level (Lowe et al. 2017a). More recent metrics such as ADEM (see Sect. 4.3.1) have significantly higher correlations to human judgements while requiring a significant amount of human-annotated data as well as thorough engineering.
Task-oriented dialogue systems can be evaluated semi-automatically or even fully automatically. These systems benefit from having a well-defined task, where success can be measured. Thus, user satisfaction modelling (see Sect. 3.4.1) as well as user simulations (see Sect. 3.4.2) exploit this to automate their evaluation. However, both approaches need a significant amount of engineering and human annotation: user satisfaction modelling usually requires prior annotation effort, which is followed by fitting a model that predicts the judgements. In addition to this effort, the process potentially has to be repeated for each new domain or new functionality that the dialogue system incorporates. Although in some cases the model fitted on the data for one dialogue system can be reused to predict judgements for another dialogue system, this is not always possible.
On the other hand, user simulations require two steps: gathering data to develop a first version of the simulation, and then building the actual user simulation. The first step is only required for user simulations that are based on training corpora (e.g. the neural user simulation). A significant drawback is that the user simulation is only capable of simulating the behaviour which is represented in the corpus or the rules. This means that it cannot cover unseen behaviour well. Furthermore, the user simulation can hardly be used to train or evaluate dialogue systems for other tasks or domains.
Automation is thus achieved to a certain degree, but with significant drawbacks. Hence, finding ways to facilitate the automation of evaluation methods is clearly an open challenge.
High quality dialogues One major objective for a dialogue system is to deliver high quality interactions with its users. However, it is often not clear how "high quality" is defined in this context or how to measure it. For task-oriented dialogue systems, quality is most commonly measured by means of task success and the number of dialogue turns (e.g. a reward of 20 for task success minus the number of turns needed to achieve the goal). However, this definition is not applicable to conversational dialogue systems and it might ignore other aspects of the interaction (e.g. frustration of the user). Thus, the current trend is to let humans judge the appropriateness of the system utterances. However, the notion of appropriateness is highly subjective and entails several finer-grained concepts (e.g. ability to maintain the topic, the coherence of the utterance, the grammatical correctness of the utterance itself, etc.). Currently, appropriateness is modelled by means of latent representations (e.g. ADEM), which are again derived from annotated data.
Other aspects of quality concern the purpose of the dialogue system in conjunction with the functionality of the system. For instance, Zhou et al. (2018) define the purpose of their conversational dialogue system as building an emotional bond between the dialogue system and the user. This goal differs significantly from the task of training a medical student in the interaction with patients. Both systems need to be evaluated with respect to their particular goal. The ability to build an emotional bond can be evaluated by means of the interaction length (longer interactions are an indicator of higher user engagement), whereas training (or e-learning) systems are usually evaluated regarding their ability to select an appropriate utterance for the given context.
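The task-success-based quality measure mentioned in the example above can be written down directly. The following sketch assumes the values given in the text (a bonus of 20 for task success and a penalty of one per turn); the function name and parameter names are hypothetical.

```python
def dialogue_reward(task_success: bool,
                    num_turns: int,
                    success_bonus: float = 20.0,
                    turn_penalty: float = 1.0) -> float:
    """Reward = success bonus (if the task succeeded) minus a per-turn penalty."""
    bonus = success_bonus if task_success else 0.0
    return bonus - turn_penalty * num_turns


short_success = dialogue_reward(True, 6)    # 20 - 6 = 14.0
long_failure = dialogue_reward(False, 10)   # 0 - 10 = -10.0
```

Under this definition, a short successful dialogue is preferred over a long successful one, and any failed dialogue receives a negative reward. The measure carries no information about aspects such as user frustration, which is the limitation noted above.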
The target audience plays an important role as well. Since quality is mainly a subjective measure, different user groups prefer different types of interactions. For instance, depending on the level of domain knowledge, novice users prefer instructions that use less specialized wording, whereas domain experts might prefer a more specialized vocabulary.

The notion of quality is thus dependent on a large number of factors. The evaluation needs to be adapted to take aspects such as the dialogue system's purpose, the target audience, and the dialogue system implementation itself into account.
Lifelong learning The notion of lifelong learning for machine learning systems has gained traction recently. The main concept of lifelong learning is that a deployed machine learning system continues to improve by interaction with its environment (Chen et al. 2016). Lifelong learning for dialogue systems is motivated by the fact that it is not possible to encounter all possible situations during training. Thus, a component that allows the dialogue system to retrain itself and adapt its strategy during deployment seems the most logical solution.
The evaluation step is critical in order to achieve lifelong learning. Since the dialogue system relies on the ability to automatically find critical dialogue states where it needs assistance, a module is needed which is able to evaluate the ongoing dialogue. One step in this direction is taken by Hancock et al. (2019), who present a solution that relies on a satisfaction module that is able to classify the current dialogue state as either satisfactory or not. If this module finds an unsatisfactory dialogue state, a feedback module asks the user for feedback. The feedback data is then used to improve the dialogue system.
The aspect of lifelong learning brings a large variety of novel challenges. Firstly, the lifelong learning system requires a module that self-monitors its behaviour and notices when a dialogue is going wrong. For this, the module needs to rely on evaluation methods that work automatically, or at least semi-automatically. The second challenge lies in the evaluation of the lifelong learning system itself. The self-monitoring module as well as the adaptive behaviour need to be evaluated. This brings a new dimension of complexity into the evaluation procedure.

Conclusion
Evaluation is a critical task when developing and researching dialogue systems. Over the past decades, many methods and concepts have been proposed. These methods and concepts are related to the different requirements and functionalities of the dialogue systems, which in turn depend on the current development stage of dialogue system technology. Currently, the trend is moving towards building end-to-end trainable dialogue systems based on large amounts of data. These systems have different requirements for evaluation than a finite-state-machine-based system. Thus, the problem of evaluation is evolving in tandem with the progress of the dialogue system technology itself. This survey presents the current state-of-the-art research in evaluation.