1 Introduction

Robots are rapidly becoming a valuable resource in many areas such as homes, offices, shops, museums, and hospitals [1]. In particular, robots in hospitals can handle repetitive tasks such as checking appointments or prescriptions on behalf of human receptionists. Robots with task-oriented dialogue systems are essential for interacting with patients in hospitals, and using them in place of human operators can be expected to reduce time and effort. In our previous work [1], we designed a receptionist robot for a hospital reception environment, but some components of the system, such as the dialogue system and gesture generation, were purely rule-based.

Many researchers have developed task-oriented dialogue systems for Human-Robot Interaction (HRI) [2,3,4,5,6,7]. These robots adopt the conventional task-oriented dialogue architecture, in which several components are connected in a pipeline. In this approach, natural language understanding (NLU) identifies the user's intent and extracts semantic information (slot values) from the recognized user utterance. The output of the NLU is passed to dialogue state tracking (DST), which maintains a distribution over dialogue states capturing the user's intent and the slot values expressed so far across the dialogue history. This distribution is passed to the dialogue policy module, where the next system action is selected. A system action can be represented as a semantic frame containing an action name (e.g., confirm a user request, request the user's name) together with entity values (e.g., name, address, age). The generated system action is then passed to natural language generation (NLG) to produce the actual response.
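
For illustration, the sketch below shows (with hypothetical intent, action, and slot names) how such semantic frames might look and how an NLG step could turn one into a surface response; it is not taken from any particular system:

```python
# A minimal sketch of the semantic frames exchanged in a pipeline
# dialogue system; intent, action, and slot names are hypothetical.

# NLU output: the recognized intent and extracted slot values.
nlu_output = {
    "intent": "check_in",
    "slots": {"name": "John Smith", "time": "3 pm"},
}

# Dialogue policy output: the next system action as a semantic frame.
system_action = {
    "action": "request",   # e.g. confirm, request, inform
    "entity": "address",   # e.g. name, address, age
}

# NLG turns the semantic frame into an actual response.
def generate(action):
    templates = {("request", "address"): "May I have your address, please?"}
    return templates[(action["action"], action["entity"])]

print(generate(system_action))  # -> "May I have your address, please?"
```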

Recently, with the success of chit-chat systems based on end-to-end trainable neural network models [8, 9], researchers have started exploring end-to-end approaches to address the difficulties of the pipeline approach. The end-to-end methods are mainly based on the idea that recurrent neural networks (RNNs) can be trained directly on text transcripts of dialogues to learn distributed dialogue representations. Owing to this property of RNNs, end-to-end approaches tend to use a single module to generate a response rather than the separate modules of the pipeline methods.

We address the research question of how an end-to-end dialogue system can be applied to a robot. A different approach may be necessary when building a robot system, because an end-to-end dialogue system tracks the dialogue state in a trained hidden state, which makes it impossible to manually define the robot's behavior, such as gestures and expressions. We aim to fill this gap by demonstrating how to build a receptionist robot with an end-to-end dialogue system. Two questions arise for HRI with the end-to-end approach: 1) how can an end-to-end dialogue system be applied to the robot? 2) how can the robot be made to express its behavior?

To this end, we propose a robot dialogue system that can generate responses and gestures according to user input. Note that, as a first attempt at HRI with an end-to-end approach, we focus only on generating the robot's gestures. We utilize the Hybrid Code Network (HCN) [9] and extend it to produce a response together with a selected gesture. We apply a recurrent neural network (RNN) to select the robot's gesture depending on the system response generated by the HCN. The dialogue system is then integrated into a robot and deployed as a real receptionist. To examine the feasibility of the proposed system, we conduct an experiment with real users and compare it with a rule-based system. As evaluation metrics, we use the PARADISE framework and the Godspeed test.

The main contributions of this work are as follows:

  • We propose a robot dialogue system that produces responses and gestures from user input in an end-to-end manner.

  • We conduct a comparative experiment with real users between the proposed system and a baseline system.

  • The experimental results show that the proposed system achieves better dialogue efficiency, i.e., it completes a given task more efficiently.

  • The results also show that the proposed system is more efficient in terms of development speed.

This paper is organized as follows. Section 2 reviews related work. Section 3 presents our extended HCN for HRI. Section 4 describes the components of our receptionist robot system. Section 5 presents an experiment with the receptionist robot and real users. Section 6 discusses the results, and Section 7 summarizes and concludes this paper.

2 Related Work

2.1 Dialogue Systems for Service Robot

In previous studies, many researchers have developed task-oriented dialogue systems for HRI. Finite State Machine (FSM) and slot-based methods have been applied together with non-verbal behaviors such as facial expressions and gestures [2,3,4]. Statistical approaches such as the Partially Observable Markov Decision Process (POMDP) have been applied to dialogue systems to maintain a distribution over possible dialogue states [5, 6]. Reinforcement learning has been applied to combine chit-chat and task-based conversation in a dialogue system [7]. However, these systems are based on the traditional pipeline approach and rule-based behavior selection, and they share several drawbacks. According to [9], it is often unclear how the dialogue state is defined and how the dialogue history is maintained to select the system behavior from the current dialogue state. Moreover, the traditional approaches are expensive and time-consuming to deploy, which makes it difficult to scale them to new domains [10].

Fig. 1 The overall structure of the extended HCN. It consists of entity handling, a response selector, entity output, and a gesture selector. Except for the gesture selector, the modules are implemented in the same way as in the original HCN

2.2 Recent Trends in Dialogue System

With the recent success of chit-chat systems based on trainable end-to-end neural network models [8, 11], researchers have begun to explore end-to-end approaches to address the challenges of the traditional approaches. The end-to-end methods are mainly based on the idea that RNNs can be trained directly on text transcripts of dialogues to learn distributed dialogue representations. Owing to the benefits of RNNs, end-to-end approaches tend to use a single module to generate responses rather than the separate modules of traditional methods.

Bordes et al. [12] developed an end-to-end trainable framework using end-to-end memory networks (MemN2N) [13], which consist of inference modules and memory components that can be read and written. In similar studies, other researchers explored approaches using gated end-to-end memory networks [14], query reduction networks [15], and copy-augmented sequence-to-sequence networks [10]. However, according to Williams et al. [9], these purely RNN-based approaches lack a general mechanism for injecting domain knowledge. Such knowledge can often be encoded with a few lines of code, yet the models mentioned above require thousands of conversations to learn these simple behaviors. To address these limitations, they introduced a practical RNN-based end-to-end framework called HCN.

3 Dialogue System for Hospital Receptionist Robot

We adopt HCN and extend it with a gesture selector to produce both responses and gestures, as shown in Fig. 1. HCN combines rule-based, domain-specific components with a neural network-based response selector that tracks a latent dialogue state. The details of each component are described in the following paragraphs.

Once the user's utterance is provided, it is transformed into four different feature vectors: a word embedding, a bag-of-words vector, a context feature vector, and an action mask. We use pretrained word embeddings to obtain the embedding vector \(emb_t\) at time \(t\), and the utterance tokens form the bag-of-words vector \(b_t\). The entity handling module, a domain-specific component used to generate the context and action mask vectors, identifies entities in the user's utterance \(U=\{u_1,u_2,\ldots,u_T\}\) with \(T\) words and stores the identified entities. If a new entity of the same type is identified later, the old entity is replaced with the newly identified one. Finally, the module generates the action mask \(am_t\) and the context feature vector \(c_t\) at time \(t\) as part of the input to the response selector.

These feature vectors are then concatenated into the input feature vector \(f_t\) of the response selector:

$$\begin{aligned} f_t=[emb_t \oplus b_t \oplus am_t \oplus c_t]. \end{aligned}$$
(1)
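
The following sketch illustrates how such a feature vector could be assembled; the toy vocabulary and dimensions are assumptions for illustration, while the four entity types follow the slots used later in the baseline (name, address, time, location):

```python
import numpy as np

# Hypothetical sizes; the real dimensions depend on the dataset.
EMB_DIM, VOCAB_SIZE, N_ACTIONS = 300, 1000, 11
ENTITY_TYPES = ("name", "address", "time", "location")

def build_features(emb_t, tokens, known_entities, allowed_actions, vocab):
    """Concatenate the four feature vectors of Eq. (1) into f_t."""
    # Bag-of-words vector b_t over the utterance tokens.
    b_t = np.zeros(VOCAB_SIZE)
    for tok in tokens:
        if tok in vocab:
            b_t[vocab[tok]] = 1.0
    # Action mask am_t: which response templates are currently allowed.
    am_t = np.zeros(N_ACTIONS)
    am_t[list(allowed_actions)] = 1.0
    # Context feature c_t: which entity types have been collected so far.
    c_t = np.array([1.0 if e in known_entities else 0.0 for e in ENTITY_TYPES])
    return np.concatenate([emb_t, b_t, am_t, c_t])  # f_t

# Example usage with a toy vocabulary and a random embedding vector.
vocab = {"where": 0, "is": 1, "the": 2, "bathroom": 3}
f_t = build_features(np.random.rand(EMB_DIM),
                     ["where", "is", "the", "bathroom"],
                     {"name"}, [0, 3, 7], vocab)
print(f_t.shape)  # (300 + 1000 + 11 + 4,) = (1315,)
```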

The response selector (Fig. 2), consisting of LSTM [16], dense, and softmax layers, determines the response \({\hat{r}}\) from the input feature vector \(f_t\). To determine \({\hat{r}}\), the LSTM is first fed with the feature vectors \(f_1,\ldots,f_{t-1}\) of the previous turns. In the final step, the LSTM receives the feature vector \(f_t\) and produces an 11-dimensional probability distribution, whose size equals the number of response templates extracted from the dialogue dataset we used.
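
As a rough illustration only (not the authors' exact architecture or hyperparameters), such a response selector could be sketched in Keras as follows:

```python
import tensorflow as tf

# Hidden size is an assumption; the feature dimension follows the sketch above.
FEAT_DIM = 300 + 1000 + 11 + 4   # |emb_t| + |b_t| + |am_t| + |c_t|
N_ACTIONS = 11                   # number of response templates

response_selector = tf.keras.Sequential([
    tf.keras.Input(shape=(None, FEAT_DIM)),   # sequence f_1 .. f_t
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),
])
response_selector.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy")
# At inference time, the dialogue history of feature vectors is fed in and
# the template with the highest probability is taken as r_hat.
```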

The entity output module generates a fully formed response based on the response template chosen by the response selector. For example, if the action template “api_call location <location>” is selected by the previous module, the entity output module fills in the stored entity to produce “api_call location bathroom”.
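
A minimal sketch of this template-filling step (the helper name is hypothetical):

```python
def fill_template(template: str, entities: dict) -> str:
    """Replace <slot> placeholders in the selected response template
    with the entities stored by the entity handling module."""
    for slot, value in entities.items():
        template = template.replace(f"<{slot}>", value)
    return template

# Example from the text: the stored entity fills the placeholder.
print(fill_template("api_call location <location>", {"location": "bathroom"}))
# -> "api_call location bathroom"
```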

Fig. 2 Overview of the response selector, consisting of embedding, LSTM, dropout, dense, and softmax layers. This module predicts the next response based on the combined features

Fig. 3 Overview of the trainable gesture selector, consisting of embedding, RNN, dense, dropout, and softmax layers. This module selects the appropriate robot gesture based on the selected response

The trainable gesture selector (Fig. 3), based on the idea of intent classification [17, 18], consists of embedding, LSTM, dropout, dense, and softmax layers. When the HCN generates a response, each word of the response is tokenized and transformed into a vector by the embedding layer. The sequence is then fed to the LSTM layer, followed by the dense and softmax layers, which output a probability distribution over gesture labels, each corresponding to one of the robot's pre-defined motion controls.
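
A minimal Keras sketch of such a gesture selector is given below; the vocabulary size, embedding dimension, and hidden size are assumptions, and the output dimension matches the eight gesture labels used for training (Sect. 4):

```python
import tensorflow as tf

# Vocabulary size, embedding dimension, and hidden size are assumptions.
VOCAB_SIZE, EMB_DIM, N_GESTURES = 5000, 100, 8

gesture_selector = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),            # token ids of the response
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(N_GESTURES, activation="softmax"),  # gesture label probabilities
])
gesture_selector.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
```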

4 Hospital Receptionist Robot System

The receptionist robot system consists of four components: sensory perception, the extended HCN, the Social Human-Robot Interaction (SHRI) framework [19], and the robot platform, as shown in Fig. 4. The automatic speech recognition and face detection modules serve as the robot's sensory perception. The extended HCN is responsible for generating the robot's response together with the gesture the robot should perform. The SHRI framework acts as a bridge between the robot platform and external modules such as sensory perception and the dialogue system. It manages non-verbal functions such as turn-taking, gaze, emotions, and gestures. All components are developed on top of the Robot Operating System (ROS). We test the system on NAO, a humanoid robot widely used for research and educational purposes, as shown in Fig. 5. The details of each module are described in the following paragraphs.

Fig. 4 Overall structure of the receptionist robot system. It consists of speech recognition, face detection, the extended HCN, the Social Human-Robot Interaction (SHRI) framework, and the robot platform

Fig. 5 The humanoid robot NAO, which we chose to test the receptionist robot system

First, the speech recognition part of the sensory perception module is built with Google Cloud Speech, which applies a neural network model to convert audio into text. Face detection, the other part of the sensory perception module, uses the face detection module provided by the NAO robot platform, slightly modified to track the target face in front of the robot.

Second, the extended HCN is the dialogue system that infers which response and which gesture are required from the recognized user utterance. To train the dialogue system, we used the hospital receptionist dataset introduced in [20]. The dataset consists of conversations between humans and the system and covers four different tasks: requesting a prescription, confirming an appointment, asking for the waiting time, and asking for a location.

To train the gesture selector, we extracted a total of 19,671 sentences with 8 corresponding labels (multiple choice, welcome, request, greeting, inform, confirm answer, thanks, closing) from the conversation corpus provided by the Microsoft conversation challenge [21]. The trained HCN is then loaded into the system as the dialogue system; if the highest probability of the selected response or the confidence of the ASR is less than 50%, the system re-prompts the user, so that we can build an interactive application for practical purposes.
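
A minimal sketch of this re-prompt policy, with a hypothetical re-prompt utterance and function name, is shown below:

```python
import numpy as np

CONF_THRESHOLD = 0.5  # 50%, as described above

def select_or_reprompt(asr_confidence: float, action_probs: np.ndarray,
                       templates: list) -> str:
    """Re-prompt when either the ASR confidence or the probability of the
    selected response template falls below the 50% threshold."""
    best = int(np.argmax(action_probs))
    if asr_confidence < CONF_THRESHOLD or action_probs[best] < CONF_THRESHOLD:
        return "Sorry, could you say that again?"   # hypothetical re-prompt
    return templates[best]
```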

Lastly, to integrate with the robot platform, we used the SHRI framework, a modular human-robot software framework that manages the robot's social behavior and domain tasks. By separating the domain task, which controls the execution flow of the scenario, from the social behavior handled by the framework, developers can reduce the effort required to implement non-verbal features of the robot such as turn-taking, gaze, emotions, and gestures. In this work, the domain task is the conversational task.

The framework consists of three main components: social perception, a social task controller, and an action renderer, as shown in Fig. 4. Social perception interprets situations based on the output of the sensory perception modules. For example, audio-visual saliency is continuously evaluated, the user's turn-taking intention is inferred, and the cognitive and emotional states of the interaction participants are estimated. The social task controller carries out actions requested by the domain task, using tag information such as saying, gazing, pointing, and facial expressions. For example, if the domain task requests "<sm=tag:greeting> Hello. My name is Silbot", the controller generates a portable motion command format called a semantic motion. The action renderer, a robot-dependent component, is responsible for executing the actual motor control by interpreting the semantic motion generated by the social task controller.
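
As an illustration only, a tagged request of this form could be split into its semantic-motion tag and speech text as follows (the parsing code and output format are assumptions, not the SHRI implementation):

```python
import re

def parse_tagged_request(request: str):
    """Split a domain-task request such as
    "<sm=tag:greeting> Hello. My name is Silbot"
    into the semantic-motion tag and the text to be spoken."""
    match = re.match(r"<sm=tag:(\w+)>\s*(.*)", request)
    if match is None:
        return None, request           # no gesture tag, speech only
    return match.group(1), match.group(2)

tag, speech = parse_tagged_request("<sm=tag:greeting> Hello. My name is Silbot")
# tag == "greeting", speech == "Hello. My name is Silbot"
```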

Fig. 6 The user study conducted to evaluate our proposed system. The slot-based method, which is widely used in commercial products, was set as the baseline

5 Experiment

5.1 Experimental Environments

The human-robot dialogue system was evaluated through a user study in which human subjects interacted with NAO acting autonomously using the system described above. All interactions were conducted in English, with the participant seated in front of the robot and the experimenter seated next to the robot to provide assistance if needed, as shown in Fig. 6. The participants were given a description of the overall hospital reception scenario, including the tasks to be completed. The scene was recorded from the participant's point of view, centered on the robot. Each session did not exceed 20 minutes.

A total of 20 people (7 males, 13 females) agreed to participate in our study. Their ages ranged from 19 to 30 (M=24.1, SD=3.4), where M denotes the mean and SD the standard deviation. Participants did not receive any financial compensation, and most of them were students with little or no previous experience of interacting with a robot. For a fair comparison, we assigned half of the participants to interact with the proposed system and the other half to the baseline.

To explore the benefits of the proposed system, we compared two conditions: a robot using the proposed system and a robot using the conventional method. As a baseline, we use the slot-based method, the workhorse of conventional pipeline dialogue systems. The slot-based method is widely used in commercial conversation systems such as Google Dialogflow, Amazon Lex, and IBM Watson, and predefines the structure of the dialogue state as a set of slots to be filled during the conversation [12].

We use Google's Dialogflow to implement the baseline system. The baseline calls our API to query our knowledge base and uses seven intents (welcome, prescription, check-in, silence, location, farewell, and replacement) and four slots (name, address, time, and location). For robot gesture selection, the baseline selects gestures according to a rule-based method in which the gesture used for each response is defined manually. A between-subject design was used to compare the two conditions, so different participants were assigned to different conditions.

Table 1 Dialogue efficiency and quality results in both conditions. M represents the mean value and SD represents the standard deviation

We collect various objective measures from the log files and video recordings. We consider two objective metrics used in the PARADISE framework [22, 23]: dialogue efficiency and dialogue quality. Dialogue efficiency is assessed using the elapsed time, the number of tasks completed, and the number of utterances made by the user and the robot during the experimental session. Dialogue quality is measured by the number of timeouts, the number of re-prompts, and the ASR confidence. Specifically, a timeout occurs when the user misses the speaking window in which the robot is listening. A re-prompt is counted when the robot asks the user the same question again to obtain specific information. Note that the robot system is designed to ask the user the same question if the response probability or the ASR confidence is less than 50%.
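
For illustration, the per-session measures could be organized as follows (the log schema and field names are hypothetical):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SessionLog:
    """Per-session measures extracted from a log file (hypothetical schema)."""
    elapsed_sec: float
    tasks_completed: int
    user_turns: int
    robot_turns: int
    timeouts: int
    reprompts: int
    asr_confidences: List[float]

def efficiency(log: SessionLog) -> dict:
    return {"elapsed_sec": log.elapsed_sec,
            "tasks_completed": log.tasks_completed,
            "utterances": log.user_turns + log.robot_turns}

def quality(log: SessionLog) -> dict:
    return {"timeouts": log.timeouts,
            "reprompts": log.reprompts,
            "mean_asr_conf": sum(log.asr_confidences) / len(log.asr_confidences)}
```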

To explore subjective measures, the participants were asked to fill out a questionnaire analyzing their perception of the robot. The questionnaire includes the user's overall rating and the Godspeed test [24], a measurement tool for HRI covering five key concepts: anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety.

An experimental scenario was designed to show how well the robot works as a receptionist. Participants were asked to complete as many of the given tasks as possible (asking for a prescription, checking in for a doctor's appointment, asking for the waiting time, and asking for the location of the bathroom). Every participant was asked to imagine that they were entering a hospital they had never visited before, where the robot was installed in the reception area to interact with patients. Before starting the experiment, they were asked to speak spontaneously in natural language. Moreover, they were given hints on how to communicate better with the robot, for example, “please wait for your turn to speak” and “please keep in mind that the robot only listens to you while its eyes turn blue”.

5.2 Experimental Result

Table 1 shows the experimental results in which participants performed the given tasks under both conditions. We perform a one-tailed T-test to determine whether there is a statistically significant difference between the two conditions. The null hypothesis \(H_0\) and its alternative \(H_a\) can be described as follows:

$$\begin{aligned} H_0: \mu _{C1}=\mu _{C2},\quad H_a: \mu _{C1}\ne \mu _{C2}, \end{aligned}$$
(2)

where \(C1\) and \(C2\) denote the two conditions. The significance level \(\alpha \) is set to 0.05.
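
For reference, such a test can be run with SciPy as sketched below; the samples are placeholders, not the data reported in Table 1, and the `alternative` argument requires SciPy 1.6 or later:

```python
import numpy as np
from scipy import stats

# Placeholder samples for the two conditions (not the reported data).
condition_1 = np.array([3, 4, 4, 4, 4, 3, 4, 4, 4, 4])   # proposed
condition_2 = np.array([4, 4, 4, 4, 3, 4, 4, 4, 4, 4])   # baseline

alpha = 0.05
# One-tailed independent-samples t-test (here H_a: mean of C1 < mean of C2).
t_stat, p_value = stats.ttest_ind(condition_1, condition_2, alternative="less")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < alpha}")
```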

The average number of tasks completed is 3.8 (SD=0.42) with the proposed method and 3.9 (SD=0.32) with the baseline, with no significant difference (one-tailed T-test, p = 0.24). However, there are significant differences in the number of user turns, the number of robot turns, and the elapsed time (one-tailed T-test, p = 0.028, 0.03, and 0.003, respectively). More specifically, users of the proposed method take on average 1.4 fewer turns than with the baseline method. Moreover, the robot with the proposed method takes on average 1.8 fewer turns, and the users have an average of 38.4 seconds shorter interaction time.

In terms of dialogue quality, the analysis shows no significant difference in the number of timeouts, the number of re-prompts, or the ASR confidence (one-tailed T-test, p = 0.07, 0.18, and 0.32, respectively). However, we find that the proposed method re-prompts on average 0.5 times more often.

We also analyze the robot perception and user satisfaction questionnaire to assess the acceptability of our proposed system compared to the baseline. The reliability of the questionnaire was tested by measuring its internal consistency with Cronbach's \(\alpha \), which was 0.89 (good consistency). Based on this value, we assume that the participants interpreted the robot characteristics covered by the questionnaire in the expected way in the given context. We averaged the 5-point Likert-scale ratings of the collected questionnaires. As a result, we found no significant difference between the two methods, i.e., the trainable gesture selector and the conventional method that manually defines the gesture for each response (one-tailed T-test, p = 0.24 and 0.39, respectively).
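
For reference, Cronbach's \(\alpha \) can be computed from a participants-by-items matrix of Likert ratings as sketched below (the ratings shown are placeholders, not our questionnaire data):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_vars / total_var)

# Placeholder 5-point Likert ratings for 5 participants and 4 items.
ratings = np.array([[4, 5, 4, 4],
                    [3, 4, 4, 3],
                    [5, 5, 5, 4],
                    [4, 4, 3, 4],
                    [2, 3, 3, 2]])
print(round(cronbach_alpha(ratings), 2))
```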

6 Discussion

Both conditions show similar results in the experiment with real users, but our proposed system performs better in terms of dialogue efficiency. To gain more insight, we performed a detailed analysis of the dialogue log files and recordings, which revealed that the proposed model tends to comprehend the dialogue context better than the baseline system. Fig. 7 shows an example dialogue between the proposed system and a user. In the same way as the work presented in [22], the whole dialogue can be divided into four sub-dialogues corresponding to the given tasks: check-in (\(\hbox {U}1\sim \hbox {R}5\)), collect prescription (\(\hbox {U}7\sim \hbox {R}7\)), ask waiting time (\(\hbox {U}8\sim \hbox {R}8\)), and ask location (\(\hbox {U}9\sim \hbox {R}9\)). Silence means that the user did not provide any utterance. The example shows that collected information such as the name and address carries over naturally to other tasks (from check-in to collect prescription in this case) and that the system notices when the conversation is about to end (\(\hbox {U}10\sim \hbox {R}10\)).

Fig. 7 An example dialogue between the proposed system and a real user

On the other hand, the baseline system requests information (\(\hbox {U}8\sim \hbox {R}10\)) that has already been collected in the previous task (\(\hbox {U}1\sim \hbox {R}6\)), as shown in Fig. 8. We found the baseline system to be less suitable for complex conversations in terms of dialogue efficiency. It is somewhat inefficient at completing the given tasks, and although this could still be mitigated with handcrafted rules, real scenarios are more complicated and cannot easily be covered by such rules. In contrast, the proposed approach has the potential to learn such latent rules as the scenario becomes large and complex.

The analysis of the questionnaire shows no significant difference between the two conditions. In both methods, the gestures selected according to the robot's responses appear to be perceived equally by the users. However, we find that the proposed method is still superior in terms of development efficiency, because it automatically selects an appropriate gesture from the robot's response, whereas in the baseline the developer has to define each gesture for each response in advance.

Fig. 8 An example dialogue between the baseline system and a real user

7 Conclusion and Future Work

We presented and evaluated our autonomous hospital receptionist robot, which uses an end-to-end approach to generate responses and gestures. To this end, we extended HCN to select not only a response but also a proper gesture based on the generated response. We experimented with real users and found that our proposed system has an advantage in terms of dialogue efficiency, which indicates how efficiently users achieved the given tasks with the receptionist robot. Moreover, participants perceived no difference between the proposed and baseline systems in terms of robot perception, which means that no additional handcrafted work is required to define the robot's gesture for each response.

In future work, several improvements are possible to extend what is achievable for a receptionist robot with the end-to-end approach. The dialogue system was limited to a rather small domain; tests on other, perhaps broader, domains would be needed to see how the approach scales. Regarding the robot's behavior, we tested how users feel when using the receptionist robot, but the robot is equipped with only minimal features at this stage; combining verbal interaction with the robot's gesture and gaze would be a good example of an extension. To improve the user experience, we can extend our work to more diverse forms of expression, such as the robot's facial expression, voice pitch, and so on.