
1 Introduction

As the WHO reports, the number of patients waiting to be seen by psychiatrists has grown steadily over the past few years (WHO highlights urgent need to transform mental health and mental health care, 2022). Long waits for an appointment or for therapy are detrimental to psychiatric patients. As more people struggle with mental disorders, new solutions must be found to help them. One promising approach is to use innovative technology to supplement the therapist’s work. Some therapies, such as cognitive behavioral therapy (CBT) (Fenn & Byrne, 2013) or exposure techniques in virtual reality (Sherrill & Rothbaum, 2023), have a straightforward, well-defined, and organized structure that enables them to be integrated efficiently into technological tools (Dino et al., 2019). One example is the use of chatbots for psychological therapies, e.g., the chatbot Woebot, designed for English-speaking individuals, which has helped treat depression (Fitzpatrick et al., 2017). Another example is an avatar-based dialogue system that, under human supervision, effectively treats auditory hallucinations (Craig et al., 2018; Stefaniak et al., 2019). When discussing such dialogue agents, one must bear in mind the different input signals these systems might receive. Achieving a level of intuitive interaction between humans and computers comparable to human-to-human interaction requires multimodal interfaces (Duarte, 2007). From simple conversations to therapy sessions with specialists, human-to-human interaction is naturally multimodal, using speech, gesture, gaze, etc. In the literature, multimodality is described as the combination of multiple senses and modes of communication, including sight, sound, print, images, video, and music, that make up a message (Dressman, 2019). To achieve more natural, human-like interaction, multimodality has become one of the most important research directions to follow, and in the digital age it is ever more important for communication.

Multimodal fusion is a technique that combines different input modalities to enable systems to understand human behavior better. By combining multiple data sources, user intent can be interpreted more accurately, bringing the interaction closer to what users know from human-to-human communication (Duarte & Carriço, 2006). As technology evolves, multimodal fusion will likely become an integral part of human–machine interaction, and multimodal capabilities are already becoming more widespread. One example is the use of eye-trackers in the field of human–computer interaction (Chandra et al., 2016; Majaranta & Bulling, 2014; Santini et al., 2017). Most eye-trackers use near-infrared light to track eye movements (Mulvey et al., 2008). The most important parameters obtained during an eye-tracking session include pupil diameter and eye movement parameters (saccades and fixations), among many others (e.g., number of blinks, blink duration, time to first fixation (TTFF)) (Duchowski, 2017). Saccades are fast, simultaneous movements of both eyes between two or more fixation points. The choice of a particular parameter for analysis always depends on the purpose of the experiment. The eye-tracking signal has therefore become of interest to a broader public, e.g., in psychology (Hershaw & Ettenhofer, 2018; Krejtz et al., 2018; Pfleging et al., 2016), education (Was et al., 2016), marketing (Białowąs & Szyszka, 2019; Wedel, 2014), and medicine (Bartošová et al., 2018). For years, eye-trackers have also been used to study human–computer interaction. One example in this area is Embodied Conversational Agents (ECAs), which have received much attention; they use nonverbal behavior to establish contact with a human user (Bailly et al., 2006; Bee et al., 2009). For such a dialogue to become more reliable, these agents should be equipped with communicative and expressive abilities similar to those we know from human-to-human interaction (speech, gestures, facial expressions, gaze, etc.) (Bee et al., 2009). Increasingly, eye-trackers are used as an additional source of information during users’ conversations with a dialogue agent (Bailly et al., 2006; Bee et al., 2009), and the data obtained are used to support or control conversational agents. Examples of dialogue agents that take the participant’s eye movements into account include Gandalf, a humanoid agent that narrates planets and moons (Thórisson, 1997), the interactive storyteller Emma (Bee et al., 2010), and a dialogue agent that interacts with seniors (Amorese et al., 2022). Eye-tracking studies make it possible to obtain several parameters characterizing, for example, a person’s emotional state (Bradley et al., 2008) and concentration (Chang & Chueh, 2019). The resulting human–computer interaction is more natural, and dialogue agents can interact more realistically with people. It is a prime example of how multimodal fusion can create immersive user experiences.
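To illustrate how such parameters are derived from raw gaze data, the following minimal Python sketch classifies gaze samples into fixations with a simple dispersion-threshold scheme; the thresholds, the 60 Hz rate, and the data layout are assumptions for the example, not the algorithm of any particular eye-tracker.

```python
def detect_fixations(samples, max_dispersion=0.02, min_duration_s=0.1, rate_hz=60):
    """samples: list of (x, y) normalized gaze points; returns fixation windows."""
    min_len = int(min_duration_s * rate_hz)  # minimum samples for a fixation
    fixations, window = [], []
    for point in samples:
        window.append(point)
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        # Dispersion of the candidate window: spread in x plus spread in y.
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            if len(window) - 1 >= min_len:
                fixations.append(window[:-1])  # stable span = one fixation
            window = [point]  # the gaze jumped (saccade): restart the window
    if len(window) >= min_len:
        fixations.append(window)  # close a fixation still open at the end
    return fixations

# Example: 30 near-identical samples (one fixation), then a saccade-like jump.
gaze = [(0.50, 0.50)] * 30 + [(0.90, 0.20)] * 5
print(len(detect_fixations(gaze)))  # -> 1
```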

In response to this growing problem in healthcare, our research team has developed a goal-oriented dialogue system called Terabot, which implements elements of CBT. Initial results from dialogues with seven psychiatric patients are presented in Gabor-Siatkowska et al. (2023a, 2023b), where we measured the accuracy of speech and intent recognition and compared the results with those of healthy subjects. During the therapy sessions at the Institute of Psychiatry and Neurology in Warsaw, Poland, we additionally used a Gazepoint GP3 eye-tracker to collect eye-tracking data from psychiatric patients while they interacted with the dialogue system. In this chapter, we describe the problems encountered during these dialogues between patient and agent and show how integrating the eye-tracking signal into the dialogue system may prevent errors and miscommunication. Our chapter is structured as follows: Sect. 1 briefly reviews related work in our research field. Next, in Sect. 2, we describe Terabot, the therapeutic spoken-dialogue system. In Sect. 3, the conducted experiments are described. In Sect. 4, we present the problems that occurred during conversations between the patients and our dialogue agent. Section 5 describes our proposed solutions to these problems. The chapter concludes with a discussion and further work.

2 Terabot—The “Empathetic” Dialogue System

As mentioned earlier, the number of psychiatrists is insufficient for the growing mental health problems in today’s societies (WHO highlights urgent need to transform mental health and mental health care, June 17, 2022). In response to this issue, and to support psychiatric patients, we have developed “Terabot,” a goal-oriented therapeutic dialogue system. This speech-to-speech system operates in the Polish language. It is designed to meet the needs of psychiatric patients dealing with complex and overwhelming emotions and can help them recognize their feelings and behaviors in difficult situations. In each session, the patient can choose from three emotional topics: anger, shame, and fear. At the end of the session, Terabot offers a relaxation exercise to help the patient calm down. Importantly, and to the benefit of patients, all of Terabot’s responses have been reviewed and edited by psychiatrists, and the dialogue flow has been designed according to psychiatrists’ recommendations using elements of CBT.

  • The Architecture of Terabot

Our goal was to minimize the risk of the robot giving a wrong response, so we designed a goal-oriented dialogue system. Figure 1 shows a schematic diagram of the architecture of our conversational agent.

Fig. 1

Block diagram of Terabot dialogue system

When a patient starts speaking, an automatic speech recognition (ASR) system converts the speech to text (using the Google Web Speech API). This text is then analyzed to identify intents and slots, for which the Dual Intent and Entity Transformer (DIET) classifier (Bunk et al., 2020), a multitask transformer, is used. It handles both intent and entity recognition and, as a sequence model, takes the order of words into account. In parallel with intent and slot recognition, the ASR output is passed to a text-based emotion recognition module (Zygadło et al., 2021) based on the Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned for emotion classification. The emotional state currently detected in the patient’s spoken text determines the value of the “emotion” slot; another slot is filled with the exercise type. Recent further development of the emotion and sentiment recognition used in our dialogue system, based on big data and deep learning methods, is reported in our article (Kozłowski et al., 2023). Our system uses RASA, an open-source dialogue management and natural language understanding (NLU) framework, to implement the action decision pipeline and the DIET classifier. RASA decides the next system action using a weighted combination of a memoization policy (based on stories stored in memory), a rule policy, and the Transformer Embedding Dialogue (TED) policy (Vlasov et al., 2019). The TED policy takes into account the current state of the dialogue (the patient’s intent), slot values (including the recognized emotion), and previous dialogue states. If the next action is an utterance, it is selected from the utterance database. The last step is to transform the selected text (Terabot’s utterance) into a speech signal, which is Terabot’s response to the patient. Since Terabot is a domain-specific dialogue system, acquiring large samples of real data to enlarge the training database was difficult. In Gabor-Siatkowska et al. (2023a, 2023b), we present how to enlarge the dataset using another widely known AI tool, ChatGPT. Our proposed solution increased the dialogue system’s intent recognition accuracy on patients’ spoken utterances.
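For orientation, the following Python sketch mirrors the data flow described above in schematic form. The classifier and policy internals are stubbed out, and all names here are illustrative assumptions rather than Terabot’s actual code.

```python
# Schematic sketch of one dialogue turn: ASR text -> intent + emotion ->
# slot filling -> policy decision -> utterance (then synthesized to speech).
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    intent: str = ""
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def classify_intent(text: str) -> str:      # stands in for the DIET classifier
    return "affirm" if text.strip().lower() in ("tak", "yes") else "inform"

def classify_emotion(text: str) -> str:     # stands in for the BERT emotion model
    return "neutral"

def next_action(state: DialogueState) -> str:  # stands in for the TED policy
    # A real policy weighs intent, slot values, and the dialogue history.
    return "utter_ask_details" if state.intent == "inform" else "utter_continue"

def handle_turn(asr_text: str, state: DialogueState) -> str:
    state.intent = classify_intent(asr_text)
    state.slots["emotion"] = classify_emotion(asr_text)  # fills the emotion slot
    state.history.append(state.intent)
    return next_action(state)  # the chosen utterance is then sent to TTS

state = DialogueState()
print(handle_turn("tak", state))  # -> "utter_continue"
```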

3 Sessions with Terabot

The therapy sessions with Terabot took place between March and August 2023 at the Institute of Psychiatry and Neurology in Warsaw, Poland. This study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Institute of Psychiatry and Neurology in Warsaw. The first experiments, with seven psychiatric patients talking to Terabot, yielded 600 recordings and showed promising results (Gabor-Siatkowska et al., 2023a, 2023b). Most patients confirmed that Terabot understood their answers and emotions (around 3.5 on the Likert scale). All patients rated Terabot’s speech as high quality, natural, and fast, and liked how it was presented.

In total, 38 participants between the ages of 18 and 65 took part in our study. All participants were admitted to a 24-h psychiatric hospital, where psychiatrists examined them and qualified them for additional therapy sessions with Terabot based on their diagnosed conditions. These patients had been diagnosed with F20.0 to F20.9 (schizophrenia) according to ICD-10 (WHO, International Classification of Diseases). They were treated with medication, including antipsychotics, mostly in combination with antidepressants or mood stabilizers. They were randomly selected for the study after meeting the inclusion criteria and signing a consent form; the study was a randomized clinical trial with random assignment to the different experimental conditions. Psychiatrists informed the patients about the purpose and conduct of the clinical trial and addressed any questions they might have had. Patients consented to the recording of their voice and eye-tracking data. For conversational agents designed for psychiatric patients, it is important that no image or video of the patient is recorded during the conversations; neither would patients consent to this, nor would we obtain the bioethics committee’s approval. The anonymity of patients participating in the study was guaranteed.

The patients talked to Terabot in five sessions per week, i.e., one session per day. Each conversation lasted about 7–15 min. The patients could choose one of three exercises: anger, fear, or shame, which could then be repeated or changed the next day. A fragment of such an exercise is shown in Fig. 2. The patients sat in front of Terabot, with an assistant invited into the same room; this was required by the psychiatric institute for the patient’s safety. Figure 3 shows a demonstration setup with a monitor and an eye-tracker at the bottom of the screen.

Fig. 2

The diagram shows a fragment of a conversation in which the patient chooses the exercise on shame. As can be seen, the intention in the patient’s utterance is analyzed, and the next Terabot response is matched to it

Fig. 3

Patient sitting in front of Terabot’s interface. Their dialogue is conducted speech to speech. The eye-tracker is situated at the bottom of the screen

4 Frequent Problems During Dialogues

Despite extensive testing, some unforeseen problems occurred during the therapeutic sessions. They caused unintentional discomfort to the patients, sometimes leading to misunderstandings during the conversation. As explained in Sect. 2, the RASA framework is responsible for sending follow-up utterances after receiving information about the recognized intention of the patient’s speech. Even though we adapted RASA as recommended by its guidelines, we found recurring patterns of problematic issues; further problems during the dialogues were reported by the assistants. All of these are described below.

  • Issue 1: Waiting Too Long for the Patient’s Answer

The first common problem is presented in Fig. 4. Situations frequently occurred in which the patient did not respond to a question posed by Terabot for an extended time. In these situations, there was no message from or to Terabot. It was also impossible for the assistants to ask the patient anything or to repeat the dialogue agent’s question, as this would have interrupted the dialogue flow. The assistant did not know how the system would react to such a situation: whether it would respond and remind the patient, or hang up because it had waited too long for the response.

After the conversation ended, the patients were asked about the long waiting time. Some responded that they had been thinking about what the best answer to the question would be; others said that their thoughts had drifted elsewhere, and it took them some time to come back to the conversation and give an answer. In all those situations, the dialogue system simply waited. It was unclear whether the patient would eventually respond at all; they might even have left the room, and the dialogue agent would still be waiting for an answer. Such situations had to be handled appropriately.

The patients’ speech itself was also a challenge for our dialogue system. The patients who took part in our trials differed in the speed and complexity of their responses. Some answered quickly and superficially, sometimes at low volume. Others provided complete but brief answers without much explanation. Another group responded verbosely, without waiting for Terabot to finish the question. Yet another group raised their voices at Terabot, literally shouting, regardless of how extensive or complex their response was. There were also patients suffering from logorrhea, a communication disorder that causes excessive use of words and repetition, which can lead to incoherence; it is sometimes called pressure of speech (Karbe, 2014).

Long before the trials in the hospital, our research group experimented with Terabot’s waiting-for-response mode. We tested how quickly and for how long the ASR system should be active after Terabot finishes a statement or question. Predicting how long a patient will take to respond is difficult, especially considering the individual differences in speech described above. We therefore configured RASA to wait for the patient’s answer for as long as needed (infinite time when required). Unfortunately, it was not apparent during testing that this seemingly straightforward solution would cause problems in real patient conversations; had it not been for the assistants’ reports, we probably would not have known about the problem. This situation disturbs not only a single question-and-answer exchange with the patient but the entire therapy session as a whole. Such situations may lead to unwanted interruptions and could prevent the dialogue system from functioning correctly.
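To make the trade-off concrete, here is a minimal sketch (with assumed names) of a listening loop with an optional timeout. Calling it with timeout_s=None reproduces the indefinite wait we configured, while a finite value would let the system trigger a reminder instead.

```python
import time

def wait_for_answer(asr_has_speech, timeout_s=None, poll_s=0.25):
    """Poll the ASR until speech arrives. timeout_s=None waits indefinitely,
    which is the configuration that led to Issue 1 in the trials."""
    start = time.monotonic()
    while not asr_has_speech():
        if timeout_s is not None and time.monotonic() - start > timeout_s:
            return None  # no answer in time: the caller can remind the patient
        time.sleep(poll_s)
    return "speech_detected"

# Example with a stub ASR that never hears anything: returns None after 2 s.
print(wait_for_answer(lambda: False, timeout_s=2.0))
```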

  • Issue 2: Lack of Knowledge About Patient Behavior

Another common problem we observed in the trials is shown in Fig. 5. As the dialogue progresses, the patients are at some point asked whether they would like to participate in a relaxation exercise to help calm their emotions. After a positive reply, Terabot proceeds with a relaxation exercise that lasts a few minutes and consists of therapeutic statements provided and recommended by psychiatrists. This is also the time for the patients to train their concentration. Unfortunately, in contrast to real human–human interaction, the trials showed that there is no way of knowing the patient’s current behavior: while Terabot is speaking, no information is gathered on whether the patient is present and listening or has perhaps walked away. If the assistants had not been present during this phase of the dialogue, we would have had no information about the patient’s behavior. Only at the end of the relaxation exercise is the patient asked about their feelings and whether the exercise was helpful.

Fig. 4

Frequently occurring problem during conversations with the dialogue system: the microphone waits indefinitely for the patient’s answer (the vertical gray arrow indicates the decision on the system’s next chosen utterance)

Fig. 5

Another frequently occurring problem during conversations with the dialogue system: no information on the patient’s state or presence during Terabot’s proposed relaxation exercise (the vertical gray arrow indicates the utterance going to the RASA framework)

For patients staying at the psychiatric hospital, the speech-to-speech dialogue agent is also designed to train concentration. The psychiatrists who co-designed this part wanted a sequence of meaningful sentences without requiring the patients to answer, so that patients could focus on their inner thoughts and calm overwhelming emotions. Unfortunately, in the relaxation exercise mode, our dialogue system receives no feedback from the patient, so there is no certainty that the patient is even in front of Terabot’s interface. The only way to be sure is for the assistant sitting in the room to confirm that the patient has been trying to follow the commands given during the exercise and to concentrate at all. The performance of the dialogue system during the relaxation exercise therefore needs to be improved.

Once again, we noticed different patient behaviors during the experiments. These included, for instance, a group trying to concentrate and follow the instructions, but also another group looking away and thinking about anything other than coping with their feelings.

In the present version of our dialogue system, there is no means of contacting the agent other than voice commands. Our tests with patients have brought to light the need for additional control or feedback on whether the patients were actually seated in front of Terabot. It is crucial to know whether the patients have been participating in the relaxation exercise at all. This exposes the need for more than one modality in specialized dialogue agents such as our Terabot.

5 Solutions by Using Eye-Tracker

In parallel with our dialogue agent, we used an eye-tracker to monitor the patients’ eye behavior during the conversations: a stationary Gazepoint GP3 model with a selected sampling frequency of 60 Hz. As our study involved psychiatric patients, for whom concentration is a significant challenge, the psychiatrists advised us not to use a mobile eye-tracker, as it could negatively affect the patients’ behavior during the conversations. When conducting an experiment with a GP3 eye-tracker, various parameters can serve as output signals. Table 1 contains a general overview of the output parameters, divided into groups.

Table 1 General overview of the parameters obtained from the eye-tracker (using the example of the Gazepoint GP3 eye-tracker)

In this study, the eye-tracking data were recorded during the sessions and analyzed afterward; we did not use the data in real time. Once we have implemented the eye-tracker signal in our dialogue system, we will conduct further experiments with real-time data.

  • Choosing the Suitable Parameter to Observe Patient Behavior

When considering an eye-tracking signal for a feedback loop, it is important to determine which particular signal should be passed back to the RASA framework. Researchers, especially in psychology or marketing, who are interested in where a participant is looking typically analyze fixation-related parameters (Krejtz et al., 2023; Porta et al., 2013); in our case, these would be, e.g., FPOGX/Y, the screen coordinates (X and Y) at which the subject’s gaze is focused. Unfortunately, using such a parameter as feedback to the framework would not be desirable in our situation. Since this signal carries on-screen fixation information (either FPOGV or FPOGX/Y), it would only provide information when the patient was looking directly at the screen. Selecting it as feedback would thus cause undesirable actions by the dialogue system: during dialogues with our agent, there will not be 100% eye contact throughout the conversation, just as there is not in human–human conversations. The patient’s gaze may be focused on an object next to the monitor during the conversation, providing no information about fixation on the screen.

In our study, the essential question is whether the patient is present in front of Terabot’s interface, not whether the patient is looking at the screen. Therefore, we suggest that our dialogue system consider the parameters describing the patient’s pupils, namely RPV/LPV or RPMM/LPMM. Even if the patient looks not at Terabot’s interface itself but somewhere away from the screen, there is a broad working window in which the patient’s pupils remain visible to the eye-tracker, so the pupil signal remains valid. With this approach, information indicating whether the patient is present could be sent back to the RASA framework.
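A sketch of such a presence check is given below. It assumes Gazepoint’s Open Gaze API, in which the eye-tracker streams XML records over a local TCP socket; the exact commands, port, and field names should be verified against the vendor documentation, and the parsing here is deliberately simplified.

```python
import socket
import xml.etree.ElementTree as ET

def pupil_presence(host="127.0.0.1", port=4242, n_samples=60):
    """Return the fraction of samples in which at least one pupil was valid."""
    with socket.create_connection((host, port)) as sock:
        # Ask the tracker to stream pupil data (assumed Open Gaze API commands).
        sock.sendall(b'<SET ID="ENABLE_SEND_PUPIL_LEFT" STATE="1" />\r\n')
        sock.sendall(b'<SET ID="ENABLE_SEND_PUPIL_RIGHT" STATE="1" />\r\n')
        sock.sendall(b'<SET ID="ENABLE_SEND_DATA" STATE="1" />\r\n')
        valid, seen, buffer = 0, 0, b""
        while seen < n_samples:
            buffer += sock.recv(4096)
            while b"\r\n" in buffer:
                line, buffer = buffer.split(b"\r\n", 1)
                if not line.startswith(b"<REC"):
                    continue  # skip acknowledgments and other messages
                rec = ET.fromstring(line.decode())
                seen += 1
                # LPV/RPV are 1 when the left/right pupil is detected as valid.
                if rec.get("LPV") == "1" or rec.get("RPV") == "1":
                    valid += 1
    return valid / max(seen, 1)

# The patient could then be classified as present when, say,
# pupil_presence() exceeds an assumed threshold such as 0.2.
```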

  • Proposed Solution to Issue 1

Figure 6 illustrates the solution proposed by our research team using the eye-tracker. The system activates the microphone, analyzes the patient’s response, and verifies the presence of the patient’s eye-tracking signal. Based on this information, a better-adjusted intent for Terabot’s utterance is chosen.

Fig. 6

Proposed solution to Issue 1: the vertical gray arrows indicate that an utterance has been selected for Terabot’s speech. The vertical green arrow shows that the eye-tracking signal can give additional feedback to the RASA framework. Based on this, the utterance chosen for Terabot’s following action might be appropriately changed

We evaluate four hypothetical situations (a compact decision sketch follows the list):

  1. Speech signal and eye-tracking signal provided: The patient answers, and the microphone receives the patient’s speech. In this case, feeding the eye-tracker signal back to RASA would provide live confirmation that the patient is currently present in front of Terabot’s interface and actively participating in the discussion. Notably, maintaining a 100% eye data signal (eye contact) is unnecessary, as such a level is also uncommon in human-to-human conversations; a smaller amount of eye-tracker data is sufficient to classify the patient as present. If the patient gazes at Terabot occasionally while thinking or talking, the dialogue system might receive enough data for confirmation. On the other hand, when the wait for the patient’s response takes too long, a special utterance acknowledging the patient might be activated, e.g., the utterance Long_no_gaze: “What are you thinking about? Should I repeat the question?”

  2. Only speech signal provided: The patient answers, so the microphone receives the patient’s speech, but there is no signal from the eye-tracker. In this situation, an additional utterance could be activated to inquire about the patient’s condition, e.g., the utterance Why_no_gaze: “Is something wrong? Why don’t you look at me at all?”

  3. Only eye-tracking signal provided: The patient does not answer, but the eye-tracking signal is present (and going back to the RASA framework). Here, one must consider how long Terabot should stand still without any additional response, given that the patient is present in front of Terabot’s interface and may just need more time to formulate an answer. After some time, an utterance should be activated, e.g., the utterance No_looking: “Is everything okay? Does something bother you? Should I maybe repeat the question?”

  4. No speech or eye-tracking signal: The patient does not speak, and there is no speech signal from the microphone and no signal from the eye-tracker. In this situation, an utterance putting the whole dialogue into a hold-on mode would be needed. For example, the utterance Are_you_there, with the statement “Hello, are you there? Do you want to continue our conversation?”, should be activated to find out whether the patient is still in front of Terabot’s interface. Depending on the reappearance of the eye-tracking signal or the patient’s speech, further actions should be considered.
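The four cases can be summarized as one decision function. The sketch below is only a schematic mapping from the two presence signals to the special utterances introduced above; the trigger conditions (e.g., what counts as waiting too long) are simplified assumptions.

```python
def choose_followup(speech: bool, gaze: bool, waited_too_long: bool):
    """Map the four signal combinations to the proposed special utterances.
    Returns None when the normal dialogue flow should simply continue."""
    if speech and gaze:
        # Case 1: present and answering; acknowledge only after a long pause.
        return "utter_Long_no_gaze" if waited_too_long else None
    if speech and not gaze:
        return "utter_Why_no_gaze"    # Case 2: answering but never looking
    if not speech and gaze:
        # Case 3: present but silent; allow extra time before prompting.
        return "utter_No_looking" if waited_too_long else None
    return "utter_Are_you_there"      # Case 4: no signal: hold the dialogue

print(choose_followup(speech=False, gaze=False, waited_too_long=True))
# -> "utter_Are_you_there"
```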

  • Proposed Solution to Issue 2

Looking at the situation described in Fig. 5, what is missing is additional information about the patient that might confirm the patient’s presence during the relaxation exercise. Our solution is to let the RASA framework receive input from the eye-tracker while the patient participates in the exercise. If the feedback signal contains information about the patient’s detected pupil diameter, this is sufficient confirmation that the patient is indeed present in front of Terabot’s interface. We do not require 100%, or even close to 100%, of the eye-tracking data as confirmation of presence; a smaller amount of data sent to the RASA framework is sufficient.

RASA would then be able to activate utterances depending on the patient’s behavior (understood here as the presence of the pupil signal received by the eye-tracker). For instance, if a patient consents to a relaxation exercise and the pupillary signal is recorded, various situations can be identified in which the dialogue system could enhance its responsiveness (Fig. 7). The following scenarios could be considered (a monitoring sketch follows the list):

  1. The patient agrees to the exercise, and the system receives feedback on the pupil signal: this confirms that the patient is present in front of Terabot’s interface and wants to participate in the agent’s relaxation exercise, during which they can focus on their thoughts and emotions.

  2. The patient agrees to the exercise, but the RASA system receives an indication that there is no signal from the patient’s pupil. In this situation, it can be assumed that the patient is not present in front of the screen or has closed their eyes. As a result, a special utterance can be activated with an additional question or request. This gives the patient room to explain what is going on and lets the system adapt to the patient’s behavior.

  3. The patient does not agree to the relaxation exercise: this situation is already handled in Terabot’s dialogue scenarios.

    Fig. 7

Illustration of the proposed solution to Issue 2: the patient’s utterance (vertical gray arrow) goes back to the RASA framework, and the eye-tracking signal (green arrow) serves as an additional input signal. On the basis of these two signals (speech and eye-tracking), a suitable utterance is formed as Terabot’s response
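A sketch of how this could work during the relaxation exercise is given below: pupil validity is sampled in short windows, and whenever the share of valid samples drops below an assumed presence threshold, a check-in utterance is triggered (scenario 2). The function get_pupil_validity_ratio stands in for the feedback channel from the eye-tracker; its name, the threshold, and the utterance name are assumptions.

```python
import random

def monitor_relaxation(get_pupil_validity_ratio, exercise_s=180,
                       window_s=10, presence_threshold=0.2):
    """Yield a check-in action whenever presence drops during the exercise."""
    for _ in range(0, exercise_s, window_s):
        ratio = get_pupil_validity_ratio(window_s)  # share of valid pupil samples
        if ratio < presence_threshold:
            # Scenario 2: exercise accepted but no pupil signal -> check in.
            yield "utter_check_presence"

# Example with a stub that simulates a patient intermittently looking away.
for action in monitor_relaxation(lambda w: random.random(), exercise_s=60):
    print("Triggered:", action)
```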

6 Discussion

We have highlighted the probable advantages and dialogue system enhancements that could be achieved by integrating an eye-tracker module into the conversational agent. In the case of our CBT-oriented dialogue agent, we are also aware of the potential difficulties that could arise from such an extension. The following challenges may arise:

  • Investing considerable development effort in linking the two systems, i.e., connecting the eye-tracker module to the RASA system while taking pupillometric parameters into account.

  • The need to remodel the dialogue flow diagram in the RASA framework.

  • The need to expand the dialogue scenarios, which must be approved by psychiatrists.

Despite these challenges, we have shown that the presence of the patient’s pupillary signal offers a way to potentially enhance conversations with Terabot. We expect this to change the dialogue flow to the patient’s benefit, since the dialogue system would be more sensitive to human behavior. We believe that implementing a multimodal dialogue agent for therapy purposes would improve conversational adaptability. Through this multimodal approach, the interaction between the patient and the dialogue agent might more closely resemble the human-to-human interaction we know from conventional therapeutic sessions with specialists, making it possible to obtain an enhanced, unique agent for CBT therapy purposes.

7 Limitations

Having discussed the contribution and implications of the study, some shortcomings need to be considered critically. When analyzing eye-tracking data from patients during a conversation, it is important to bear in mind that a patient usually moves a bit while sitting. In general, to obtain data of the highest possible quality with a stationary eye-tracker, the participant’s head should be kept as still as possible. We did not impose this requirement on the psychiatric patients, as it could have compromised their comfort during therapy. Consequently, the first analyses of the collected eye-tracking data revealed many gaps due to the patients’ head movements (e.g., in the relaxation phase of the exercises). Note that in our pilot study, Terabot was used to support drug therapy for psychiatric patients with schizophrenia.

Another limitation is that we have not yet implemented the integration of the eye-tracker signal into our dialogue system. Therefore, we cannot yet say whether the proposed solutions are fully satisfactory, nor can we predict whether combining Terabot with the eye-tracker will cope with all the challenges that arise during therapy sessions.

8 Conclusion

As technology continues to evolve, multimodal fusion will become a standard part of human–machine interaction. The use of multimodal features in dialogue systems is increasingly common (Amorese et al., 2022; Bailly et al., 2006; Bee et al., 2009). New technologies in the fields of psychiatry and psychology are being designed to support or supplement therapy in various ways (Carroll & Rounsaville, 2010; Stefaniak et al., 2019).

In this chapter, we have shown how a pupillary signal can be used as an additional source of information for a goal-oriented conversational agent. We pointed out the problems we encountered in our study when using a single-modality dialogue agent and demonstrated how an eye-tracker can address such situations. When the pupillary signal is used as input to the dialogue system’s RASA framework, the utterances given back to the patient can be modified more appropriately.

9 Further Work

In our case, the next step will be experimental testing to determine whether the proposed solutions are effective for our dialogue agent. After that, we plan to analyze the eye-tracking data over the whole relaxation exercise. We want to find out whether the Terabot-guided dialogues have an effect, e.g., on the patient’s mental load or on achieving a state of relaxation. If we can improve Terabot, we plan to expand the research in two ways. First, to help people with different mental illnesses, we would like to add further sets of exercises (emotions) that patients could talk about. Second, expanding the research and testing the exercises on healthy people might also help persons in challenging circumstances, such as students struggling with strong negative emotions before an exam or a thesis defense.

We believe that although Terabot must not substitute for a visit to a psychiatrist, this dialogue agent can help patients work through difficult situations in everyday life. As such assistive dialogue agents evolve, we believe they can help patients and provide them with therapeutic support much faster.