
1 Introduction

As the WHO reports, the number of patients waiting to be seen by psychiatrists has grown steadily over the past few years (WHO highlights urgent need to transform mental health and mental health care, 2022). Long waits for an appointment or for therapy are detrimental to psychiatric patients. As more people struggle with mental disorders, new solutions must be found to help them. One promising approach is to use innovative technology to supplement the therapist’s work. Some therapies, such as cognitive behavioral therapy (CBT) (Fenn & Byrne, 2013) or exposure techniques in virtual reality (Sherrill & Rothbaum, 2023), have a straightforward, well-defined, and organized structure that enables them to be integrated efficiently into technological tools (Dino et al., 2019). One example is the use of chatbots for psychological therapies, e.g., the chatbot Woebot, designed for English-speaking individuals, which has helped treat depression (Fitzpatrick et al., 2017). Another example is an avatar-based dialogue system that, under human supervision, effectively treats auditory hallucinations (Craig et al., 2018; Stefaniak et al., 2019). When discussing such dialogue agents, one must bear in mind the different input signals these systems might receive. Achieving a level of intuitive interaction between humans and computers comparable to human-to-human interaction requires multimodal interfaces (Duarte, 2007). From simple conversations to therapy sessions with specialists, human-to-human interaction is naturally multimodal, using speech, gesture, gaze, etc. In the literature, multimodality is described as the combination of multiple senses and modes of communication, including sight, sound, print, images, video, and music, that make up a message (Dressman, 2019). To achieve more natural, human-like interaction, multimodality has become one of the most important research directions to follow, and in the digital age it is ever more important for communication.

Multimodal fusion is a technique that combines different input modalities to enable systems to understand human behavior better. By combining multiple data sources, user intent can be interpreted more accurately, bringing the interaction closer to what users know from human-to-human communication (Duarte & Carriço, 2006). As technology evolves, multimodal fusion will likely become an integral part of human–machine interaction, and multimodal capabilities are already becoming more widespread. One example is the use of eye-trackers in the field of human–computer interaction (Chandra et al., 2016; Majaranta & Bulling, 2014; Santini et al., 2017). Most eye-trackers use near-infrared light to track eye movements (Mulvey et al., 2008). The most important parameters obtained during an eye-tracking session include pupil diameter and eye movement parameters (saccades and fixations), among many others (e.g., number of blinks, blink duration, time to first fixation (TTFF)) (Duchowski, 2017). Saccades are fast, simultaneous movements of both eyes between two or more fixation points. The choice of a particular parameter for analysis always depends on the purpose of the experiment. The eye-tracking signal has therefore become of interest to a broader public, e.g., in psychology (Hershaw & Ettenhofer, 2018; Krejtz et al., 2018; Pfleging et al., 2016), education (Was et al., 2016), marketing (Białowąs & Szyszka, 2019; Wedel, 2014), and medicine (Bartošová et al., 2018). For years, eye-trackers have also been used to study human–computer interaction. One example in this area is Embodied Conversational Agents (ECAs), which have received much attention; they use nonverbal behavior to establish contact with a human user (Bailly et al., 2006; Bee et al., 2009). For such a dialogue to become more reliable, these agents should be equipped with communicative and expressive abilities similar to those we know from human-to-human interaction (speech, gestures, facial expressions, gaze, etc.) (Bee et al., 2009). Increasingly, eye-trackers are used as an additional source of information during users’ conversations with a dialogue agent (Bailly et al., 2006; Bee et al., 2009), and the data obtained are used to support or control conversational agents. Examples of dialogue agents that take the participant’s eye movements into account include Gandalf, a humanoid agent that narrates planets and moons (Thórisson, 1997), the interactive storyteller Emma (Bee et al., 2010), and a dialogue agent that interacts with seniors (Amorese et al., 2022). Eye-tracking studies make it possible to obtain several parameters characterizing, for example, a person’s emotional state (Bradley et al., 2008) and concentration (Chang & Chueh, 2019). The resulting human–computer interaction is more natural, and dialogue agents can interact more realistically with people. It is a prime example of how multimodal fusion can create immersive user experiences.
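To illustrate how such parameters are derived from raw gaze data, the following minimal Python sketch classifies gaze samples into fixations with a simple dispersion-threshold scheme; the thresholds, the 60 Hz rate, and the data layout are assumptions for the example, not the algorithm of any particular eye-tracker.

```python
def detect_fixations(samples, max_dispersion=0.02, min_duration_s=0.1, rate_hz=60):
    """samples: list of (x, y) normalized gaze points; returns fixation windows."""
    min_len = int(min_duration_s * rate_hz)  # minimum samples for a fixation
    fixations, window = [], []
    for point in samples:
        window.append(point)
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        # Dispersion of the candidate window: spread in x plus spread in y.
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            if len(window) - 1 >= min_len:
                fixations.append(window[:-1])  # stable span = one fixation
            window = [point]  # the gaze jumped (saccade): restart the window
    if len(window) >= min_len:
        fixations.append(window)  # close a fixation still open at the end
    return fixations

# Example: 30 near-identical samples (one fixation), then a saccade-like jump.
gaze = [(0.50, 0.50)] * 30 + [(0.90, 0.20)] * 5
print(len(detect_fixations(gaze)))  # -> 1
```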

In response to this growing problem in healthcare, our research team has developed a goal-oriented dialogue system called Terabot, which implements elements of CBT. Initial results from dialogues with seven psychiatric patients are presented in Gabor-Siatkowska et al. (2023a, 2023b), where we measured the accuracy of speech and intent recognition and compared the results with those of healthy subjects. During the therapy sessions at the Institute of Psychiatry and Neurology in Warsaw, Poland, we additionally used a Gazepoint GP3 eye-tracker to collect eye-tracking data from psychiatric patients while they interacted with the dialogue system. In this chapter, we describe the problems encountered during these dialogues between patient and agent and show how integrating the eye-tracking signal into the dialogue system may prevent errors and miscommunication. Our chapter is structured as follows: Sect. 1 briefly reviews related work in our research field. Next, in Sect. 2, we describe Terabot, the therapeutic spoken-dialogue system. In Sect. 3, the conducted experiments are described. In Sect. 4, we present the problems that occurred during conversations between the patients and our dialogue agent. Section 5 describes our proposed solutions to these problems. The chapter concludes with a discussion and further work.

2 Terabot—The “Empathetic” Dialogue System

As mentioned earlier, the number of psychiatrists is insufficient for the growing mental health problems in today’s societies (WHO highlights urgent need to transform mental health and mental health care, June 17, 2022). In response to this issue, and to support psychiatric patients, we have developed “Terabot,” a goal-oriented therapeutic dialogue system. This speech-to-speech system operates in the Polish language. It is designed to meet the needs of psychiatric patients dealing with complex and overwhelming emotions and can help them recognize their feelings and behaviors in difficult situations. In each session, the patient can choose from three emotional topics: anger, shame, and fear. At the end of the session, Terabot offers a relaxation exercise to help the patient calm down. Importantly, and to the benefit of patients, all of Terabot’s responses have been reviewed and edited by psychiatrists, and the dialogue flow has been designed according to psychiatrists’ recommendations using elements of CBT.

  • The Architecture of Terabot

Our goal was to minimize the risk of the robot giving a wrong response, so we designed a goal-oriented dialogue system. Figure 1 shows a schematic diagram of the architecture of our conversational agent.

Fig. 1

Block diagram of Terabot dialogue system

When a patient starts speaking, an automatic speech recognition (ASR) system converts the speech to text (using the Google Web Speech API). This text is then analyzed to identify intents and slots, for which the Dual Intent and Entity Transformer (DIET) classifier (Bunk et al., 2020), a multitask transformer, is used. It handles both intent and entity recognition and, as a sequence model, takes the order of words into account. In parallel with intent and slot recognition, the ASR output is passed to a text-based emotion recognition module (Zygadło et al., 2021) based on the Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned for emotion classification. The emotional state currently detected in the patient’s spoken text determines the value of the “emotion” slot; another slot is filled with the exercise type. Recent further development of the emotion and sentiment recognition used in our dialogue system, based on big data and deep learning methods, is reported in our article (Kozłowski et al., 2023). Our system uses RASA, an open-source dialogue management and natural language understanding (NLU) framework, to implement the action decision pipeline and the DIET classifier. RASA decides the next system action using a weighted combination of a memoization policy (based on stories stored in memory), a rule policy, and the Transformer Embedding Dialogue (TED) policy (Vlasov et al., 2019). The TED policy takes into account the current state of the dialogue (the patient’s intent), slot values (including the recognized emotion), and previous dialogue states. If the next action is an utterance, it is selected from the utterance database. The last step is to transform the selected text (Terabot’s utterance) into a speech signal, which is Terabot’s response to the patient. Since Terabot is a domain-specific dialogue system, acquiring large samples of real data to enlarge the training database was difficult. In Gabor-Siatkowska et al. (2023a, 2023b), we present how to enlarge the dataset using another widely known AI tool, ChatGPT. Our proposed solution increased the dialogue system’s intent recognition accuracy on patients’ spoken utterances.
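For orientation, the following Python sketch mirrors the data flow described above in schematic form. The classifier and policy internals are stubbed out, and all names here are illustrative assumptions rather than Terabot’s actual code.

```python
# Schematic sketch of one dialogue turn: ASR text -> intent + emotion ->
# slot filling -> policy decision -> utterance (then synthesized to speech).
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    intent: str = ""
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def classify_intent(text: str) -> str:      # stands in for the DIET classifier
    return "affirm" if text.strip().lower() in ("tak", "yes") else "inform"

def classify_emotion(text: str) -> str:     # stands in for the BERT emotion model
    return "neutral"

def next_action(state: DialogueState) -> str:  # stands in for the TED policy
    # A real policy weighs intent, slot values, and the dialogue history.
    return "utter_ask_details" if state.intent == "inform" else "utter_continue"

def handle_turn(asr_text: str, state: DialogueState) -> str:
    state.intent = classify_intent(asr_text)
    state.slots["emotion"] = classify_emotion(asr_text)  # fills the emotion slot
    state.history.append(state.intent)
    return next_action(state)  # the chosen utterance is then sent to TTS

state = DialogueState()
print(handle_turn("tak", state))  # -> "utter_continue"
```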

3 Sessions with Terabot

The therapy sessions with Terabot took place between March and August 2023 at the Institute of Psychiatry and Neurology in Warsaw, Poland. This study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Institute of Psychiatry and Neurology in Warsaw. The first experiments, with seven psychiatric patients talking to Terabot, yielded 600 recordings and showed promising results (Gabor-Siatkowska et al., 2023a, 2023b). Most patients confirmed that Terabot understood their answers and emotions (around 3.5 on the Likert scale). All patients rated Terabot’s speech as high quality, natural, and fast, and liked how it was presented.

In total, 38 participants between the ages of 18 and 65 took part in our study. All participants were admitted to a 24-h psychiatric hospital, where psychiatrists examined them and qualified them for additional therapy sessions with Terabot based on their diagnosed conditions. These patients had been diagnosed with F20.0 to F20.9 (schizophrenia) according to ICD-10 (WHO, International Classification of Diseases). They were treated with medication, including antipsychotics, mostly in combination with antidepressants or mood stabilizers. They were randomly selected for the study after meeting the inclusion criteria and signing a consent form; the study was a randomized clinical trial with random assignment to the different experimental conditions. Psychiatrists informed the patients about the purpose and conduct of the clinical trial and addressed any questions they might have had. Patients consented to the recording of their voice and eye-tracking data. For conversational agents designed for psychiatric patients, it is important that no image or video of the patient is recorded during the conversations; neither would patients consent to this, nor would we obtain the bioethics committee’s approval. The anonymity of patients participating in the study was guaranteed.

The patients talked to Terabot in five sessions per week, i.e., one session per day. Each conversation lasted about 7–15 min. The patients could choose one of three exercises: anger, fear, or shame, which could then be repeated or changed the next day. A fragment of such an exercise is shown in Fig. 2. The patients sat in front of Terabot, with an assistant invited into the same room; this was required by the psychiatric institute for the patient’s safety. Figure 3 shows a demonstration setup with a monitor and an eye-tracker at the bottom of the screen.

Fig. 2

The diagram shows a fragment of a conversation in which the patient chooses the exercise on shame. As can be seen, the intention in the patient’s utterance is analyzed, and the next Terabot response is matched to it

Fig. 3

Patient sitting in front of Terabot’s interface. Their dialogue is conducted speech to speech. The eye-tracker is situated at the bottom of the screen

4 Frequent Problems During Dialogues

Despite extensive testing, some unforeseen problems occurred during the therapeutic sessions. They caused unintentional discomfort to the patients, sometimes leading to misunderstandings during the conversation. As explained in Sect. 2, the RASA framework is responsible for sending follow-up utterances after receiving information about the recognized intention of the patient’s speech. Even though we adapted RASA as recommended by its guidelines, we found recurring patterns of problematic issues; further problems during the dialogues were reported by the assistants. All of these are described below.

  • Issue 1: Waiting Too Long for the Patient’s Answer

The first common problem is presented in Fig. 4. Situations frequently occurred in which the patient did not respond to a question posed by Terabot for an extended time. In these situations, there was no message from or to Terabot. It was also impossible for the assistants to ask the patient anything or to repeat the dialogue agent’s question, as this would have interrupted the dialogue flow. The assistant did not know how the system would react to such a situation: whether it would respond and remind the patient, or hang up because it had waited too long for the response.

After the conversation ended, the patients were asked about the long waiting time. Some responded that they had been thinking about what the best answer to the question would be; others said that their thoughts had drifted elsewhere, and it took them some time to come back to the conversation and give an answer. In all those situations, the dialogue system simply waited. It was unclear whether the patient would eventually respond at all; they might even have left the room, and the dialogue agent would still be waiting for an answer. Such situations had to be handled appropriately.

The patients’ speech itself was also a challenge for our dialogue system. The patients who took part in our trials differed in the speed and complexity of their responses. Some answered quickly and superficially, sometimes at low volume. Others provided complete but brief answers without much explanation. Another group responded verbosely, without waiting for Terabot to finish the question. Yet another group raised their voices at Terabot, literally shouting, regardless of how extensive or complex their response was. There were also patients suffering from logorrhea, a communication disorder that causes excessive use of words and repetition, which can lead to incoherence; it is sometimes called pressure of speech (Karbe, 2014).

Long before the trials in the hospital, our research group experimented with Terabot’s waiting-for-response mode. We tested how quickly and for how long the ASR system should be active after Terabot finishes a statement or question. Predicting how long a patient will take to respond is difficult, especially considering the individual differences in speech described above. We therefore configured RASA to wait for the patient’s answer for as long as needed (infinite time when required). Unfortunately, it was not apparent during testing that this seemingly straightforward solution would cause problems in real patient conversations; had it not been for the assistants’ reports, we probably would not have known about the problem. This situation disturbs not only a single question-and-answer exchange with the patient but the entire therapy session as a whole. Such situations may lead to unwanted interruptions and could prevent the dialogue system from functioning correctly.
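To make the trade-off concrete, here is a minimal sketch (with assumed names) of a listening loop with an optional timeout. Calling it with timeout_s=None reproduces the indefinite wait we configured, while a finite value would let the system trigger a reminder instead.

```python
import time

def wait_for_answer(asr_has_speech, timeout_s=None, poll_s=0.25):
    """Poll the ASR until speech arrives. timeout_s=None waits indefinitely,
    which is the configuration that led to Issue 1 in the trials."""
    start = time.monotonic()
    while not asr_has_speech():
        if timeout_s is not None and time.monotonic() - start > timeout_s:
            return None  # no answer in time: the caller can remind the patient
        time.sleep(poll_s)
    return "speech_detected"

# Example with a stub ASR that never hears anything: returns None after 2 s.
print(wait_for_answer(lambda: False, timeout_s=2.0))
```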

  • Issue 2: Lack of Knowledge About Patient Behavior

Another common problem we observed in the trials is shown in Fig. 5. As the dialogue progresses, the patients are at some point asked whether they would like to participate in a relaxation exercise to help calm their emotions. After a positive reply, Terabot proceeds with a relaxation exercise that lasts a few minutes and consists of therapeutic statements provided and recommended by psychiatrists. This is also the time for the patients to train their concentration. Unfortunately, in contrast to real human–human interaction, the trials showed that there is no way of knowing the patient’s current behavior: while Terabot is speaking, no information is gathered on whether the patient is present and listening or has perhaps walked away. If the assistants had not been present during this phase of the dialogue, we would have had no information about the patient’s behavior. Only at the end of the relaxation exercise is the patient asked about their feelings and whether the exercise was helpful.

Fig. 4

Frequently occurring problem during conversations with the dialogue system: the microphone waits indefinitely for the patient’s answer (the vertical gray arrow indicates the decision on the system’s next chosen utterance)

Fig. 5

Another frequently occurring problem during conversations with the dialogue system: no information on the patient’s state or presence during Terabot’s proposed relaxation exercise (the vertical gray arrow indicates the utterance going to the RASA framework)

For patients staying at the psychiatric hospital, the speech-to-speech dialogue agent is also designed to train concentration. The psychiatrists who co-designed this part wanted a sequence of meaningful sentences without requiring the patients to answer, so that patients could focus on their inner thoughts and calm overwhelming emotions. Unfortunately, in the relaxation exercise mode, our dialogue system receives no feedback from the patient, so there is no certainty that the patient is even in front of Terabot’s interface. The only way to be sure is for the assistant sitting in the room to confirm that the patient has been trying to follow the commands given during the exercise and to concentrate at all. The performance of the dialogue system during the relaxation exercise therefore needs to be improved.

Once again, we noticed different patient behaviors during the experiments. These included, for instance, a group trying to concentrate and follow the instructions, but also another group looking away and thinking about anything other than coping with their feelings.

In the present version of our dialogue system, there is no means of contacting the agent other than voice commands. Our tests with patients have brought to light the need for additional control or feedback on whether the patients were actually seated in front of Terabot. It is crucial to know whether the patients have been participating in the relaxation exercise at all. This exposes the need for more than one modality in specialized dialogue agents such as our Terabot.

5 Solutions by Using Eye-Tracker

In parallel with our dialogue agent, we used an eye-tracker to monitor the patients’ eye behavior during the conversations: a stationary Gazepoint GP3 model with a selected sampling frequency of 60 Hz. As our study involved psychiatric patients, for whom concentration is a significant challenge, the psychiatrists advised us not to use a mobile eye-tracker, as it could negatively affect the patients’ behavior during the conversations. When conducting an experiment with a GP3 eye-tracker, various parameters can serve as output signals. Table 1 contains a general overview of the output parameters, divided into groups.

Table 1 General overview of the parameters obtained from the eye-tracker (using the example of the Gazepoint GP3 eye-tracker)

In this study, the eye-tracking data were recorded during the sessions and analyzed afterward; we did not use the data in real time. Once we have implemented the eye-tracker signal in our dialogue system, we will conduct further experiments with real-time data.

  • Choosing the Suitable Parameter to Observe Patient Behavior

When considering an eye-tracking signal for a feedback loop, it is important to determine which particular signal should be passed back to the RASA framework. Researchers, especially in psychology or marketing, who are interested in where a participant is looking typically analyze fixation-related parameters (Krejtz et al., 2023; Porta et al., 2013); in our case, these would be, e.g., FPOGX/Y, the screen coordinates (X and Y) at which the subject’s gaze is focused. Unfortunately, using such a parameter as feedback to the framework would not be desirable in our situation. Since this signal carries on-screen fixation information (either FPOGV or FPOGX/Y), it would only provide information when the patient was looking directly at the screen. Selecting it as feedback would thus cause undesirable actions by the dialogue system: during dialogues with our agent, there will not be 100% eye contact throughout the conversation, just as there is not in human–human conversations. The patient’s gaze may be focused on an object next to the monitor during the conversation, providing no information about fixation on the screen.

In our study, the essential question is whether the patient is present in front of Terabot’s interface, not whether the patient is looking at the screen. Therefore, we suggest that our dialogue system consider the parameters describing the patient’s pupils, namely RPV/LPV or RPMM/LPMM. Even if the patient looks not at Terabot’s interface itself but somewhere away from the screen, there is a broad working window in which the patient’s pupils remain visible to the eye-tracker, so the pupil signal remains valid. With this approach, information indicating whether the patient is present could be sent back to the RASA framework.
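A sketch of such a presence check is given below. It assumes Gazepoint’s Open Gaze API, in which the eye-tracker streams XML records over a local TCP socket; the exact commands, port, and field names should be verified against the vendor documentation, and the parsing here is deliberately simplified.

```python
import socket
import xml.etree.ElementTree as ET

def pupil_presence(host="127.0.0.1", port=4242, n_samples=60):
    """Return the fraction of samples in which at least one pupil was valid."""
    with socket.create_connection((host, port)) as sock:
        # Ask the tracker to stream pupil data (assumed Open Gaze API commands).
        sock.sendall(b'<SET ID="ENABLE_SEND_PUPIL_LEFT" STATE="1" />\r\n')
        sock.sendall(b'<SET ID="ENABLE_SEND_PUPIL_RIGHT" STATE="1" />\r\n')
        sock.sendall(b'<SET ID="ENABLE_SEND_DATA" STATE="1" />\r\n')
        valid, seen, buffer = 0, 0, b""
        while seen < n_samples:
            buffer += sock.recv(4096)
            while b"\r\n" in buffer:
                line, buffer = buffer.split(b"\r\n", 1)
                if not line.startswith(b"<REC"):
                    continue  # skip acknowledgments and other messages
                rec = ET.fromstring(line.decode())
                seen += 1
                # LPV/RPV are 1 when the left/right pupil is detected as valid.
                if rec.get("LPV") == "1" or rec.get("RPV") == "1":
                    valid += 1
    return valid / max(seen, 1)

# The patient could then be classified as present when, say,
# pupil_presence() exceeds an assumed threshold such as 0.2.
```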

  • Proposed Solution to Issue 1

Figure 6 illustrates the solution proposed by our research team using the eye-tracker. The system activates the microphone, analyzes the patient’s response, and verifies the presence of the patient’s eye-tracking signal. Based on this information, a better-adjusted intent for Terabot’s utterance is chosen.

Fig. 6

Proposed solution to Issue 1: the vertical gray arrows indicate that an utterance has been selected for Terabot’s speech. The vertical green arrow shows that the eye-tracking signal can give additional feedback to the RASA framework. Based on this, the utterance chosen for Terabot’s following action might be appropriately changed

We evaluate four hypothetical situations (a compact decision sketch follows the list):

  1. Speech signal and eye-tracking signal provided: The patient answers, and the microphone receives the patient’s speech. In this case, feeding the eye-tracker signal back to RASA would provide live confirmation that the patient is currently present in front of Terabot’s interface and actively participating in the discussion. Notably, maintaining a 100% eye data signal (eye contact) is unnecessary, as such a level is also uncommon in human-to-human conversations; a smaller amount of eye-tracker data is sufficient to classify the patient as present. If the patient gazes at Terabot occasionally while thinking or talking, the dialogue system might receive enough data for confirmation. On the other hand, when the wait for the patient’s response takes too long, a special utterance acknowledging the patient might be activated, e.g., the utterance Long_no_gaze: “What are you thinking about? Should I repeat the question?”

  2. Only speech signal provided: The patient answers, so the microphone receives the patient’s speech, but there is no signal from the eye-tracker. In this situation, an additional utterance could be activated to inquire about the patient’s condition, e.g., the utterance Why_no_gaze: “Is something wrong? Why don’t you look at me at all?”

  3. Only eye-tracking signal provided: The patient does not answer, but the eye-tracking signal is present (and going back to the RASA framework). Here, one must consider how long Terabot should stand still without any additional response, given that the patient is present in front of Terabot’s interface and may just need more time to formulate an answer. After some time, an utterance should be activated, e.g., the utterance No_looking: “Is everything okay? Does something bother you? Should I maybe repeat the question?”

  4. No speech or eye-tracking signal: The patient does not speak, and there is no speech signal from the microphone and no signal from the eye-tracker. In this situation, an utterance putting the whole dialogue into a hold-on mode would be needed. For example, the utterance Are_you_there, with the statement “Hello, are you there? Do you want to continue our conversation?”, should be activated to find out whether the patient is still in front of Terabot’s interface. Depending on the reappearance of the eye-tracking signal or the patient’s speech, further actions should be considered.
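The four cases can be summarized as one decision function. The sketch below is only a schematic mapping from the two presence signals to the special utterances introduced above; the trigger conditions (e.g., what counts as waiting too long) are simplified assumptions.

```python
def choose_followup(speech: bool, gaze: bool, waited_too_long: bool):
    """Map the four signal combinations to the proposed special utterances.
    Returns None when the normal dialogue flow should simply continue."""
    if speech and gaze:
        # Case 1: present and answering; acknowledge only after a long pause.
        return "utter_Long_no_gaze" if waited_too_long else None
    if speech and not gaze:
        return "utter_Why_no_gaze"    # Case 2: answering but never looking
    if not speech and gaze:
        # Case 3: present but silent; allow extra time before prompting.
        return "utter_No_looking" if waited_too_long else None
    return "utter_Are_you_there"      # Case 4: no signal: hold the dialogue

print(choose_followup(speech=False, gaze=False, waited_too_long=True))
# -> "utter_Are_you_there"
```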

  • Proposed Solution to Issue 2

Looking at the situation described in Fig. 5, what is missing is additional information about the patient that might confirm the patient’s presence during the relaxation exercise. Our solution is to let the RASA framework receive input from the eye-tracker while the patient participates in the exercise. If the feedback signal contains information about the patient’s detected pupil diameter, this is sufficient confirmation that the patient is indeed present in front of Terabot’s interface. We do not require 100%, or even close to 100%, of the eye-tracking data as confirmation of presence; a smaller amount of data sent to the RASA framework is sufficient.

RASA would then be able to activate utterances depending on the patient’s behavior (understood here as the presence of the pupil signal received by the eye-tracker). For instance, if a patient consents to a relaxation exercise and the pupillary signal is recorded, various situations can be identified in which the dialogue system could enhance its responsiveness (Fig. 7). The following scenarios could be considered (a monitoring sketch follows the list):

  1. The patient agrees to the exercise, and the system receives feedback on the pupil signal: this confirms that the patient is present in front of Terabot’s interface and wants to participate in the agent’s relaxation exercise, during which they can focus on their thoughts and emotions.

  2. The patient agrees to the exercise, but the RASA system receives an indication that there is no signal from the patient’s pupil. In this situation, it can be assumed that the patient is not present in front of the screen or has closed their eyes. As a result, a special utterance can be activated with an additional question or request. This gives the patient room to explain what is going on and lets the system adapt to the patient’s behavior.

  3. The patient does not agree to the relaxation exercise: this situation is already handled in Terabot’s dialogue scenarios.

    Fig. 7

Illustration of the proposed solution to Issue 2: the patient’s utterance (vertical gray arrow) goes back to the RASA framework, and the eye-tracking signal (green arrow) serves as an additional input signal. On the basis of these two signals (speech and eye-tracking), a suitable utterance is formed as Terabot’s response
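A sketch of how this could work during the relaxation exercise is given below: pupil validity is sampled in short windows, and whenever the share of valid samples drops below an assumed presence threshold, a check-in utterance is triggered (scenario 2). The function get_pupil_validity_ratio stands in for the feedback channel from the eye-tracker; its name, the threshold, and the utterance name are assumptions.

```python
import random

def monitor_relaxation(get_pupil_validity_ratio, exercise_s=180,
                       window_s=10, presence_threshold=0.2):
    """Yield a check-in action whenever presence drops during the exercise."""
    for _ in range(0, exercise_s, window_s):
        ratio = get_pupil_validity_ratio(window_s)  # share of valid pupil samples
        if ratio < presence_threshold:
            # Scenario 2: exercise accepted but no pupil signal -> check in.
            yield "utter_check_presence"

# Example with a stub that simulates a patient intermittently looking away.
for action in monitor_relaxation(lambda w: random.random(), exercise_s=60):
    print("Triggered:", action)
```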

6 Discussion

We have highlighted the probable advantages and dialogue system enhancements that could be achieved by integrating an eye-tracker module into the conversational agent. In the case of our CBT-oriented dialogue agent, we are also aware of the potential difficulties that could arise from such an extension. The following challenges may arise:

  • Investing considerable development effort in linking the two systems, i.e., connecting the eye-tracker module to the RASA system while taking pupillometric parameters into account.

  • The need to remodel the dialogue flow diagram in the RASA framework.

  • The need to expand the dialogue scenarios, which must be approved by psychiatrists.

Despite these challenges, we have shown that the presence of the patient’s pupillary signal offers a way to potentially enhance conversations with Terabot. We expect this to change the dialogue flow to the patient’s benefit, since the dialogue system would be more sensitive to human behavior. We believe that implementing a multimodal dialogue agent for therapy purposes would improve conversational adaptability. Through this multimodal approach, the interaction between the patient and the dialogue agent might more closely resemble the human-to-human interaction we know from conventional therapeutic sessions with specialists, making it possible to obtain an enhanced, unique agent for CBT therapy purposes.

7 Limitations

Having discussed the contribution and implications of the study, some shortcomings need to be considered critically. When analyzing eye-tracking data from patients during a conversation, it is important to bear in mind that a patient usually moves a bit while sitting. In general, to obtain data of the highest possible quality with a stationary eye-tracker, the participant’s head should be kept as still as possible. We did not impose this requirement on the psychiatric patients, as it could have compromised their comfort during therapy. Consequently, the first analyses of the collected eye-tracking data revealed many gaps due to the patients’ head movements (e.g., in the relaxation phase of the exercises). Note that in our pilot study, Terabot was used to support drug therapy for psychiatric patients with schizophrenia.

Another limitation is that we have not yet implemented the integration of the eye-tracker signal into our dialogue system. Therefore, we cannot yet say whether the proposed solutions are fully satisfactory, nor can we predict whether combining Terabot with the eye-tracker will cope with all the challenges that arise during therapy sessions.

8 Conclusion

As technology continues to evolve, multimodal fusion will become a standard part of human–machine interaction. The use of multimodal features in dialogue systems is increasingly common (Amorese et al., 2022; Bailly et al., 2006; Bee et al., 2009). New technologies in the fields of psychiatry and psychology are being designed to support or supplement therapy in various ways (Carroll & Rounsaville, 2010; Stefaniak et al., 2019).

In this chapter, we have shown how a pupillary signal can be used as an additional source of information for a goal-oriented conversational agent. We pointed out the problems we encountered in our study when using a single-modality dialogue agent and demonstrated how an eye-tracker can address such situations. When the pupillary signal is used as input to the dialogue system’s RASA framework, the utterances given back to the patient can be modified more appropriately.

9 Further Work

In our case, the next step will be experimental testing to determine whether the proposed solutions are effective for our dialogue agent. After that, we plan to analyze the eye-tracking data over the whole relaxation exercise. We want to find out whether the Terabot-guided dialogues have an effect, e.g., on the patient’s mental load or on achieving a state of relaxation. If we can improve Terabot, we plan to expand the research in two ways. First, to help people with different mental illnesses, we would like to add further sets of exercises (emotions) that patients could talk about. Second, expanding the research and testing the exercises on healthy people might also help persons in challenging circumstances, such as students struggling with strong negative emotions before an exam or a thesis defense.

We believe that although Terabot must not substitute for a visit to a psychiatrist, this dialogue agent can help patients work through difficult situations in everyday life. As such assistive dialogue agents evolve, we believe they can help patients and provide them with therapeutic support much faster.