1 Introduction

The number of stroke patients, especially in the elderly population, is increasing significantly worldwide [8]. The disabilities after a stroke can be greatly reduced with intensive and frequent training [13]. Therefore, more therapies and therapists are needed to support the neuro-rehabilitation of patients. Due to the shortage of therapists, optimal therapy can no longer be guaranteed for all patients. This leads to a reduced quality of life and slower recovery [8].

Intelligent assistance systems can be used to support therapists in their work and patients in their post-stroke recovery [8]. In this paper, we present a text synthesis module that we implemented as a component of our intelligent assistance system for neuro-rehabilitation. An exemplary test setup of this system is shown in Fig. 1. The text synthesis module generates the texts inside the speech bubble and is essential for natural multi-modal interaction between robot and patient.

In neuro-rehabilitation, therapists support their patients during the execution of several exercises. The exercises are designed to train the abilities a patient lost due to a stroke. Each exercise is first demonstrated by the therapist to show the correct execution. The patient then repeats the exercise while the therapist assists and gives feedback.

In our project, several therapy sessions were observed by experts and the structure of the interaction as well as the respective contents of a dialogue section were recorded in a neuro-rehabilitation therapy manual (NTM).

Our assistance system implements the NTM using a Moore machine with states for each part of the manual. Each time a state is entered, predefined content is shown on an output device (the tablet integrated into the Pepper robot, plus an additional tablet on the table for better readability) and uttered via the voice output of the robot. This multi-modal content consists of multiple components:

  • Text, which is shown and read

  • Image / Video, which is displayed

  • Buttons / Input-field, in case of a question

To personalize the texts and adapt them to the current context as well as possible, a text synthesis is required.

To implement this text synthesis, we first determined the requirements that the generated texts and the system itself must meet. We then carried out a literature analysis to identify existing systems and concepts. Since we found no system that meets our requirements, we developed and evaluated our own.

Fig. 1 Illustration of our intelligent robotic assistance system and the interaction between a Pepper robot and a patient

2 Requirements

In close cooperation with our medical partners (neuro therapists and psychologists), we identified the following requirements for a text generation system in E-BRAiN:

  R1. Texts have to be user and content dependent.

  R2. Because this system is used in Germany, the text synthesis has to generate German texts.

  R3. The content to be conveyed and the document structure are already given in a specified guide.

  R4. Texts have to be generated for different language levels, depending on the cognitive abilities of the patient.

  R5. The texts should be as natural as possible. The patient should not be able to distinguish between a text synthesised by our system and a text written by a human.

  R6. Full offline capability, due to the usage in a clinical environment.

  R7. Text synthesis in real time, so that texts can be adapted to the current context.

These requirements were identified during informal expert interviews with our medical project partners (N=5, medical doctors and nurses from neurology, psychology, and psychotherapy). They are quite specific to the E-BRAiN project and arise from the highly predetermined form of the interaction between therapist and patient.

3 State of the Art

We performed a systematic literature review to identify existing systems and concepts for text generation in a medical context. During the review, 22 papers were selected that met our search criteria. Table 1 shows a summary of the best tools we found with respect to our requirements.

Suter et al. developed a rule-based system that simplifies existing German texts into simple language [12]. The system was found to be able to simplify texts. However, it lacks the ability to simplify single difficult words or phrases, and in those situations simplification by hand was found to be better.

Other systems and concepts, such as the Penman system [7] for English and its extension KOMET [5] for German, were also considered, but we classified them as unusable in our case since no implementations are available. The same applies to the VIE-LANG system [2], which uses the semantic network SEMNET to synthesise texts in response to dialogue inputs.

Puzikov and Gurevych [9] used a template-based model and compared it with a neural model in order to take part in the E2E NLG Challenge. In this comparison, the template-based model prevailed due to its reliability and absence of errors. The template-based approach has also proven to be successful in other areas. For example, ontologies can be automatically translated into texts [1].

RosaeNLG [11], a Node.js library based on the template engine Pug, implements a template-based approach. In addition to German, it supports English, French, Italian and Spanish. Furthermore, it is completely offline-capable and already provides relevant functions for natural language generation, such as the correct selection of an article and the correct conjugation of verbs.

Table 1 Comparison of existing approaches wrt. our requirements

4 Implementation

Our system is built on top of RosaeNLG, extended with an explicit external dialog state. After introducing the general architecture, we will describe the following sub-problems to be solved in order to make the texts as natural as possible: synonymic alternatives, context dependencies, linguistic references, and language complexity levels.

4.1 System Architecture

As mentioned above, the general flow of the therapies is fixed by the NTM. Therefore, we only have to perform context adaptation, micro-planning and surface realisation. We implemented the dialog structure using a Moore machine, with the automaton states corresponding to states of the NTM and transitions for each possible user interaction. Each automaton state is associated with a set of interactions (texts, images, ...) that are shown to the user when the state is entered. We will refer to these multi-modal interactions as communicative acts below. Fig. 3 shows a small part of such an automaton.
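The automaton structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, state ids, events and texts are hypothetical and only show how a Moore machine ties one communicative act to each state.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """One automaton state; its output depends only on the state (Moore machine)."""
    state_id: str
    output: dict                                      # communicative act: text, image, buttons, ...
    transitions: dict = field(default_factory=dict)   # user event -> next state_id


class DialogMachine:
    def __init__(self, states, start):
        self.states = {s.state_id: s for s in states}
        self.current = start

    def enter(self):
        """Return the communicative act shown when the current state is entered."""
        return self.states[self.current].output

    def fire(self, event):
        """Follow the transition for a user event and return the new state's output."""
        self.current = self.states[self.current].transitions[event]
        return self.enter()


# Hypothetical two-state fragment; ids, events and texts are illustrative only.
machine = DialogMachine(
    [
        State("greet", {"text": "Hello", "buttons": ["Continue"]}, {"continue": "demo"}),
        State("demo", {"text": "Please watch the exercise.", "video": "exercise.mp4"}),
    ],
    start="greet",
)
first = machine.enter()
second = machine.fire("continue")
```

Because the output is attached to the state rather than to the transition, the same communicative act is produced regardless of how the state was reached.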

4.2 Synonymic Alternatives

To increase the diversity of the generated texts, we implemented different versions of a given communicative act as synonymic alternatives. This is already supported by RosaeNLG using, for instance, the built-in synz mechanism. This enables our system to generate texts in different forms, for example by switching between simple alternatives like “Hello” and “Welcome”, but also by choosing between larger synonymic statements like “You didn’t answer me” and “I did not get an answer from you”.
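The idea of synonymic alternatives can be sketched independently of RosaeNLG. The following Python fragment does not reproduce the synz mechanism; the class `SynonymPicker` and its policy of avoiding an immediate repetition are assumptions made for illustration.

```python
import random


class SynonymPicker:
    """Pick one of several equivalent phrasings, avoiding an immediate repeat."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._last = {}   # act name -> index of the alternative used last time

    def pick(self, act, alternatives):
        candidates = [i for i in range(len(alternatives)) if i != self._last.get(act)]
        if not candidates:          # only one alternative available
            candidates = [0]
        index = self._rng.choice(candidates)
        self._last[act] = index
        return alternatives[index]


picker = SynonymPicker(seed=0)
greetings = ["Hello", "Welcome"]
first = picker.pick("greeting", greetings)
second = picker.pick("greeting", greetings)   # differs from the first pick
```

Remembering the last choice per communicative act keeps two consecutive utterances of the same act from sounding identical, which is one plausible way to use such alternatives.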

4.3 Context Dependencies

In our domain the generated texts have to be customised with respect to the following context information:

  (R1) Demographic information (name, age, ...)

  (R2) Medical information wrt. impairment (affected side, severity, ...)

  (R3) Cognitive abilities (language understanding, vision, ...)

  (R4) Performance during the exercises.

While the first context items (R1-3) are static for a given person, the item (R4) depends on information collected during the user interaction.

We implemented context-dependent text generation using the built-in functionality of RosaeNLG, in particular using If-Then, agreeAdj and the value-mixin statements as shown in Fig. 2.

Fig. 2 A small RosaeNLG code sample for context-dependent text generation. Texts in typewriter font are output. The construct +agreeAdj generates the correct reference to the impaired hand
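To make the kind of context dependency concrete, here is a small Python sketch. It is not RosaeNLG code and does not use agreeAdj; the lexicon, the patient fields `affected_side` and `severity`, and the example sentences are hypothetical. It shows an if-then branch on context and a German adjective that agrees with the gender and case of the noun.

```python
# Weak adjective endings after a definite article (singular), keyed by (case, gender).
ENDINGS = {
    ("nominative", "m"): "e", ("nominative", "f"): "e", ("nominative", "n"): "e",
    ("accusative", "m"): "en", ("accusative", "f"): "e", ("accusative", "n"): "e",
}
ARTICLES = {
    ("nominative", "m"): "der", ("nominative", "f"): "die", ("nominative", "n"): "das",
    ("accusative", "m"): "den", ("accusative", "f"): "die", ("accusative", "n"): "das",
}
NOUN_GENDER = {"Hand": "f", "Arm": "m", "Bein": "n"}   # tiny illustrative lexicon
SIDE_STEM = {"left": "link", "right": "recht"}


def affected_limb(patient, noun, case="accusative"):
    """Return article + agreeing adjective + noun, e.g. 'den linken Arm'."""
    gender = NOUN_GENDER[noun]
    adjective = SIDE_STEM[patient["affected_side"]] + ENDINGS[(case, gender)]
    return f"{ARTICLES[(case, gender)]} {adjective} {noun}"


def instruction(patient):
    # "Please raise the left/right arm."
    text = f"Bitte heben Sie {affected_limb(patient, 'Arm')}."
    if patient.get("severity") == "severe":     # simple if-then branch on context
        text += " Nehmen Sie sich dabei Zeit."  # "Take your time."
    return text
```

A template engine with built-in agreement removes the need for such hand-written tables, which is one reason a library like RosaeNLG is attractive for German.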

4.4 Linguistic References

Each state of the automaton corresponds to one communicative act. Sub-sequences of states, corresponding to sections of the NTM, are meant to convey a larger piece of information. To make the interaction as natural as possible, we would like to use linguistic references within a single communicative act and also across the different states within a section. RosaeNLG allows the usage of anaphoric references as long as the text is generated during a single invocation of the system, but it has no mechanism to refer to texts generated in a previous invocation.

Using a linguistic (anaphoric) reference to refer to some information just presented makes the texts sound very natural. But if the time between introducing a concept and referring to it is too long, the user might not recall what is meant by anaphoric references like “it” and “this”. Unfortunately, we have no control over the amount of time spent in a state, as this is controlled by the user. The user might, for instance, need a break for some reason.

Our dialogue follows the states of the underlying automaton, as shown in Fig. 3. The states are linked to each other by temporal or event-triggered transitions. Referring expressions spanning multiple states should be used whenever possible, but only within a limited temporal scope. This is the case if the state is linked to other states describing the same context and if the time span \(\Delta_t\) is below a defined number of seconds. The temporal scope of referring expressions is illustrated in Fig. 3.

Fig. 3 A small fragment of the automaton. 7 states are shown with corresponding state_ids (a-g). The referring objects are underlined

RosaeNLG allows anaphoric references within a single call. In our case, however, the text for a given state of the automaton can only be generated when that state is entered, so we could not use the mechanism provided by RosaeNLG. To solve this problem, we introduced an external dialogue memory. With this memory, it is possible to generate the texts of several states jointly, split them into chunks corresponding to the states, and save them individually. One section consists of multiple state_ids and implements the idea of the temporal scope of referring expressions. With this technique, we are able to use the same referring expressions in multiple state_ids.

When our system receives a request to generate the text of a given state_id, it checks to which section the state_id belongs. The texts of the successor state_ids of the section are synthesised and stored in our external dialogue memory. This external memory is implemented as a cache with an adaptive timeout. The entries are removed as soon as the time span after introducing some concept exceeds a user-dependent timeout.

After that time span, the referenced concept must be named explicitly again. If the timeout were set to 3 seconds in Fig. 3, the following text would be generated in state d: “The exercises will help you to reach your goal, to be able to brush your teeth again.”
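The caching behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `DialogueMemory`, `text_for` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a joint generation call whose first text names the concept explicitly while later texts refer back to it.

```python
import time


class DialogueMemory:
    """Cache of pre-generated state texts that expire after a per-user timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._entries = {}   # state_id -> (text, time stored)

    def store_section(self, texts):
        now = self._clock()
        for state_id, text in texts.items():
            self._entries[state_id] = (text, now)

    def lookup(self, state_id):
        entry = self._entries.get(state_id)
        if entry is None:
            return None
        text, stored_at = entry
        if self._clock() - stored_at > self.timeout_s:
            del self._entries[state_id]      # reference is out of temporal scope
            return None
        return text


def text_for(state_id, section, memory, generate_jointly):
    cached = memory.lookup(state_id)
    if cached is not None:
        return cached                        # still in scope: keep the reference
    # Expired or never generated: regenerate from this state onwards, so the
    # concept is named explicitly again, and cache the successor texts.
    remaining = section[section.index(state_id):]
    texts = generate_jointly(remaining)
    memory.store_section({s: texts[s] for s in remaining[1:]})
    return texts[state_id]


def fake_generate(states):
    """Stand-in generator: the first text is explicit, later ones use references."""
    return {s: ("explicit" if i == 0 else "anaphoric") + f" text for {s}"
            for i, s in enumerate(states)}
```

Injecting the clock makes the timeout easy to test, and a per-user `timeout_s` is one simple way to model the user-dependent timeout mentioned above.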

4.5 Language Complexity Levels

To adapt the texts to the current patient's abilities, we introduced the concept of language levels (R4) into our system. Before starting the therapy, various tests are carried out in which the physical and cognitive abilities of a patient are determined. Based on these values, a language level is chosen for each patient. Currently, texts can be generated in normal and simplified German, but other levels are possible as well. Simple language is often used to communicate with people with limited cognitive skills. For this purpose, all texts were simplified and checked with online tools for simplified language [6, 10]. The language complexity level is added as dynamic context information that can be adjusted and evaluated as sketched in Section 5.3.

This concept can also be used to include other types of language levels. For example, people with an impairment of the left hemisphere have limited processing of logical and mathematical statements. Conversely, people with an impairment of the right hemisphere can process these statements very well [3]. So, depending on the patient, logical/mathematical or creative texts can be used to communicate with them.
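As a rough illustration of language levels as context information, the sketch below stores one wording per level for each communicative act. The act name, the English wordings and the `LanguageLevel` enum are hypothetical; how a level is derived from the patient's test results is not specified here and is therefore left out.

```python
from enum import Enum


class LanguageLevel(Enum):
    NORMAL = "normal"
    SIMPLE = "simple"


# Hypothetical per-level wordings of one communicative act (English for illustration).
TEMPLATES = {
    "explain_goal": {
        LanguageLevel.NORMAL: "These exercises are designed to restore the mobility of your {side} hand.",
        LanguageLevel.SIMPLE: "We train your {side} hand.",
    },
}


def render(act, level, **context):
    """Fill the wording for the given act and level; a missing level raises KeyError."""
    return TEMPLATES[act][level].format(**context)
```

Raising an error for a missing level, instead of silently falling back to normal language, is a deliberate choice in this sketch: in a medical setting a gap in the simplified texts should be noticed before deployment.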

4.6 Debugging Functionality

As texts in medical settings must be correct, we integrated a debugging functionality into our system. We extended RosaeNLG and our dialog engine such that we are able to generate all possible paths through the dialog, including all possible verbalisations of the texts. All texts, including further multimedia content, are exported into a PDF with a single page for each state. Transitions between states are integrated as hyperlinks between the corresponding pages. This allows medical experts to check the correctness and dynamics of the dialogues.
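The exhaustive enumeration behind such a debugging export can be sketched in a few lines. The graph, the text variants and the function names below are hypothetical, and the sketch assumes an acyclic dialog graph; PDF generation is omitted.

```python
from itertools import product


def all_paths(transitions, start):
    """Yield every path from start to a terminal state (assumes an acyclic graph)."""
    successors = transitions.get(start, [])
    if not successors:
        yield [start]
        return
    for nxt in successors:
        for tail in all_paths(transitions, nxt):
            yield [start] + tail


def all_verbalisations(path, variants):
    """Yield every combination of text variants along one path."""
    for combo in product(*(variants[state] for state in path)):
        yield list(zip(path, combo))


# Hypothetical four-state dialog with two greetings in state "a".
transitions = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
variants = {"a": ["Hello", "Welcome"], "b": ["Good."], "c": ["Try again."], "d": ["Goodbye"]}
paths = list(all_paths(transitions, "a"))
dialogues = [d for p in paths for d in all_verbalisations(p, variants)]
```

The number of dialogues grows multiplicatively with the number of variants per state, so a real export would likely list the variants per state page rather than materialise every combination.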

5 Evaluation

To evaluate our system, we tested three different aspects: speed of the generation, correct context adaptation, and naturalness of the generated texts.

To measure the speed, we generated all texts for one of our implemented therapies (348 states in total) and measured the time between request and response (including network overhead). The mean time span was 11.67 ms (minimum 4.97 ms, maximum 99 ms), which meets requirement R7.
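A measurement of this kind can be reproduced with a small harness such as the one below. It is a sketch, not the authors' benchmark: `generate` is a hypothetical callable that sends one request for a state id and blocks until the response arrives.

```python
import statistics
import time


def measure_latencies(generate, state_ids):
    """Return the per-request latency in milliseconds for each state id."""
    latencies = []
    for state_id in state_ids:
        start = time.perf_counter()
        generate(state_id)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarise(latencies_ms):
    """Mean, minimum and maximum of the measured latencies."""
    return {
        "mean": statistics.mean(latencies_ms),
        "min": min(latencies_ms),
        "max": max(latencies_ms),
    }
```

Reporting the maximum alongside the mean matters for a real-time requirement, since a single slow response is what a patient would actually notice.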

To evaluate the correct adaptation to a given patient, we generated a PDF file that contains some of the texts (about 100 sentences in total) and asked a domain expert to verify them wrt. language quality and correct context adaptation. To check the reliability of this method, we introduced some mistakes on purpose. For example, instructions for individual exercises were mixed up, and incorrect context information was inserted, such as the wrong side (left or right) of the patient to be treated. The expert marked only our deliberate mistakes and no other problems. This shows that the system adapted all texts correctly to the patient and that requirements R1-4 and R5 are satisfied. To further evaluate the correctness and naturalness of the generated texts, we used an online tool [4] that checks German spelling and grammar. To verify the simple language, we used the tools mentioned above. The checkers detected some minor issues (e.g., small spelling mistakes in medical terms and swapped letters), and the corresponding corrections were integrated into the system.

6 Conclusion

In this paper, we describe a tool that is able to generate user- and content-adapted texts for predefined content in a medical domain. After identifying the necessary requirements, we performed a systematic literature review and designed a system based on RosaeNLG.

To meet all requirements for our application domain, we extended RosaeNLG with an external dialog memory. We evaluated the performance and correctness of the system. In total, we have implemented 4 different neuro-rehabilitation therapies, which will be evaluated in a clinical trial in the upcoming months.