1 Introduction

Rising patient numbers, paperwork, and a declining workforce are leading to tight working schedules in healthcare. In Germany, the average time a physician spends with each patient is reported to be about seven and a half minutes [1]. One of the many problems arising from these issues concerns the communication between patient and physician. Even though effective physician–patient communication has been shown to lead to positive health outcomes [2], a stressful environment reduces empathy [3] and ultimately the quality of care provided, as empathy is a key factor in effective patient encounters [4]. Furthermore, the rising acceptance of AI-aided diagnoses [5], digitalization, and the promise of more accurate diagnoses and better therapeutic outcomes [6] mean that “soft factors” like empathy or stress management are increasingly neglected in medicine. The project “AudEeKA”, short for “Auditive Emotionserkennung für empathische KI-Assistenten” (English: auditory emotion recognition for empathic AI assistants), aims to automate the assessment of affective human states. One use case is supporting medical staff in estimating the emotional and situational health of patients. This could indicate to the physician when more time with a patient is needed, e.g., when the patient is in a state of emotional distress. Another use case is remote medical care, where assessing the emotional state of patients is more difficult, because communicating through technical means can leave the overall impression of the patient lacking or distorted. Assessing the emotional state becomes especially important when psychological and medical care cannot be provided via telemedicine or even on-site, as is the case in long-duration manned space flight. Because of the increased difficulty of use in space and on highly selected persons, this can be considered a separate but most important use case, as the success of a mission can be coupled to emotional health.

Table 1 Multimodal speech emotion recognition approaches

The end goal of “AudEeKA” is to reliably and continuously recognize emotions in humans by collecting multimodal physiological signals. As the use cases require portable edge-computing devices, or at least a solution able to run in environments with little computing power and energy supply, this project will mainly look into solutions running on lightweight, low-resource devices. Furthermore, the computation should be fast enough to work in real time, so that it can be used live in a conversation. The result could then be an empathic AI assistant that classifies emotions and stress continuously in various environments.

To reach this goal, speech signals will first be used to classify emotions. After this first step, it is planned to combine emotion recognition with stress classification based on biosignals and possibly speech. Both emotions and stress should be tracked, possibly exploiting and observing their synergy. Later on, more biosignals will be included to increase the accuracy of emotion classification. All of these sub-goals naturally have to consider the resource consumption and computing time of the resulting algorithms and classifiers.

2 Related Work

Emotion recognition (ER) is a topic of long-standing interest and has been researched in human–computer interaction [13], affective robotics [14], e-learning [15], automotive safety [16], customer care [17], and healthcare [18].

Currently, most approaches to ER are based on speech, text, visual, or physiological data [19]. Those that combine modalities usually do so between speech and text (e.g., [20,21,22]) or speech and visual data (e.g., [23,24,25]). A good overview is given by Imani and Montazer, who list both unimodal and multimodal approaches [26]. There is little existing work exploiting the potential of combining speech with other biosignals for ER, such as blood volume pulse (BVP), electromyography (EMG), skin conductance (SC), respiration (RSP), body temperature (Temp), electroencephalography (EEG), electrocardiogram (ECG), heart rate variability (HRV), and electrodermal activity (EDA), even though the literature advises otherwise [27]. The approaches found are listed in Table 1. Looking at Table 1, it also becomes apparent that many of them use computationally expensive methods: while a Support Vector Machine (SVM), k-Nearest Neighbours (k-NN), and Linear Discriminant Analysis (LDA) are less costly, a Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), Deep Belief Network (DBN), Probabilistic Neural Network (PNN), Deep Neural Network (DNN), Temporal Convolutional Neural Network (TCN), or Extreme Learning Machine (ELM) are computationally more expensive. These latter methods are therefore not applicable to “AudEeKA”. It should be noted that Table 1 makes no claim to completeness. Table 1 also shows that many approaches take visual signals, such as features from the face or posture, into account, which is often seen as a privacy issue.
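To give a rough impression of this cost argument, the following sketch compares training and inference time of the lighter classifiers (linear SVM, k-NN, LDA) with an MLP on synthetic, emobase-sized feature vectors. The data, class count, and hyperparameters are placeholders chosen for illustration only; absolute timings depend on hardware and do not reproduce any of the works cited in Table 1.

```python
# Rough illustration of the computational-cost argument: lightweight classifiers
# (linear SVM, k-NN, LDA) versus an MLP on synthetic, emobase-sized feature vectors.
# Data, class count, and hyperparameters are placeholders; absolute timings depend
# on hardware and do not reproduce any of the works cited in Table 1.
import time
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 988))   # 500 utterances x 988 features (emobase-sized)
y = rng.integers(0, 4, size=500)  # four hypothetical emotion classes

models = {
    "LinearSVC": LinearSVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
    "MLP": MLPClassifier(hidden_layer_sizes=(340, 32), max_iter=20),
}

for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X, y)
    fit_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    model.predict(X[:100])
    pred_s = time.perf_counter() - t0
    print(f"{name:10s} fit: {fit_s:.3f}s  predict(100 samples): {pred_s:.4f}s")
```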

For any ER solution to work, annotated datasets are needed. A good overview of speech datasets is given by Wani, Gunawan, Qadri, Kartiwi and Ambikairajah [28] or by the Technical University of Munich [29]. Although these papers list speech emotion datasets in large numbers, several difficulties arise when one wants to build models usable in real life: there are not only many acted datasets, which translate poorly to the real world, but also manifold labelling schemes (basic emotions of varying numbers or the valence–arousal scale). The same problem can be observed in datasets using physiological signals (as listed by Shu et al. [30] or by Larradet et al. [31]). In physiological datasets, however, the problem lies not so much in acted emotions, as these are rarely acted, but in the stimuli used to elicit the emotions, which do not fully resemble the real world. For example, many studies use the International Affective Picture System [32], while others experiment with virtual-reality games (e.g., [33]) or systems (e.g., [34]) or with music (e.g., [35]). Analogous to the speech-based datasets, multiple labelling strategies are found in physiological datasets.

Fig. 1 Statistical values of MLP-classifier with emobase feature set on Emo-DB with LOOCV

Fig. 2 Statistical values of MLP-classifier with Compare2016 feature set on Emo-DB with LOOCV

Fig. 3 Averaged confusion matrix of MLP with emobase feature set on Emo-DB with LOOCV

Fig. 4 Averaged confusion matrix of MLP with Compare2016 feature set on Emo-DB with LOOCV

3 Goals and Challenges

Considering the manifold use cases in Sect. 1 as well as the related work in Sect. 2, it becomes clear that the end goal mentioned in Sect. 1 can be divided into sub-goals that are ultimately to be combined, each with its own pitfalls and difficulties.

3.1 Speech Emotion Recognition

While there is already a variety of thoroughly described solutions practising speech ER (SER), most of them fall short of the requirements of “AudEeKA”. For instance, there is a strong shift towards heavyweight deep-learning models, which cannot be considered for “AudEeKA” because of the low-resource setting. Other problems arise when using traditional machine-learning approaches: for example, feature sets have to be chosen carefully, but not in a way that is too specific to the dataset used. The expression of affect depends on factors such as culture [36], gender [37], and age [38], and the end goal mentioned in Sect. 1 requires a solution that works for a wide range of people (especially important in telemedicine settings) and in various environments, while the resource restriction at the same time favours a minimal feature set.

To get a first understanding of the performance of different feature sets as well as of inter-individual differences, first test results on the Berlin Database of Emotional Speech (Emo-DB), a speech database with a set of basic but acted emotions [39], are available in “AudEeKA”. Classifications were done by a small MLP with hidden layer sizes of 340 and 32, a logistic activation function, a maximum of 20 iterations, and an initial learning rate of 0.0035, implemented with scikit-learn [40]. Parameters were chosen by experience. The feature sets used were taken from openSMILE [41], namely the emobase functionals (988 features) and the Compare2016 functionals (6373 features). For evaluation, a leave-one-out cross-validation (LOOCV) was used, in which the complete data of one subject was held out as the test set in each iteration. This LOOCV strategy aims at a more realistic test scenario. Figures 1 and 2 show boxplots with different statistical evaluations of the applied LOOCV. One conclusion is that, when working with MLPs, larger feature sets result in better recognition rates and fewer outliers than smaller ones. That feature sets affect not only overall recognition rates but also emotion-specific recognition rates can be seen in Figs. 3 and 4, which depict the emotion-specific confusion matrices with the mean recognition accuracy calculated for every emotion. It is apparent that, in general, some emotions get mixed up more often than others. Figures 1 and 2 also make clear that the recognition rate varies strongly from person to person. Ultimately, it is unrealistic that a “one-size-fits-all” model will work well on an individual in the wild. Not only are the inter-individual differences remarkable, there are also few datasets that record speech in the wild (as also seen in Sect. 2).
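For reference, a minimal sketch of how such an evaluation could be set up with scikit-learn and the openSMILE Python package is given below. The MLP hyperparameters follow the description above; the data loader load_emodb(), the file paths, and the added feature scaling step are hypothetical additions for illustration, not the project's exact code.

```python
# Sketch of the evaluation described above: a small MLP on openSMILE functionals
# with leave-one-speaker-out cross-validation on Emo-DB. The loader load_emodb(),
# the file paths, and the scaling step are hypothetical additions for illustration;
# the MLP hyperparameters follow the text.
import numpy as np
import opensmile
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# emobase functionals (988 features); FeatureSet.ComParE_2016 gives the 6373-feature set
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_paths):
    """Return one functional feature vector per utterance."""
    return np.vstack([smile.process_file(p).to_numpy().ravel() for p in wav_paths])

# Hypothetical loader returning utterance paths, emotion labels, and speaker IDs
wav_paths, labels, speakers = load_emodb()

X = extract_features(wav_paths)
y = np.asarray(labels)
groups = np.asarray(speakers)

# Leave-one-speaker-out: the complete data of one subject forms the test set per fold
accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = make_pipeline(
        StandardScaler(),  # scaling not mentioned in the text, added for stability
        MLPClassifier(hidden_layer_sizes=(340, 32), activation="logistic",
                      max_iter=20, learning_rate_init=0.0035, random_state=0),
    )
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print("per-speaker accuracies:", np.round(accuracies, 3))
print("mean LOOCV accuracy:", float(np.mean(accuracies)))
```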

Furthermore, a professional environment, as in manned space flight or healthcare, leaves little room for showing emotions. Often there are pre-existing speech patterns, which further complicate SER. To overcome these barriers, another sub-goal of the project is to create a new dataset. An additional sub-goal is to implement an approach that allows continual learning, as it could help to eliminate inter-individual problems of (S)ER.

3.2 Stress Detection and Classification

Stress and emotions undoubtedly have a close relationship. Most emotion models, however, do not consider stress an emotion; this includes most theories using the term “basic emotions” [42] as well as Russell’s circumplex of emotions [43], although whether a person is stressed is certainly essential context for evaluating the importance or severity of an expressed affect. This holds especially for long-term stress, which “can potentially cause those cognitive, emotional and behavioral dysfunctions” [44]. To add this context, it could be important to monitor stress. In the end, the detection of stress could be beneficial not only to the accuracy of SER, but also to other health applications and to general stress monitoring.

3.3 Emotion Detection from Physiological Signals

One lesson from Sect. 3.1 is that ER faces numerous challenges. To gain further insights and increase the accuracy of the predictions, using more input modalities could be beneficial. At the moment, a minimalistic use of modalities is anticipated for “AudEeKA”. Not depicted is a possible system or method for user input for continual learning and for retraining the model(s), since the exact strategy for continual learning has not yet been fully elaborated. Because continual learning is considered crucial, it has to be thoroughly researched, both in the literature and through our own prototype implementations. This research and prototyping can be considered part of the continual-learning-related sub-goal.

3.4 Combining All Approaches

In order to unite all planned classifications, namely SER, stress classification, and ER based on biophysiological signals, fusion approaches must be applied. As there are, to the authors’ knowledge, currently no datasets that fit the needs of this particular approach, relating not only stress but also speech and other biosignals to ER, it is planned to train all models separately on suitable datasets. In future work, we plan to measure the performance of the fused models on the specifically created dataset. The planned architecture of the approach can be seen in Fig. 5. Results from the different classifications are planned to feed into the other classifications, as there may be a synergy. To save space, ER is depicted in Fig. 5 as one model and one output. Generally, the design in Fig. 5 focuses on feature fusion. Due to the multiple types of sensors to be used, data-level fusion would hardly be possible, as it usually aims to combine multiple homogeneous data sources [45]. Furthermore, feature-level fusion has proven successful in ER with other modalities (e.g., [46,47,48]). Since this is also the case for decision-level fusion (e.g., [48, 49] or [50]), the proposed architecture can easily be converted to a decision-level fusion approach. The sub-goals defined in Sect. 1 are also reflected in Fig. 5: the tasks of emotion and stress classification are executed and combined to obtain a better classification result.
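To make the fusion idea concrete, the following minimal sketch shows feature-level fusion by concatenating per-sample speech and biosignal feature vectors before a lightweight classifier, with decision-level fusion (soft voting over per-modality classifiers) as the alternative. The feature dimensions, random data, and choice of SVM are assumptions for illustration, not the project's final pipeline.

```python
# Minimal sketch of the feature-level fusion favoured above: per-sample speech and
# biosignal feature vectors are concatenated before one lightweight classifier.
# Feature dimensions, random data, and the SVM are assumptions for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(speech_feats: np.ndarray, bio_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate the per-sample feature vectors."""
    return np.concatenate([speech_feats, bio_feats], axis=1)

rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(200, 988))  # e.g., emobase functionals
bio_feats = rng.normal(size=(200, 20))      # e.g., HRV/EDA/RSP summary features
labels = rng.integers(0, 4, size=200)       # four hypothetical emotion classes

fused_clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
fused_clf.fit(fuse_features(speech_feats, bio_feats), labels)

# Decision-level fusion alternative: one classifier per modality, combined by
# averaging the predicted class probabilities (soft voting).
speech_clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
bio_clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
speech_clf.fit(speech_feats, labels)
bio_clf.fit(bio_feats, labels)
fused_probs = (speech_clf.predict_proba(speech_feats)
               + bio_clf.predict_proba(bio_feats)) / 2
predictions = fused_probs.argmax(axis=1)
```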

Fig. 5 Planned architecture of the emotion-recognition system

4 Conclusion

The project “AudEeKA” addresses a wide range of challenges but, if executed correctly, also promises major benefits for healthcare in general. This article describes only the first steps and provides a summary insight into the initial considerations and the resulting future developments and research questions. To prove the suitability of the concepts presented here, a variety of implementations and tests must be performed on different datasets, including a dedicated dataset yet to be created. The required tests mainly target suitability for online usage and for usage in the real world or in scenarios close to the use cases. This results in test scenarios involving various noisy backgrounds and people of different ages, genders, and cultural backgrounds. The tests also have to consider fast yet accurate classification and aim to obtain these classification results with minimal use of resources. To reach the best possible results, different classifiers, feature sets, and combination possibilities (feature- or decision-level fusion) have to be implemented and compared.