Abstract
This paper briefly introduces the project “AudEeKA”, whose aim is to use speech and other biosignals for emotion recognition to improve remote, but also direct, healthcare. This article looks at the use cases, goals, and challenges of researching and implementing a possible solution. To gain additional insights, the main goal of the project is divided into multiple sub-goals, namely speech emotion recognition, stress detection and classification, and emotion detection from physiological signals. Similar projects are considered, and project-specific requirements stemming from the use cases are introduced. Possible pitfalls and difficulties, mostly associated with datasets, are outlined. These also emerge from the requirements, their accompanying restrictions, and first analyses in the area of speech emotion recognition, which are briefly presented and discussed. First approaches to solutions for every sub-goal, including the use of continual learning, are presented, together with a draft of the planned architecture for the envisioned system. This draft presents a possible way to combine all sub-goals while reaching the main goal of a multimodal emotion recognition system.
1 Introduction
Rising patient numbers, paperwork, and a declining workforce are leading to tense working schedules in healthcare. In Germany, the average time a physician spends with each patient is reported to be about seven and a half minutes [1]. One of the many problems arising from these issues concerns the communication between patient and physician. Even though it is proven that effective physician–patient communication leads to positive health outcomes [2], a stressful environment leads to reduced empathy [3] and ultimately reduces the quality of care provided, as empathy is a key factor in effective patient encounters [4]. Furthermore, the rising acceptance of AI-aided diagnoses [5], digitalization, and the promise of more accurate diagnoses and better therapeutic outcomes [6] mean that “soft factors” like empathy or stress management are increasingly neglected in medicine. The project “AudEeKA”, short for “Auditive Emotionserkennung für empathische KI-Assistenten” (English: auditory emotion recognition for empathic AI assistants), aims to automate the assessment of affective human states. One use case is supporting medical staff in estimating the emotional and situational health of patients. This could help indicate to the doctor when more time is needed with the patient, e.g., when the patient is in a state of emotional distress. Another use case is remote medical care, where assessing the emotional state of patients is more difficult, because a lacking or distorted overall impression of the patient can be a side effect of communicating through technical means. The assessment of the emotional state becomes especially important when psychological and medical care cannot be provided by telemedicine or even on-site, as is the case in long-duration crewed spaceflight.
Because of the increased difficulty of use in space and with highly selected persons, this can be considered a separate, but most important, use case, as the success of a mission can be coupled with emotional health.
The end goal of “AudEeKA” is to reliably and continuously recognize emotions in humans by collecting multimodal physiological signals. As the use cases require portable edge-computing devices, or at least a solution able to run in environments with little computing power and energy supply, this project will mainly look into solutions running on lightweight, low-resource devices. Furthermore, the computation should be fast enough to work in real time, so it can be used live in a conversation. The result could thus be an empathic AI assistant which classifies emotions and stress continuously in various environments.
To reach this goal, speech signals will first be used to classify emotions. After this first step, it is planned to combine emotion recognition with stress classification using biosignals and possibly speech. Both emotions and stress should be traced, while possibly exploiting and observing their synergy. Later on, more biosignals will be included to improve the accuracy of emotion classification. All of these sub-goals must naturally consider the resource consumption and computing time of the built algorithms and classifiers.
2 Related Work
Emotion recognition (ER) is a topic of long-standing interest and has been researched in human–computer interaction [13], affective robotics [14], E-learning [15], automotive safety [16], customer-care [17] or healthcare [18].
Currently, most approaches to ER are based on speech, text, visual, or physiological data [19]. Those that combine modalities usually do so between speech and text (e.g., [20,21,22]) or speech and visual data (e.g., [23,24,25]). A good overview is given by the paper of Imani and Montazer, which lists both unimodal and multimodal approaches [26]. There is little existing work exploiting the potential of combining speech with other biosignals for ER (like Blood Volume Pulse (BVP), Electromyography (EMG), Skin Conductance (SC), Respiration (RSP), Body Temperature (Temp), Electroencephalography (EEG), Electrocardiogram (ECG), Heart Rate Variability (HRV), Electrodermal Activity (EDA)), even though the literature advises otherwise [27]. The approaches found can be seen in Table 1. Looking at Table 1, it also becomes apparent that many of the found approaches use computationally expensive methods: while a Support Vector Machine (SVM), k-Nearest Neighbours (k-NN), and Linear Discriminant Analysis (LDA) are less costly, a Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), Deep Belief Network (DBN), Probabilistic Neural Network (PNN), Deep Neural Network (DNN), Temporal Convolutional Neural Network (TCN), or Extreme Learning Machine (ELM) are computationally more expensive. Therefore, these last methods are not applicable to “AudEeKA”. It should be noted that Table 1 makes no claim of completeness. Table 1 also shows that many approaches take visual signals, like features from face or posture, into account, which is often seen as a privacy issue.
In order for any ER solution to work, annotated datasets are needed. A good overview of speech datasets is given by Wani, Gunawan, Qadri, Kartiwi and Ambikairajah [28] or the Technical University of Munich [29]. Although these papers list speech emotion datasets in large numbers, there are multiple difficulties when building models for real-life use: there are not only many acted datasets, which translate poorly to the real world, but also manifold different labels (basic emotions in varying numbers, or the valence–arousal scale). The same problem can be observed in datasets using physiological signals (as listed by Shu et al. [30] or by Larradet et al. [31]). In physiological datasets, however, the problem lies not so much in acted emotions, as they are rarely acted, but in the stimuli which elicit the emotions and do not fully resemble the real world. For example, many studies use the International Affective Picture System [32], while others experiment with virtual-reality games (e.g., [33]) or systems (e.g., [34]) or music (e.g., [35]). Analogous to the speech-based datasets, multiple labelling strategies are found in physiological datasets.
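The labelling-strategy problem can be made concrete with a small sketch: a hypothetical mapping from discrete basic-emotion labels (here the Emo-DB set) to coarse valence–arousal quadrants, one conceivable way to harmonize differently labelled datasets. The numeric placements are illustrative assumptions roughly following Russell's circumplex, not values taken from any of the cited datasets.

```python
# Hypothetical (valence, arousal) placements for the Emo-DB basic emotions.
# The exact coordinates are assumptions for illustration only.
EMOTION_TO_VA = {
    "anger":     (-1.0, 0.8),   # negative valence, high arousal
    "fear":      (-0.8, 0.7),
    "disgust":   (-0.6, 0.5),
    "sadness":   (-0.7, -0.5),  # negative valence, low arousal
    "boredom":   (-0.3, -0.8),
    "neutral":   (0.0, 0.0),
    "happiness": (0.8, 0.6),    # positive valence, high arousal
}

def to_quadrant(label: str) -> str:
    """Collapse a discrete emotion label into a coarse valence-arousal quadrant."""
    v, a = EMOTION_TO_VA[label]
    return ("high" if a >= 0 else "low") + "-arousal/" + \
           ("positive" if v >= 0 else "negative") + "-valence"

print(to_quadrant("anger"))  # high-arousal/negative-valence
```

Such a coarse mapping loses information, which illustrates why merging datasets with different labelling schemes is itself a design decision rather than a mechanical step.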
3 Goals and Challenges
Considering the manifold use cases in Sect. 1, but also the related work in Sect. 2, it becomes clear that the end goal mentioned in Sect. 1 can be divided into sub-goals to be ultimately combined, but with pitfalls and difficulties to be considered.
3.1 Speech Emotion Recognition
While there is already a variety of thoroughly described solutions that perform speech ER (SER), most of them have shortcomings with regard to the requirements of “AudEeKA”. For instance, there is a strong shift towards heavyweight deep-learning models, which cannot be considered in “AudEeKA” because of the low-resource setting. Other problems arise when using traditional machine-learning approaches: for example, feature sets have to be carefully chosen, but not in a way that is too specific to the dataset used. The expression of affect is dependent on factors like culture [36], gender [37], or age [38], and the end goal mentioned in Sect. 1 requires the solution to work on a wide range of people (especially important in telemedicine settings) and in various environments, while the resource restriction also has to be considered and favours a minimal feature set.
To get a first understanding of the performance of different feature sets, but also of inter-individual differences, first test results on the Berlin Database of Emotional Speech (Emo-DB), a speech database with a set of basic but acted emotions [39], are on hand in “AudEeKA”. Classifications were done by a small MLP with hidden layer sizes of 340 and 32, a logistic activation function, a maximum of 20 iterations, and an initial learning rate of 0.0035, implemented with scikit-learn [40]. Parameters were chosen based on experience. The feature sets used were taken from Opensmile [41], namely emobase functionals (988 features) and ComParE 2016 functionals (6373 features). For evaluation, a leave-one-out cross-validation (LOOCV) was used, in which the complete data of one subject was omitted as the test set in each iteration. This LOOCV strategy aims for a more realistic test scenario. Figures 1 and 2 show boxplots with different statistical evaluations of the applied LOOCV. One conclusion is that, when working with MLPs, bigger feature sets result in better recognition rates and fewer outliers than smaller ones. That feature sets affect not only overall recognition rates but also emotion-specific recognition rates can be seen in Figs. 3 and 4, which depict the emotion-specific confusion matrices, with the mean recognition accuracy percentage calculated for every emotion. It is apparent that, in general, there are emotions that get mixed up more often than others. Looking at Figs. 1 and 2, it also becomes clear that the recognition rate varies strongly from person to person. Ultimately, it is unrealistic that there is a “one-size-fits-all” model which works well on an individual in the wild. Not only are the inter-individual differences remarkable, there are also few datasets which record speech in the wild (as also seen in Sect. 2).
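The evaluation protocol described above can be sketched as follows. Random features stand in for the Opensmile emobase functionals and the Emo-DB labels, so the resulting accuracies carry no meaning, but the leave-one-speaker-out protocol and the stated MLP hyperparameters match the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-ins: 988 features as in emobase functionals,
# 7 emotion classes and 10 speakers as in Emo-DB.
rng = np.random.default_rng(0)
n_samples, n_features = 120, 988
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 7, size=n_samples)
speakers = np.arange(n_samples) % 10  # speaker id per utterance

logo = LeaveOneGroupOut()  # leave one speaker out per fold
accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=speakers):
    clf = MLPClassifier(hidden_layer_sizes=(340, 32),
                        activation="logistic",
                        max_iter=20,
                        learning_rate_init=0.0035,
                        random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over {len(accuracies)} held-out speakers: "
      f"{np.mean(accuracies):.2f}")
```

Swapping in real Opensmile features for `X` and actual speaker labels for `speakers` reproduces the experimental setup; the per-fold accuracies then directly yield the per-speaker boxplots discussed above.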
Furthermore, a professional environment, as in crewed spaceflight or healthcare, leaves little room for showing emotions. Oftentimes there are pre-existing speech patterns, which further complicate SER. To overcome these barriers, another sub-goal of the project is to create a new dataset. An additional sub-goal is to implement an approach that allows continual learning, as it could help to eliminate inter-individual problems of (S)ER.
3.2 Stress Detection and Classification
Stress and emotions undoubtedly have a close relationship. Most emotion models do not consider stress an emotion. This includes most theories using the term “basic emotions” [42], as well as Russell's circumplex of emotions [43], although whether a person is stressed certainly is essential context for evaluating the importance or severity of an expressed affect. This holds especially for long-term stress, which “can potentially cause those cognitive, emotional and behavioral dysfunctions” [44]. To add this context, it could be important to monitor stress. In the end, the detection of stress could be beneficial not only for the accuracy of SER, but also for other health applications and a general monitoring of stress.
3.3 Emotion Detection from Physiological Signals
One lesson from Sect. 3.1 is that ER faces numerous challenges. To gain further insights and increase the accuracy of the predictions, using more input modalities could be beneficial. At the moment, a minimalistic use of modalities is anticipated for “AudEeKA”. Not depicted is a possible system or method for user input for continual learning and for retraining the model(s), since the exact strategy for continual learning is not yet fully elaborated. Because continual learning is considered crucial, it has to be thoroughly researched in the literature and through own prototype implementations. This research and prototyping can be considered part of the continual-learning-related sub-goal.
3.4 Combining All Approaches
In order to unite all planned classifications, namely SER, stress classification, and ER based on biophysiological signals, fusion approaches must be applied. As there are currently, to the authors' knowledge, no datasets which fit the needs of this particular approach, linking not only stress but also speech and other biosignals to ER, it is planned to train each model separately on suitable datasets. In future work, we plan to measure the performance of the fused models on the specifically created dataset. The planned architecture of the approach can be seen in Fig. 5. Results from the different classifications are planned to be taken into account by the other classifications, as there may be a synergy. To save space, ER is depicted in Fig. 5 as one model and output. Generally, the design in Fig. 5 sets its main focus on feature fusion. Due to the multiple types of sensors to be used, data-level fusion would hardly be possible, as it usually aims to combine multiple homogeneous data sources [45]. Furthermore, feature-level fusion has proven to be successful in ER with other modalities (e.g., [46,47,48]). Since this is also the case for decision-level fusion (e.g., [48, 49] or [50]), the proposed architecture can easily be converted to fit a decision-level fusion approach. The sub-goals defined in Sect. 1 are also reflected in Fig. 5: the tasks of emotion and stress classification are executed and combined to obtain a better classification result.
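The two fusion strategies discussed here can be contrasted in a short sketch. The feature dimensions, class probabilities, and equal fusion weights are illustrative assumptions, not values from the planned system.

```python
import numpy as np

# Feature-level fusion: concatenate per-modality feature vectors so that
# a single classifier sees the joint representation.
speech_feats = np.random.rand(988)  # e.g. emobase functionals
bio_feats = np.random.rand(12)      # e.g. a handful of HRV/EDA statistics
fused_features = np.concatenate([speech_feats, bio_feats])
print(fused_features.shape)         # (1000,)

# Decision-level fusion: combine class probabilities from separately
# trained per-modality models; here a simple equal-weight average.
p_speech = np.array([0.6, 0.3, 0.1])  # per-class probabilities, speech model
p_bio = np.array([0.2, 0.6, 0.2])     # per-class probabilities, biosignal model
p_fused = 0.5 * p_speech + 0.5 * p_bio
print("fused decision:", int(np.argmax(p_fused)))
```

The sketch also shows why the architecture converts easily between the two variants: only the point at which modality streams are merged changes, while the per-modality processing stays the same.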
4 Conclusion
The project “AudEeKA” addresses a wide scope of challenges but, if executed correctly, also offers major benefits for healthcare in general. This article describes only the first steps and provides summary insight into the initial considerations and the resulting future developments and research questions. To prove the suitability of the concepts presented here, a variety of implementations and tests must be performed on different datasets, including the creation of a dedicated dataset. The required tests mainly target suitability for online usage and usage in the real world, or in scenarios which come close to the use cases. This results in test scenarios involving various noisy backgrounds and people of different ages, genders, and cultural backgrounds. The tests also have to consider fast but nevertheless accurate classification results, and aim to obtain these results with minimal use of resources. To reach the best possible results, different classifiers, feature sets, and combination possibilities (feature or decision fusion) have to be implemented and compared.
Data availability
The speech database used in this paper is publicly available at: http://www.emodb.bilderbar.info/download/.
References
Winnat C (2017) Deutsche aerzte nehmen sich rund sieben minuten zeit pro patient
Stewart MA (1995) Effective physician-patient communication and health outcomes: a review. CMAJ 152(9):1423
Nitschke JP, Bartz JA (2022) The association between acute stress & empathy: a systematic literature review. Neurosci Biobehav Rev 144:105003
Dugdale DC, Epstein R, Pantilat SZ (1999) Time and the patient–physician relationship. J Gen Intern Med 14:S34
Budde K, Dasch T, Kirchner E, Ohliger U, Schapranow M, Schmidt T, Schwerk A, Thoms J, Zahn T, Hiltawsky K (2020) Künstliche intelligenz: Patienten im fokus. Dtsch Arztebl 117(49):A–2407
Systeme LS-DPL (2019) Lernende systeme im gesundheitswesen: Grundlagen, anwendungsszenarien und gestaltungsoptionen. Bericht der AG Gesundheit, Medizintechnik, Pflege
Kim J, André E (2006) Emotion recognition using physiological and speech signal in short-term observation. In: Perception and interactive technologies: international tutorial and research workshop, PIT 2006 Kloster Irsee, Germany, June 19–21, 2006. Proceedings. Springer, pp 53–64
Chao L, Tao J, Yang M, Li Y, Wen Z (2015) Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In: Proceedings of the 5th international workshop on audio/visual emotion challenge, pp 65–72
Ranganathan H, Chakraborty S, Panchanathan S (2016) Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE winter conference on applications of computer vision (WACV), pp 1–9
Guo H, Jiang N, Shao D (2020) Research on multi-modal emotion recognition based on speech, eeg and ecg signals. In: Robotics and rehabilitation intelligence: first international conference, ICRRI 2020, Fushun, China, September 9–11, 2020, Proceedings, Part I 1. Springer, pp 272–288
Bakhshi A, Chalup S (2021) Multimodal emotion recognition based on speech and physiological signals using deep neural networks. In: Pattern recognition. ICPR international workshops and challenges: virtual event, January 10–15, 2021, Proceedings, Part VI. Springer, pp 289–300
Wang Q, Wang M, Yang Y, Zhang X (2022) Multi-modal emotion recognition using EEG and speech signals. Comput Biol Med 149:105907
Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32–80
Austermann A, Esau N, Kleinjohann L, Kleinjohann B (2005) Prosody based emotion recognition for mexi. In 2005 IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 1138–1144
Altun H (2005) Integrating learner’s affective state in intelligent tutoring systems to enhance e-learning applications. GETS 2005 3(1)
Lisetti CL, Nasoz F (2004) Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP J Adv Signal Process 2004:1–16
Devillers L, Lamel L, Vasilescu I (2003) Emotion detection in task-oriented spoken dialogues. In: 2003 International conference on multimedia and expo. ICME’03. Proceedings (Cat. No. 03TH8698), vol 3, pp III–549. IEEE
Tacconi D, Mayora O, Lukowicz P, Arnrich B, Setz C, Troster G, Haring C (2008) Activity and emotion recognition to support early diagnosis of psychiatric diseases. In: 2008 second international conference on pervasive computing technologies for healthcare, pp 100–102. IEEE
Saxena A, Khanna A, Gupta D (2020) Emotion recognition and detection methods: a comprehensive survey. J Artif Intell Syst 2(1):53–79
Makiuchi MR, Uto K, Shinoda K (2021) Multimodal emotion recognition with high-level speech and text features. In: 2021 IEEE automatic speech recognition and understanding workshop (ASRU), pp 350–357
Pepino L, Riera P, Ferrer L, Gravano A (2020) Fusion approaches for emotion recognition from speech using acoustic and text-based features. In: ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6484–6488
Ho N-H, Yang H-J, Kim S-H, Lee G (2020) Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 8:61672–61686
Schoneveld L, Othmani A, Abdelkawy H (2021) Leveraging recent advances in deep learning for audio–visual emotion recognition. Pattern Recogn Lett 146:1–7
Perez-Gaspar L-A, Caballero-Morales S-O, Trujillo-Romero F (2016) Multimodal emotion recognition with evolutionary computation for human-robot interaction. Expert Syst Appl 66:42–61
Middya AI, Nag B, Roy S (2022) Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl-Based Syst 244:108580
Imani M, Montazer GA (2019) A survey of emotion recognition methods with emphasis on e-learning environments. J Netw Comput Appl 147:102423
Koolagudi SG, Rao KS (2012) Emotion recognition from speech: a review. Int J Speech Technol 15:99–117
Wani TM, Gunawan TS, Qadri SAA, Kartiwi M, Ambikairajah E (2021) A comprehensive review of speech emotion recognition systems. IEEE Access 9:47795–47814
Technical University of Munich: Eight emotional speech databases used. TUM
Shu L, Xie J, Yang M, Li Z, Li Z, Liao D, Xu X, Yang X (2018) A review of emotion recognition using physiological signals. Sensors 18(7):2074
Larradet F, Niewiadomski R, Barresi G, Caldwell DG, Mattos LS (2020) Toward emotion recognition from physiological signals in the wild: approaching the methodological issues in real-life data collection. Front Psychol 11:1111
Lang PJ, Bradley MM, Cuthbert BN et al (1997) International affective picture system (IAPS): technical manual and affective ratings. NIMH Center Study Emotion Attent 1(39–58):3
Merkx P, Truong KP, Neerincx MA (2007) Inducing and measuring emotion through a multiplayer first-person shooter computer game. In: Proceedings of the computer games workshop
Zhang W, Shu L, Xu X, Liao D (2017) Affective virtual reality system (AVRS): design and ratings of affective VR scenes. In: 2017 international conference on virtual reality and visualization (ICVRV). IEEE, pp 311–314
Kim J, André E (2009) Fusion of multichannel biosignals towards automatic emotion recognition. Multisensor Fusion Integr Intell Syst 35(Part 1):55–68
Matsumoto D (1993) Ethnic differences in affect intensity, emotion judgments, display rule attitudes, and self-reported emotional expression in an American sample. Motiv Emotion 17(2):107–123
Brody LR (1993) On understanding gender differences in the expression of emotion. Hum Feel Explor Affect Dev Mean, pp 87–121
Levenson RW, Carstensen LL, Friesen WV, Ekman P (1991) Emotion, physiology, and expression in old age. Psychol Aging 6(1):28
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B et al (2005) A database of German emotional speech. Interspeech 5:1517–1520
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Eyben F, Wöllmer M, Schuller B (2010) Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM international conference on multimedia, pp 1459–1462
Tracy JL, Randles D (2011) Four models of basic emotions: a review of Ekman and Cordaro, Izard, Levenson, and Panksepp and Watt. Emot Rev 3(4):397–405
Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161
Mariotti A (2015) The effects of chronic stress on health: new insights into the molecular mechanisms of brain–body communication. Future Sci OA 1(3):FSO23
Gao T, Song J-Y, Zou J-Y, Ding J-H, Wang D-Q, Jin R-C (2016) An overview of performance trade-off mechanisms in routing protocol for green wireless sensor networks. Wireless Netw 22:135–157
Gunes H, Piccardi M (2005) Affect recognition from face and body: early fusion vs. late fusion. In: 2005 IEEE international conference on systems, man and cybernetics, vol 4, pp 3437–3443
Hazarika D, Gorantla S, Poria S, Zimmermann R (2018) Self-attentive feature-level fusion for multimodal emotion detection. In: 2018 IEEE conference on multimedia information processing and retrieval (MIPR), pp 196–201
Zheng W-L, Dong B-N, Lu B-L (2014) Multimodal emotion recognition using EEG and eye tracking data. In: 2014 36th annual international conference of the IEEE engineering in medicine and biology society, pp 5040–5043
Sahoo S, Routray A (2016) Emotion recognition from audio-visual data using rule based decision level fusion. In: 2016 IEEE students? Technology symposium (TechSym), pp 7–12
Song K-S, Nho Y-H, Seo J-H, Kwon D-S (2018) Decision-level fusion method for emotion recognition using multimodal emotion recognition information. In: 2018 15th international conference on ubiquitous robots (UR), pp 472–476
Acknowledgements
This work is funded by the Federal Ministry for Economic Affairs and Climate Action and the German Aerospace Center [grant number 50RP2260A]. We acknowledge support by the Open Access Publication Fund of the University of Duisburg-Essen.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Duwenbeck, R., Kirchner, E.A. Auditive Emotion Recognition for Empathic AI-Assistants. Künstl Intell (2024). https://doi.org/10.1007/s13218-023-00828-3