
1 Introduction

While the relation between the psychological status of human beings and their health has been acknowledged in numerous studies [4, 5, 6] over the past years, conventional medicine has failed to exploit this notion. In practice, it is only recently that medical experts, in parallel with routine treatment, have been investing in the improvement of the emotional status of their patients to reinforce the effects of the provided therapy. In the same direction, bioinformatics researchers are investigating methods to better interpret, distinguish, process and quantify sentiments from various human expressions (body posture [3], speech [1], facial expression [2]), all summarized in what is called Affective Computing (AC) or Artificial Emotional Intelligence. Depending on the source of the human expression, affective computing is divided into three main categories: (a) Facial Emotion Recognition (FER), (b) Speech Emotion Recognition (SER), and (c) Posture Emotion Recognition (PER).

The importance of affective computing systems is highlighted by the engagement of many IT giants (Google [7], IBM [9], Microsoft [8]) in implementing systems for real-time affective analysis of multimedia data depicting human faces and silhouettes. As far as healthcare platforms are concerned, integrating equivalent schemes into systems that bear the responsibility of monitoring and managing patients’ biosignals is of great significance to the healing procedure, especially in the case of chronic diseases. In brief, the generation of positive emotions helps keep the patient in a stable psychological condition, which is the basis for fast and efficient treatment [32], whereas negative emotions have the opposite effect. Apart from the integration of affective systems in healthcare platforms, rapid development of emotional AI techniques has been reported in a wide range of areas, namely Virtual Reality, Augmented Reality, Advanced Driver Assistance and Smart Infotainment, as part of a general trend towards alignment with human-centered computing.

In this paper, we describe the design and deployment of a FER system, incorporated into a healthcare management system as a web service, providing functionality throughout the entire lifecycle of the Medical Staff – Kinsfolk – Patient interaction. Motivated by the improved results that a treatment can have when combined with the psychological management of the patient, this system provides the ability to measure and quantify the patient’s emotions in real time, for the medical staff to assess and act upon. Moreover, correlating the emotion measurements with health-related markers collected by the system may lead to the discovery of important new knowledge.

The remainder of this paper is structured as follows: Sect. 2 presents related research work, while Sect. 3 describes the proposed emotion analysis system architecture. Section 4 describes the system in practice and Sect. 5 reports the experiments conducted and the corresponding results. Finally, Sect. 6 concludes the paper.

2 Related Work

As stated earlier, the analysis, recognition and evaluation of human sentiment via pattern recognition techniques does not rely solely on the processing of facial expressions, but also on the quantification of body posture and speech. Focusing on SER, several approaches have been proposed in the literature for the extraction of vocal features and their exploitation in forming appropriate classification models. Methods based on the extraction of low-level features such as raw pitch and energy contour [11, 12] are outperformed by approaches utilizing high-level features with Deep Neural Networks [13], by up to 20% in accuracy. PER is the least explored area of AC. The interpretation of human emotions from body posture in an attempt to assist individuals who suffer from autism spectrum disorder is described in [14], while an approach based on theoretical frameworks investigates the correlation between patterns of body movement and emotions [15]. On the other hand, FER methodologies vary from the exploitation of Deep Belief Networks combined with Machine Learning Data Pipeline features [16], the utilization of a Hierarchical Bayesian Theme Model based on the extraction of Scale Invariant Feature Transform features [17], and the use of the Online Sequential Extreme Learning Machine method [18], to Stepwise Linear Discriminative Analysis with Hidden Conditional Random Fields [19]. In addition, hybrid implementations of the above-mentioned approaches that combine FER and SER are available in the literature, completing the puzzle of affective computing methodologies [20].

In general, affective computing has been widely deployed in the blooming field of electronic healthcare. As examples of such applications, patients’ breathing is managed via emotion recognition carried out by the Microsoft Kinect sensor in [21], while in [22] sentiments are analyzed via a facial landmark detection algorithm in patients suffering from Alzheimer’s disease. Another application in electronic healthcare systems is the detection of potential Parkinson’s disease patients by recognizing facial impairment when expressions associated with specific emotions are formed [23].

An innovative notion concerning healthcare solutions is the hospital bedside infotainment system (HBIS). These systems are designed to enhance medical staff–patient communication and improve the patients’ clinical experience. Combining internet access, video, movies, radio, music, video or telephone chatting with authorized personnel or kinsfolk, and biosignal monitoring in one device connected to the Electronic Health Record (EHR), such a system can prove to be a productive tool for healthcare ecosystems [10]. Furthermore, constant monitoring of patients can assist in the improvement of their health status and lead to early detection of potential setbacks such as outliers in measurements [33], poor medication adherence, and changes in sleep habits.

Despite the fact that HBIS and emotion analysis services exist as stand-alone cloud-based applications, their combination in a single platform is a novel idea with positive effects concerning the timely intervention of specialists and kinsfolk when negative emotions or depression are detected.

3 System Architecture

3.1 Overview

The FER RESTful web service is built to provide functionality as an additional feature of an existing hospital bedside infotainment system and assisted living solution [24]. The target group of this system consists of patients who suffer from chronic diseases or are obliged to stay in rehabilitation centers for long periods due to reduced mobility. Another group of people affected are elderly people who live independently or in remote regions and conduct routine medical teleconsultations with doctors and caregivers [25]. Although the existing system provides numerous features, such as the monitoring of patients’ biosignals through a mobile application while conducting measurements via wearables and Bluetooth-enabled devices, as illustrated in Fig. 1, the contribution of this paper is focused on the real-time video communication functionality through which patients can communicate with their medical experts and kinsfolk on a 24/7 basis. The FER service operates in parallel with the video communication functionality and is called upon the doctor’s request. As mentioned earlier, automated analysis of facial emotion expression is of high importance, especially for patients and elderly people whose health status is strongly connected to their psychological condition and emotion management. The FER service is divided into two modules: (a) the face extraction module and (b) the emotion recognition module. The face extraction module runs in the web browser on the client side, while the emotion recognition module runs on the cloud platform (server side).
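To illustrate this client–server split, the following minimal sketch (in Python with Flask, purely for illustration; the framework, route and field names are assumptions and not part of the deployed system) shows how a cloud-side emotion recognition module of this kind could be exposed as a RESTful endpoint that receives a cropped face image and returns emotion scores:

```python
# Illustrative sketch only: the real service's framework, route and
# response fields are not specified in the paper and are assumed here.
from flask import Flask, request, jsonify
import cv2
import numpy as np

app = Flask(__name__)

def predict_emotions(face):
    # Placeholder for the SURF + Bag-of-Visual-Words + classifier
    # pipeline described in Sect. 3.2.
    raise NotImplementedError

@app.route("/fer/analyze", methods=["POST"])  # hypothetical endpoint
def analyze_face():
    # The client uploads the cropped (and resized) grayscale face image.
    raw = np.frombuffer(request.files["face"].read(), dtype=np.uint8)
    face = cv2.imdecode(raw, cv2.IMREAD_GRAYSCALE)
    scores = predict_emotions(face)   # e.g. {"happiness": 0.8, ...}
    return jsonify(scores)
```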

Fig. 1. Overview of the homecare platform after the integration of the FER service

3.2 The Emotion Analysis Process

In general, the basic skeleton of FER methodologies consists of five steps: (a) preprocessing of images, (b) face acquisition, (c) landmark acquisition (if necessary), (d) facial feature extraction, and (e) facial expression classification. The proposed method, specifically, comprises seven steps, as described in the pseudocode in Fig. 2: (a) frame extraction from the real-time streaming video, (b) face detection, (c) cropping of the picture to the dimensions of the detected face (Fig. 4), (d) resizing of the face picture to 256 × 256 pixels (if needed), (e) analysis of the face picture for emotions, (f) presentation of the emotions to the medical expert during the video conference, and (g) storage of the generated results in the patient’s personal health record.

The analysis of facial images and their classification into seven different sentiments (anger, disgust, happiness, neutral, sadness, surprise, fear) is accomplished by the extraction of Speeded Up Robust Features (SURF), which form a k-dimensional vector as a result of applying the Bag of Words technique to the extracted features. Given a collection of r images, an algorithm that extracts local features is utilized to create the visual vocabulary. In our case, the SURF algorithm [28] extracts n 64-dimensional vectors, where n is the number of interest points automatically detected by the algorithm using a Fast Hessian matrix and, in turn, described by the SURF descriptor, in each one of the r images (Fig. 4). Upon completion of the feature extraction process from the r images, a collection of r × n 64-dimensional vectors is formed, representing corresponding points in a 64-dimensional space. This collection is grouped into k groups using a clustering algorithm (Kmeans++ is utilized). The centroid of each group represents a visual word, resulting in the formation of a visual vocabulary of k visual words. The extraction of SURF features is implemented utilizing ImageJ [26], face detection is based on the OpenCV library [27], while clustering and classification use the WEKA tool [29].
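As a rough sketch of the vocabulary construction described above (the actual pipeline uses ImageJ, OpenCV and WEKA; here OpenCV’s contrib SURF module and scikit-learn’s k-means with k-means++ initialization are assumed as stand-ins, and all function names are illustrative):

```python
# Sketch of Bag-of-Visual-Words vocabulary construction (illustrative only).
# Assumes opencv-contrib-python with the non-free SURF module enabled and
# scikit-learn's k-means (k-means++ init) standing in for WEKA's Kmeans++.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, k=350):
    surf = cv2.xfeatures2d.SURF_create()            # 64-dimensional descriptors
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (256, 256))           # normalize size, as in step (d)
        _, desc = surf.detectAndCompute(img, None)  # n interest points per image
        if desc is not None:
            descriptors.append(desc)
    all_desc = np.vstack(descriptors)               # roughly r x n 64-dim vectors
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(all_desc)
    return surf, kmeans                             # centroids = k visual words

def bovw_histogram(surf, kmeans, img):
    # Encode one face image as a k-dimensional histogram of visual words.
    _, desc = surf.detectAndCompute(img, None)
    words = kmeans.predict(desc)
    hist, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
    return hist.astype(float) / max(hist.sum(), 1)
```

The resulting k-dimensional histograms are the vectors that the classifiers of Sect. 5 would operate on.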

Fig. 2. Emotion analysis process as pseudocode. Green color indicates the functions performed client-side, red color shows the operations performed server-side (Color figure online)

Fig. 3. JAFFE image dataset depicting seven different emotion expressions

Fig. 4. (left) Interest points detected in a dataset image utilizing Speeded Up Robust Features (ImageJ); (right) cropped image utilizing the Haar cascade classifier (OpenCV)

The emotion recognition service is called during a video call (Fig. 5). A sequence of image frames (1 frame per second) is captured during the WebRTC video conference. In order to avoid additional load on the network, the face detection module is executed locally in the web browser. Cropping the image to a face bounding box reduces the amount of data sent from client to server, which in turn results in improved overall performance of the system. This is accomplished by utilizing the recent JavaScript implementation of the OpenCV library, which provides the functionality of OpenCV models in the JavaScript runtime environment of web browsers.
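For illustration, the following Python sketch mirrors the browser-side logic (which in the deployed system runs as OpenCV.js inside the web browser); the server URL and upload field name are the hypothetical ones used in the earlier endpoint sketch:

```python
# Python analogue of the browser-side face detection and cropping step
# (illustrative; the deployed system performs this with OpenCV.js).
import cv2
import requests

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_and_send(frame, server_url="http://example.org/fer/analyze"):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face in this frame, skip the upload
    x, y, w, h = faces[0]                # first detection is used in this sketch
    face = cv2.resize(gray[y:y + h, x:x + w], (256, 256))
    ok, buf = cv2.imencode(".jpg", face)
    # Only the cropped face is uploaded, reducing the data sent to the server.
    resp = requests.post(server_url, files={"face": buf.tobytes()})
    return resp.json()                   # emotion scores returned by the service
```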

Fig. 5. Emotion analysis operational scenario

4 The System in Practice

The functionality of the proposed solution takes place transparently as far as the users are concerned, upon selection by the medical experts. This provides the discreet capability of monitoring and registering the emotional status of patients while performing a regular video conference ‘visit’ (Fig. 6).

Fig. 6. Medical expert interface showing the FER results of the patient during a video conference

The results of FER are returned by the cloud service in JSON format (Fig. 7) and subsequently visualized in the user interface.
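The actual response schema is the one shown in Fig. 7; purely as a hypothetical illustration, a response carrying the seven emotion labels used by the classifier could be handled on the interface side as follows (field names are assumed, not taken from the real service):

```python
# Hypothetical handling of a FER service response; the actual JSON schema
# is the one shown in Fig. 7, and the field names below are assumptions.
import json

sample = '''{"anger": 0.02, "disgust": 0.01, "fear": 0.03, "happiness": 0.78,
             "neutral": 0.10, "sadness": 0.04, "surprise": 0.02}'''

scores = json.loads(sample)
dominant = max(scores, key=scores.get)   # emotion highlighted for the expert
print(dominant, scores[dominant])        # e.g. "happiness 0.78"
```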

Fig. 7. Sample response of the FER service

Testing the system in practice was performed by conducting 50 video sessions of 1-min duration. In these sessions, the client-side load was handled by a PC with a quad-core Intel Core i5-7400, while the server side (cloud services) was deployed on an IaaS cloud environment with two cores of an Intel Xeon E5-2650 CPU. The internet connection between the two sides was provided by a typical 24 Mbps ADSL connection. The resolution of the images captured by the camera was set to 640 × 480 pixels. The average times in milliseconds for the basic operations conducted on the client side and the server side are depicted in Tables 1 and 2, respectively. The average time for uploading the cropped image file from client to server is 70 ms. Observation of the measurements on both sides shows that the most time-consuming operation is the feature extraction from the cropped image (server side), followed by the uploading of the image file (client side). In addition, operations performed on the server side are far more expensive in time than those on the client side, which was expected and strategically planned in order to offload all computationally demanding tasks from the web browser.

Table 1. Average time allotment for performing operations on the client side
Table 2. Average time allotment for performing operations on the server side

Further experimentation on the requirement of running face detection on the front end is presented in Table 3. The table illustrates the overhead produced in terms of network, browser memory and CPU for two scenarios: one with the image size set to 320 × 240, indicated as (s) (for small), and the other set to 640 × 480, indicated as (l) (for large).

Table 3. Resource allotment for face detection on the web browser.

When idle, images are processed at 640 × 480; therefore, no idle measurement exists for the (s) scenario. The experiment was conducted using the Mozilla Firefox browser (version 66.0), but the module also operates in Opera 58.0.3135.117 (64-bit) and Google Chrome 73.0.3683.86 (64-bit) without any issues. The experiments demonstrated that memory consumption is only insignificantly affected in all scenarios, whereas large variations are evidenced in data length, as expected.

5 Experimental Results

While the main objective of this paper is to present the integration of a FER web service into a homecare platform, initial results for two classification scenarios on the JAFFE [30] dataset are provided. The first scenario splits the dataset into two emotional categories (positive and negative emotions, under the assumption that anger, fear, disgust and sadness are negative emotions), while the second scenario uses seven emotional categories (anger, fear, disgust, neutral, happiness, surprise, sadness - Fig. 3), with the utilization of various classifiers. The procedure is conducted using 10-fold cross-validation on the whole JAFFE dataset (214 images, 256 × 256 pixels, grayscale). In order to discover the most efficient space representation of the training dataset, extensive testing of the Bag of Visual Words scheme was conducted. The Kmeans++ method (350 clusters, 70 seeds) was selected among WEKA’s Kmeans, Canopy and Farthest First implementations for its ability to better distinguish inter-class and intra-class relationships. Kmeans++ improves the initialization phase of the Kmeans clustering algorithm by strategically selecting the initial seeds [31].

The accuracy of the emotion detection module is provided in Table 4. A Multilayer Perceptron (learning rate: 0.3, momentum rate: 0.2, epoch number: 500, threshold for number of consecutive errors: 20) reaches 93.48% classification accuracy for the first scenario, while the K Star classifier (manual blend: 20%, missing values replaced with average) from the WEKA library achieves the best accuracy (84.03%) for the second scenario.
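For reference, a rough scikit-learn analogue of this 10-fold cross-validation setup is sketched below (the reported results were obtained with WEKA’s implementations, and K Star has no direct scikit-learn counterpart; X and y are assumed to hold the Bag-of-Visual-Words histograms and the corresponding emotion labels):

```python
# Rough scikit-learn analogue of the 10-fold cross-validation described above;
# illustrative only, since the actual experiments were run in WEKA.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def evaluate(X, y):
    # Learning rate, momentum and epoch count mirror the quoted MLP settings.
    mlp = MLPClassifier(hidden_layer_sizes=(100,), solver="sgd",
                        learning_rate_init=0.3, momentum=0.2,
                        max_iter=500, random_state=0)
    scores = cross_val_score(mlp, X, y, cv=10)
    return scores.mean()

# Binary split of the first scenario (labels is an array of the seven emotions):
# y = np.where(np.isin(labels, ["anger", "fear", "disgust", "sadness"]),
#              "negative", "positive")
```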

Table 4. Classification accuracy results for JAFFE dataset

6 Conclusion

Whereas other affective computing systems operate as stand-alone applications, this paper presents an innovative facial emotion recognition web service, integrated in a healthcare information system for the monitoring and timely management of emotional fluctuations of elders and patients with chronic diseases, as part of a human-centric treatment. The provided functionality of classifying faces into corresponding sentiments in real time during video communication sessions is of great significance, especially in cases of patients with diseases related to their psychosomatic condition. Future work will focus on the realization of a service that can execute emotion recognition in the web browser. This feature will liberate the application from the restrictions imposed by its cloud-based design. Concerning classification performance, towards improving the accuracy of the current prediction model, other Bag of Words schemes will be tested in an effort to provide weighted and localized information about the visual words. Although the results are promising, further testing with larger and Caucasian-oriented labeled datasets should be performed towards a more thorough evaluation of the system. Correlating emotion recognition results with information related to the biosignals and everyday routine activities of individual patients can lead to the discovery of specific patterns and valuable knowledge for the medical community.