1 Introduction

Cloud computing and wearable technologies have started to transform traditional healthcare systems. The focus of healthcare systems has been shifting from treatment to prevention with efficient and reliable personalization. The data collected through wearable sensors can be enormous and can bring new opportunities in many fields such as medical care, fitness, aging, and finance [1]. Wearable computers can monitor various physiological signals such as heart rate, blood pressure, temperature, location, and movement [2]. These physiological signals have been used for various e-health applications. E-health is the healthcare practice supported by electronic processes and communications [3]. A common practice in e-health is to use wearable devices and mobile applications as sensors. Fitness and health monitoring are amongst the most popular applications for self-monitoring health devices. Self-monitoring devices combined with e-health applications may enhance personalized preventive medicine. For high-quality service, scalability and availability of data are key requirements that can be met by cloud computing systems.

One of our motivations is to design an emotion recognition system that can be incorporated into a health monitoring system for elderly people living alone. An increasing number of elderly people require continuous monitoring of their health and emotional states. Seafarers, long-distance drivers, and pilots may also benefit from the proposed system.

Assisting health monitoring systems with emotional data may also improve overall service quality. The proposed framework is a wearable-integrated, cloud-computing-based, emotion-aware healthcare system. The framework collects respiration, plethysmograph, and fingertip temperature data from the subjects and sends these signals to the cloud via mobile networks. Furthermore, it uses machine learning algorithms to process the collected data and to predict the physiological or psychological condition of subjects. Sharing this information with a contextual receiver, e.g., a healthcare organization, is also possible with the proposed framework. There are various studies on emotion recognition using physiological signals. Picard et al. used peripheral nervous system signals, including heart rate variation, skin conductivity, and respiration [4]. Li et al. also proposed an emotion recognition technique based on a method that uses three-channel biosensors to measure users' electrocardiogram, galvanic skin response, and photoplethysmography [5].

Chen et al. proposed an emotion-aware mobile cloud computing framework [6]. They proposed an infrastructure with a mobile terminal where physiological data were collected, a local cloudlet that preprocessed the data to increase quality or reduce data size, and a remote cloud where the analysis of all data was performed to achieve high emotion recognition accuracy. For each user, a personal classifier is built based on historical data associated with the user's emotion labels [6]. Chen et al. also proposed a wearable-computing-assisted system for health monitoring and remote medical care. Their framework had three components: collaborative data collection via wearables, sentiment analysis, and affective interactions. In addition, they proposed a cloud-based approach to speed up computation-intensive analysis of data and to allocate resources dynamically [7]. Tokuno et al.'s study provides insight into the relationship between emotion recognition and healthcare. They did not involve a cloud layer in their work but investigated the use of emotion recognition for military healthcare. They detected emotional changes from natural speaking voice and determined whether the changes were related to mental stress. They represented four different emotion classes. The study found that the duration of soldiers' field stay affected their values for the "joy" and "sorrow" classes. Their results demonstrated that emotion recognition can be used for mental status monitoring and mental healthcare [8]. Lu et al. proposed a prototype cloud-computing-based healthcare information system for elderly people. Their simulation environment consisted of a three-way data-transfer pipeline centered on a cloud sharing service; hospital, healthcare service, and homestay were defined as the three sources of elderly healthcare information. However, the prototype did not involve any smart physiological signal generation or a mobile device terminal; it was concerned only with transferring and storing data on the cloud [9]. Ji et al. proposed a mobile healthcare system based on cloud computing techniques. The system used a middleware on the user side to provide a plug-and-play environment for different combinations of wireless sensors and mobile terminals utilizing different communication protocols. The architecture also included an emergency first-aid system [10]. Henia et al. studied human emotion recognition using electrocardiogram, galvanic skin response, skin temperature, and respiration volume [11].

Koelstra et al. [12] presented a multimodal dataset (DEAP) for the analysis of human affective states. The dataset contains 32 EEG channels as well as peripheral physiological signals. Rather than EEG, we aimed to focus on other modalities and to reduce the number of channels/sensors, since collecting 32-channel EEG signals has substantial limitations in practical use.

In our first study on this topic, we investigated emotion recognition only from galvanic skin response (GSR) signals [13, 14]. Later, we used photoplethysmography (PPG) and GSR signals together and proposed a data-fusion-based emotion recognition method for music recommendation engines [15]. This wearable music recommendation framework utilizes not only the user's demographics but also his/her emotional state at the time of recommendation.

In this paper, we used respiratory belt (RB), photoplethysmography (PPG), and fingertip temperature (FTT) signals simultaneously. To the best of our knowledge, this combination of signals has not been used in a similar framework in the previous studies mentioned above. Also, we have used decision-level fusion as opposed to feature-level fusion [15].

2 Emotion Representation

Two models are commonly used for emotion representation by psychologists: the dimensional model and the categorical (discrete) model. According to the dimensional model, people's emotions can be represented with a limited number of independent affective dimensions. Two important dimensions are arousal and valence: arousal indicates the intensity of emotion, and valence marks the polarity of emotion as either negative or positive [16,17,18,19].

Arousal is related to activity in both mind and body. It indicates the intensity of being awake and alert: when human beings are stimulated, they become aroused, and higher degrees of stimulation produce higher arousal. Arousal therefore indicates how calming/soothing or exciting/agitating a stimulus is, and the arousal scale measures how energized or soporific one feels. Stress, anxiety, anger, and fear are high-arousal emotions, whereas calm and soothed states are low-arousal [18]. Valence indicates the emotional value associated with a certain stimulus and is used to characterize specific emotions. It broadly refers to the positive or negative character of an emotion or some of its aspects. Anger and fear are amongst the emotions with negative valence, whereas joy has positive valence.

The valence-arousal dimensional model, represented in Fig. 1, is widely used for emotion representation [13, 14, 18, 20].

Fig. 1 Valence-arousal model for emotion representation

3 Physiological Signals

3.1 Respiratory Belt (RB)

The respiratory belt (RB, or respiration belt) is a sensor that captures the breathing activity of subjects. It can be worn on either the thoracic or the abdominal area. The amount of stretch in the elastic part is measured and recorded for monitoring respiration rate or depth of breath. Respiration rate typically decreases with relaxation, whereas negative emotions generally cause irregularity in the respiration pattern. In extremely tense situations, respiration may cease for a short duration.

3.2 Photo Plethysmography (PPG)

A plethysmograph measures changes in volume within an organ or the whole body. Photoplethysmography (PPG) is the process of applying a light source and measuring the transmitted or reflected light. The changes in light intensity are associated with small variations in blood perfusion of the tissue and provide information on the cardiovascular system, in particular the pulse rate [21, 22]. The sensor system typically consists of red and infrared (IR) sources and detectors.

Measurements taken from a plethysmograph can be used to compute other physiological parameters such as heart rate (HR), inter-beat periods, and heart rate variability (HRV), which are highly correlated with emotional states.

3.3 Fingertip Temperature (FTT)

Fingertip temperature (FTT) is a sensitive signal that can be used to monitor the state of relaxation. In a relaxed mood, the vessels are dilated and the fingertip becomes warmer; in anxious or tense moods, the vessels constrict and the fingertip becomes cooler. The change in fingertip temperature can be recorded and used for emotion recognition.

4 Data Analysis and Emotion Recognition

4.1 Dataset

In our study, we have used the existing DEAP multimodal dataset [12]. This dataset contains peripheral physiological (Galvanic Skin Response, Respiratory Belt, Plethysmograph, and Fingertip Temperature) and EEG signals of 32 participants for the analysis of human affective states. Each subject was shown 40 different 1-min-long excerpts of music videos, and the subjects' EEG and peripheral physiological signals were recorded as they watched the videos. After watching each video, subjects rated their levels of arousal and valence. The total signal record time for each video is 63 s. The data was downsampled to 128 Hz, electrooculogram (EOG) artifacts were removed, a bandpass frequency filter from 4.0 to 45.0 Hz was applied, and the data was segmented into 60-s trials with a 3-s pre-trial. A sampling frequency of 128 Hz yields 8064 data points per channel.

In this study, we have focused on only the RB, PPG, and FTT signals. We have not considered the EEG and GSR signals, in order to decrease the number of sensors. We have extracted the RB, PPG, and FTT signals from among the recorded channels.
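
As a practical illustration, a minimal sketch of extracting these three channels from the preprocessed DEAP files is given below. It assumes the Python-pickled version of the dataset; the file name and the channel indices are assumptions based on the publicly documented DEAP channel layout and should be verified against the dataset readme.

```python
# Hedged sketch: load one subject's preprocessed DEAP file and keep only the
# respiratory belt (RB), plethysmograph (PPG), and fingertip temperature (FTT)
# channels. The path and the 0-based channel indices below are assumptions taken
# from the public DEAP documentation; verify them against the dataset readme.
import pickle

RB_IDX, PPG_IDX, FTT_IDX = 37, 38, 39      # assumed positions in the 40-channel layout

with open("s01.dat", "rb") as f:           # hypothetical path to one subject's file
    subject = pickle.load(f, encoding="latin1")

data = subject["data"]                     # shape (40 videos, 40 channels, 8064 samples)
labels = subject["labels"]                 # shape (40 videos, 4): valence, arousal, dominance, liking

signals = data[:, [RB_IDX, PPG_IDX, FTT_IDX], :]   # (40, 3, 8064)
valence, arousal = labels[:, 0], labels[:, 1]
```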

4.2 Feature Extraction

In order to represent the physiological signals, each record has been divided into moving windows of varying length, and statistical features have been extracted in the temporal domain. Each 63-s signal has been divided into smaller windows (e.g., 21 segments of 3 s each) as shown in Fig. 2.

Fig. 2 Feature extraction using time windows

Each feature is extracted from signal points using statistical functions as shown in Eq. 1.

$$\begin{aligned} f_i^j = \varphi _i (\mathbf {x}^j) = \varphi _i (x_1^j, x_2^j, x_3^j, \ldots , x_K^j) \end{aligned}$$
(1)

where

  • \(\varphi _i\) : function from \(\mathbb {R}^K\) to \(\mathbb {R}\), where \(\varphi _i\) stands for one of the features listed in Table 1.

  • \(x_k^j\): input signal for sensor j for sample k where \(j \in \{1, 2, 3\}\) and \(k \in \{1,2,..,K\}\)

  • \(f_i^j\): feature extracted from the input signal in a sub-window for sensor j

  • \(\mathbf {x}^j\): input vector that consists of the sub-window points for sensor j

  • N: Number of total sample signal points

  • K: Number of sample points extracted from the input signal in a signal sub-window where K \(\le\) N

Table 1 lists the extracted features and their corresponding formulas [14]. According to this table, the arithmetic mean, maximum value, minimum value, standard deviation, variance, skewness, kurtosis, median, number of zero crossings, entropy, mean energy, moments, and change in signal values have been considered as features. Features have been extracted from each window, and their values across consecutive windows have been concatenated for each subject and each video. The concatenated features form the feature vector.

Table 1 List of 22 features extracted from the data and their formulas. In this table, sgn, \(\oplus\), and P(.) denote the sign function, the xor operator, and probability, respectively. \(S_b\) and \(S_w\) denote the feature scatter between and within emotions
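
To make the windowing concrete, a minimal sketch of the per-window feature extraction is given below, assuming NumPy/SciPy. Only a representative subset of the Table 1 statistics is computed, and the 3-s window (384 samples at 128 Hz) is one of the example settings mentioned above, not necessarily the optimized configuration.

```python
# Hedged sketch of the windowed statistical feature extraction described above.
# Only a few representative Table 1 statistics are computed; window length and
# helper names are illustrative, not the exact configuration of the paper.
import numpy as np
from scipy.stats import skew, kurtosis

FS = 128                 # sampling rate (Hz)
WINDOW = 3 * FS          # 3-s window -> 384 samples

def window_features(window: np.ndarray) -> np.ndarray:
    """Statistical features phi_i(x^j) for one sub-window of one sensor."""
    diff = np.diff(window)
    return np.array([
        window.mean(), window.max(), window.min(),
        window.std(), window.var(),
        skew(window), kurtosis(window), np.median(window),
        np.sum(np.sign(window[:-1]) != np.sign(window[1:])),   # zero crossings
        np.mean(window ** 2),                                   # mean energy
        diff.mean(),                                            # mean change in signal
    ])

def feature_vector(signal: np.ndarray) -> np.ndarray:
    """Concatenate per-window features over the whole 63-s recording."""
    windows = [signal[s:s + WINDOW] for s in range(0, len(signal) - WINDOW + 1, WINDOW)]
    return np.concatenate([window_features(w) for w in windows])
```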

4.3 Feature Selection

Feature selection is a process in which the features that contribute most to classification performance are selected and the rest are dropped. Hence, the feature selection process reduces the dimension of the feature vector and eliminates irrelevant, noisy, and redundant features. In this study, we have used the mRMR (minimum redundancy maximum relevance) algorithm.

Before the application of mRMR, the feature values are normalized to the \([-1,1]\) range using min-max normalization. Min-max normalization linearly transforms a value f to \(f_{norm}\) using Eq. 2, where \(f_{max}\) and \(f_{min}\) are the maximum and minimum values in the feature vector f.

$$\begin{aligned} f_{norm,i} = 2\frac{f_i-f_{min}}{f_{max}-f_{min}}-1 \qquad i=1,\ldots ,L \end{aligned}$$
(2)
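
A one-function sketch of Eq. 2, assuming NumPy:

```python
import numpy as np

def minmax_normalize(f: np.ndarray) -> np.ndarray:
    """Eq. 2: linearly map the feature vector f to the [-1, 1] range."""
    f_min, f_max = f.min(), f.max()
    return 2.0 * (f - f_min) / (f_max - f_min) - 1.0
```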

mRMR uses mutual information (MI) for feature selection [23, 24]. Mutual information difference (MID) and mutual information quotient (MIQ) are the two most commonly used mRMR schemes. mRMR selects a feature subset that characterizes the statistical properties of the target classes.

In order to find an appropriate feature set, different combinations of features have been selected and their relationship with the arousal and valence dimensions has been studied. We have obtained four main feature sets (FS) based on the mRMR ranking scores: Feature Set K (FS-K) contains the top-ranking K features for \(K\in \{10, 14, 18, 22\}\). Hence, FS-\(K_1\) is contained in FS-\(K_2\) (FS-\(K_1\) \(\subset\) FS-\(K_2\)) when \(K_1 < K_2\).
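
The ranking-based construction of the nested feature sets can be sketched as follows. This is a simplified greedy MID-style mRMR ranking for illustration, using scikit-learn's mutual-information estimators; it is not the exact implementation used in the study.

```python
# Hedged sketch of a greedy mRMR ranking with the MID criterion
# (relevance minus mean redundancy), followed by the nested FS-K subsets.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_rank(X, y, n_select=22):
    """Return indices of n_select features ranked greedily by MID."""
    relevance = mutual_info_classif(X, y, random_state=0)     # MI(feature, class)
    selected = [int(np.argmax(relevance))]                    # start from the most relevant
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([                            # mean MI with selected features
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected
            ])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Nested subsets FS-10 through FS-22 built from the ranking:
# ranking = mrmr_rank(feature_matrix, arousal_labels)
# feature_sets = {k: ranking[:k] for k in (10, 14, 18, 22)}
```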

Figure 3 shows feature scores of the mRMR algorithm and Table 2 lists the features included within each feature subset.

Fig. 3 Feature scores: the mRMR score of each feature is shown for the respiratory belt signal, arousal label

Table 2 Feature subsets

4.4 Emotion Recognition Using Classifiers

The main task of classification is to find the label or category of an input represented by its feature vector. The classification process includes a learning (training) step, in which the classifier parameters are selected based on the training set.

Since we represent emotion with the arousal and valence dimensions, we have two categorical output values. Labeling the samples is required in supervised machine learning methods. In our study, we have labeled arousal and valence levels as low and high classes based on their values. In the DEAP dataset, arousal and valence values are in the 1–9 range. Rating values greater than or equal to 4.5 are assigned the label high, whereas values lower than 4.5 are assigned the label low.

$$\begin{aligned} M^j&= \varPhi (\mathbf {f}^j) = \varPhi (f_1^j, f_2^j, f_3^j, \ldots , f_L^j) \\&= \varPhi (\varphi _1(\mathbf {x}^j), \varphi _2(\mathbf {x}^j), \varphi _3(\mathbf {x}^j), \ldots , \varphi _L(\mathbf {x}^j)) \end{aligned}$$
(3)
  • \(M^j\) : Machine learning output for sensor \(j \in \{1,2,3\}\)

  • \(\varPhi\): Classifier that is fed with features

  • \(\mathbf {f^j}\): feature vector for sensor j

  • L: number of features where \(L \in \{10,14, 18, 22\}\)

  • \(\varPsi\): voting function, which returns the most repeated label

Random forest (RF), support vector machine (SVM), and logistic regression (LR) algorithms are used as classifiers for emotion recognition. Each classifier \(\varPhi\) is fed with the extracted features as shown in Eq. 3. The best results are obtained with the RF classifier, an ensemble method in which classification is performed using multiple decision trees [25]. A decision tree is a non-parametric supervised learning method that predicts the value of a target variable by learning decision rules from the data; it partitions the dataset into groups that are as homogeneous as possible in terms of the variable to be predicted. Entropy and information gain are used for attribute selection [26]. The forest of decision trees is constructed using random subsets of features, so RFs are able to cope with high-dimensional data.

During the training of the RF classifiers, several parameters are optimized to achieve the highest performance on the training data. The feature extraction parameters (window size, window overlap, etc.), the feature selection parameter (number of features), and the other hyperparameters of the classifiers are optimized in this process. The optimum parameters are listed in Table 3.

Table 3 Selected classifier and optimized hyper-parameters for RB, PPG, and FTT signals
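
As an illustration of the per-signal training step (Eq. 3) and the hyperparameter search described above, a sketch is given below, assuming scikit-learn. The label binarization uses the 4.5 threshold from Sect. 4.4, while the grid values are placeholders rather than the optimized settings of Table 3.

```python
# Hedged sketch: train one RF classifier per sensor on its feature vectors.
# Grid values are placeholders; the actual optimized settings are in Table 3.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def binarize(ratings, threshold=4.5):
    """Map DEAP ratings in [1, 9] to low (0) / high (1) classes."""
    return (np.asarray(ratings) >= threshold).astype(int)

def train_single_modality(features, ratings):
    """Fit M^j = Phi(f^j) for one sensor, selecting hyperparameters by grid search."""
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}   # placeholder grid
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
    search.fit(features, binarize(ratings))
    return search.best_estimator_

# One model per modality, e.g.:
# models = {name: train_single_modality(X[name], arousal_ratings) for name in ("RB", "PPG", "FTT")}
```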

4.5 Decision Level Fusion

Fig. 4 Emotion recognition pipeline

The main objective of employing fusion is to produce a fused result that provides the most detailed and reliable information possible. Fusing multiple information sources also produces a more efficient representation of the data. Lahat et al. [27, 28] define data fusion as the analysis of several datasets such that different datasets can interact and inform each other. Data fusion techniques combine data from different sources. Data fusion is categorized as feature-level or decision-level. Feature-level fusion requires the extraction of different features from the source data before the features are merged; a feature-level fusion scheme integrates unimodal features before learning concepts, as given in Eq. 4.

$$\begin{aligned} \mathbf {f}&= [\mathbf {f}^1, \mathbf {f}^2, \mathbf {f}^3] \\ M&= \varPhi (\mathbf {f}) \end{aligned}$$
(4)

The main advantages of this scheme are the use of only one learning stage and the ability to exploit mutual information between the modalities. Decision-level fusion, or fusion of classifiers, processes the classification results of prior classification stages by combining the outputs of multiple algorithms to yield a final fused decision. The main goal of this procedure is to take advantage of the redundancy of a set of independent classifiers to achieve higher robustness by combining their results [29], as in Eqs. 5 and 6.

$$\begin{aligned} M = \varPsi (\varPhi (\mathbf {f}^1), \varPhi (\mathbf {f}^2), \varPhi (\mathbf {f}^3)) \end{aligned}$$
(5)
$$\begin{aligned} M = \varPsi (M^1, M^2, M^3) \end{aligned}$$
(6)
  • M : global machine learning model

  • \(\varPsi\): voting function, which returns the most repeated \(M^j\) label

In our study, we used decision-level fusion. In decision-level classification fusion, we classified the modalities individually and then combined the classifier outputs as shown in Fig. 4. These results are obtained using the optimum random forest parameters listed in Table 3.
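
A minimal sketch of the majority-vote combination \(\varPsi\) in Eqs. 5 and 6 is given below, assuming three per-modality classifiers trained as sketched in Sect. 4.4; the helper names are illustrative.

```python
# Hedged sketch of decision-level fusion: each modality is classified separately
# and the per-sensor labels M^j are combined by majority vote (Psi).
import numpy as np

def fuse_decisions(per_modality_predictions):
    """Majority vote over the binary labels predicted for each modality."""
    votes = np.stack(per_modality_predictions)          # shape (n_modalities, n_samples)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# predictions = [models[name].predict(X_test[name]) for name in ("RB", "PPG", "FTT")]
# fused_labels = fuse_decisions(predictions)
```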

5 Emotion Recognition Results and Discussion

In order to determine the performance of the emotion recognition system robustly, we have used the leave-one-subject-out cross-validation (LOSO-CV) method. In LOSO-CV, the data is partitioned into 32 subsets, each containing an individual subject's data. Of the 32 subject subsets, 31 are used for training the model and the remaining one is used for testing. This process is repeated 32 times, each time with a different test subset. The accuracies from the 32 tests are averaged to determine the performance of the emotion recognition system. We have conducted LOSO-based tests for both the arousal and valence dimensions.
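
A sketch of the LOSO-CV loop, assuming scikit-learn's LeaveOneGroupOut with the 32 subject identifiers as groups; the data layout and classifier settings are illustrative.

```python
# Hedged sketch of leave-one-subject-out cross-validation.
# X: feature matrix (one row per video), y: binary labels, subjects: subject id per row.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracy(X, y, subjects):
    """Average accuracy over 32 folds, each holding out one subject's data."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```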

We performed tests using each physiological signal separately. The accuracy values corresponding to different physiological signals are listed in Table 4.

Table 4 Single modality and decision level fusion accuracy results

Using the decision-level fusion described in Sect. 4.5, accuracy improved from 69.86% to 73.08% for arousal and from 69.53% to 72.18% for valence. The results indicate that using multimodal physiological signal sources improved the accuracy compared to using each modality separately. This improvement is shown in Fig. 5.

Fig. 5 Accuracy for both arousal and valence improved by using data from all modalities simultaneously via decision-level fusion

Recognizing arousal and valence values directly from biosensors is a challenging task. We showed that there is a relationship between the RB, PPG, and FTT signals and the arousal and valence dimensions. The results indicate that decision-level fusion of multimodal data yields an improvement in accuracy in both the arousal and valence dimensions.

This work has certain limitations. Firstly, the number of subjects used in the tests should be increased for more robust results. Also, the system performance is dependent on the handcrafted feature set; we have used basic descriptive statistics to extract features from the signals. A broader feature set could be considered, and the effect of each individual feature could also be analyzed.

An application of the emotion recognition system proposed in this work is an emotion-aware health monitoring system for elderly people who live alone. An increasing number of elderly people require continuous monitoring of their health and emotional states. A remote monitoring system may allow early action in situations that endanger an elderly person's life (depression, anxiety, the state before a heart attack or a stroke, etc.).

Seafarers, long-distance drivers, and pilots may also benefit from an emotion-aware monitoring system. Pilots must control their emotions to ensure flight safety [30]. Similarly, drivers are expected to keep control of their emotions and know how to deal with them. Such a monitoring system can be integrated into an airplane, car/truck, or ship cabin. Alternatively, pilots, seafarers, and long-distance drivers may be screened before the beginning of each task. Screening operators of critical systems is an active research area [31,32,33,34].

6 Conclusion

This study demonstrated a framework for emotion recognition using multimodal physiological signals from a respiratory belt, photoplethysmography, and fingertip temperature. It is shown that decision-level fusion of multiple classifiers (one per signal source) improved the accuracy of emotion recognition for both the arousal and valence dimensions.

The results also indicate that emotion recognition from physiological signals is still a challenging task. Although emotion recognition accuracy improved with three input sources in this study, additional sources such as EEG or ECG might be required for more robust and reliable recognition. However, this introduces a trade-off between an ergonomic data collection system and emotion recognition accuracy. We preferred RB, PPG, and FTT because these physiological signals are easy to collect with relatively compact and ergonomic wearable devices.

The proposed emotion recognition system can be used for emotion-aware health monitoring. This may enable systems to generate healthcare actions from the subject's emotional changes over time.