1 Introduction

Emotion recognition systems (ERS) have been popularised by the rising interest in artificial intelligence, especially in instilling emotion in computer programs and robotic machinery [1]. This interest has contributed to the growth of emotional artificial intelligence. Emotional artificial intelligence, or affective computing, is a field of study proposed by Rosalind Picard [2] that integrates computer science, psychology, cognition, and physiology to enable ERS [3].

A system that can recognize the emotional state of the user has huge potential in various fields. Industries that benefit from ERS include healthcare, marketing, e-learning, entertainment, automotive, robotics, and security. In [4], ERS is used to detect the driver’s emotion to promote safe driving. Meanwhile, an application of facial-based ERS for smart home automation can be found in [5]. For healthcare, various ERS applications have been proposed, such as assisting in curing substance addictions [6], monitoring the emotional well-being of the elderly [7], and stress reduction therapy [8]. The ERS healthcare applications have been summarised in [9]. In [10], a multimodal ERS using facial and voice recognition is proposed to improve human–robot communication by recognizing the human’s emotion and generating an appropriate affective response.

ERS can be categorized into two approaches: multimodal and unimodal. An ERS that takes bio-signals as the modality for emotion recognition is known as a physiological-based ERS. The physiological-based multimodal approach combines different biosensors, while the unimodal approach utilises only a single biosensor to detect emotions. The advantages of the unimodal over the multimodal approach are that the data collection procedure is simpler, and the processing time and power required are significantly lower [11].

The heart signal collected using ECG is among the bio-signals used for ERS purposes. Nowadays, ECG devices come in mobile and wearable forms with reliable signal quality. However, the available ECG data collected for ERS are scarce and come in small sample sizes [9]. This is because data collection is expensive and time-consuming. The COVID-19 pandemic makes it more challenging, as people are advised against physical interaction and contact. The lack of data hinders research and development of ECG-based ERS.

Hence, this work focuses on tackling the small affective ECG data challenge using a novel augmentation method based on digital filters. Data augmentation increases the size of a dataset artificially. Although data augmentation is not common in ECG-based ERS, it is popular in cardiac pathological studies such as detecting arrhythmia [11, 12]. Here, the ECG signals are augmented by filtering the data using six filters: Neurokit, BioSPPy, Pan & Tompkins, Hamilton, Elgendi and EngZeeMod. The filters clean the data and remove noise using different mechanisms, resulting in cleaned signals of dissimilar characteristics. The selected filters are commonly adopted for ECG filtering processes. Combining the six filtered signals with the un-augmented data increases the dataset to seven times its original size. The proposed method is then validated using our own collected data, A2ES, as well as datasets from other researchers, namely AMIGOS [13] and DREAMER [14]. The data in these datasets are sourced from mobile devices, namely Kardia and Shimmer. Both devices are reliable medical diagnostic tools available to consumers. All three datasets are considered small, the smallest being DREAMER with only 414 samples. Five machine learning classifiers, namely k-nearest neighbours (KNN), support vector machine (SVM), decision tree (DT), random forest (RF) and multilayer perceptron (MLP), are used to classify the augmented and un-augmented data according to the binary emotional model (BEM), the affective dimensional model (ADM) of arousal and valence, and the discrete emotional model (DEM). The findings show that the proposed multi-filter augmentation improves the accuracy of all classifiers by between 4 and 49%. KNN benefitted the most from the augmentation, with the best accuracy obtained being 99%.

Section 2 reviews the literature relevant to ECG-based ERS and data augmentation. Section 3 describes the methodology of the study in detail. The experimental settings are provided in Sect. 4. Section 5 contains the results and discussion, while Sect. 6 concludes the paper with suggestions for future work.

2 Literature Review

Owing to the advantages of unimodal ERS and of physiological-based modalities, which offer genuine and hard-to-mask signals, several studies on unimodal ECG-based ERS have been reported. These works are carried out using either open-source datasets or the authors' own datasets. For example, Zong and Chetouani [15] utilized an open-source dataset, the AuBT dataset, in their study. The AuBT ECG signal is filtered using an adaptive low-pass filter before the features are extracted using the fission and fusion of the Hilbert Huang Transform (HHT). The data are then classified according to DEM classes using SVM, where 56% classification accuracy is achieved using fusion features and 69% using fission features.

Meanwhile, Bong, Murugappan, and Yaacob [16] designed an ECG-based ERS with their own collected dataset. The ECG signals are filtered using an elliptic bandpass filter and a discrete wavelet transform (DWT). Three time-domain features are extracted, one of which is heart rate (HR). The results show that KNN performed better than SVM, with around 10% difference across the different training and testing splits. Xiefeng et al. [17] also use their own dataset. The pre-processing is done using a Butterworth low-pass filter. The authors extracted unspecified HRV features from the heart sound. A genetic algorithm is utilised to select the best feature combination, where 89.6% and 82.3% accuracy are achieved for valence and arousal, respectively. The accuracy for the combination of both scales using SVM is 72.9%. In another ERS work [18], ECG signals are collected for the study. The ECG signals are pre-processed using a 1–60 Hz bandpass filter to remove baseline drift. The second filtration is done using a band-stop Butterworth filter with cut-off frequencies of 49–51 Hz. The features extracted are HR, HR stability and HR power, and the classification is done using SVM. HR stability performed the best with 84.2% accuracy, followed by HR power and HR.

Katsigiannis and Ramzan [14] proposed an ECG-based ERS as part of their study. The PQRST statistical features and heart rate variability (HRV) features are extracted directly from the raw ECG signals collected. The extracted features are classified using SVM, and the accuracies achieved are around 62%. The dataset from this study is named DREAMER and is made available to other researchers. The ECG data from this dataset are adopted in this study.

Next, Correa et al. [13] proposed a dataset called AMIGOS, which is another dataset that is open to other researchers. The dataset has ECG, electroencephalogram (EEG) and galvanic skin response (GSR) data. In the original AMIGOS work, a unimodal ERS is built from the ECG signals, where HR and HRV features are extracted. The classification is done using SVM. Since the data were collected using short and long video scenarios, the accuracy presented is divided into three parts. The short video scenario achieved 53% and 55% accuracy for valence and arousal. The long video scenario yields 55% and 54% accuracy, while the results for the combination of both scenarios are 54% and 55%. AMIGOS is also adopted in this study.

Sarkar and Etemad [19] performed an ECG-specific study on the AMIGOS and DREAMER datasets. The raw ECG signals are filtered using a high-pass infinite impulse response (IIR) filter with a passband frequency of 0.8 Hz. The filtered signal is then normalised using Z-score normalisation. Spatiotemporal features, which are not specified in detail, are extracted and classified. The self-supervised convolutional neural network (CNN) results show a slightly better performance than the previous literature, with around 85–89% accuracy on both datasets. Siddharth, Jung, and Sejnowski [20] also adopted AMIGOS and DREAMER in their study. Additionally, they used the data from MAHNOB-HCI [21]. Raw ECG signals from the datasets are pre-processed using a moving average filter with a window length of 0.25 s. Then, HRV and spectrogram features are extracted. An extreme learning machine is used for classification. For AMIGOS, the individual accuracy for the ECG classification is approximately 82% for both valence and arousal, while for DREAMER, the results are around 80%. For MAHNOB-HCI, the accuracies achieved are around 78%. The accuracy reported for both AMIGOS and DREAMER is better than that of the original works.

In [22], the authors adopted the AMIGOS dataset. The raw ECG signal is pre-processed using Pan-Tompkins QRS detection and filtered with 0.5–15 Hz cut-off frequency. Various features from time, frequency and nonlinear domain are extracted. The classifier used is a deep convolutional neural network (DCNN) and the accuracy reported for valence is 71% and for arousal, 81%.

Subramanian et al. [23] extracted HR and HRV features from ECG, along with other statistical features, from their dataset ASCERTAIN. The classification results using Naïve Bayes are better than those using SVM, with 60% versus 56% for valence and 59% versus 57% for arousal. The ASCERTAIN dataset is open to other researchers.

Chen et al. [24] use one of the largest affective datasets, the DECAF dataset [25]. The pre-processing steps consist of applying a Butterworth filter and extracting some features, including HR. To the best of our knowledge, this is the only ECG-based ERS study that has adopted data augmentation. A generative adversarial network (GAN) is applied to increase the number of ECG signals. GAN is the most popular data augmentation technique available [26], while more conventional augmentation methods include noise introduction, signal flipping, and sine/cosine shifting [27]. The GAN generator creates fake but high-quality ECG signals, while the discriminator compares them against the real signals. If the discriminator can no longer tell which signal is fake, then the generator has successfully created ECG signals that are almost identical to the original ones. The ERS performance shows an increase in accuracy when a higher ratio of augmented data is added alongside the original data to assist the classification. Before augmentation, the study reported an accuracy of approximately 58%. The more synthetic ECG is introduced, the higher the accuracy achieved. The average results for SVM and RF are above 63% for the combined valence and arousal scale. The drawbacks of GAN are that its reliability is questionable and that the technique is complex and requires high computing power due to its dependency on deep neural networks [28].

Table 1 summarises the reviewed works on ECG-based ERS. The sample sizes of the datasets reviewed are relatively small. This is one of the main challenges in the field. A learning algorithm thrives on a large dataset, which allows better pattern recognition during the training phase. Additionally, it is seen that by filtering the ECG signals, the ERS can achieve a better result. For example, [19] and [22] applied filtering and reported improvements in accuracy compared to the original works of AMIGOS and DREAMER. However, existing works only used one type of filtering technique for data pre-processing or noise elimination; none of them used multiple filters for data augmentation. Augmentation is a data manipulation technique that synthetically increases the data count by modifying the original data [29]. Even though data augmentation is popular for amplifying data size [16,17,18], it is not widely adopted in ECG-based ERS. Among all the research reviewed here, only [24] incorporates data augmentation; however, the accuracy reported is below 70%.

Table 1 Studies of ECG-based ERS

Several non-ECG-based ERS studies have adopted data augmentation. Luo [30] applied a conditional Wasserstein GAN (CWGAN) framework to EEG data as an augmentation process to enhance the ERS. The technique generates realistic EEG data in differential entropy form from raw EEG data. The generated data are classified into levels of data quality, and only high-quality data are appended to the training models. After augmentation, the accuracy increases by 2.97%, 9.15% and 20.13%, respectively. Since CWGAN is a variation of GAN, the drawbacks are as previously discussed. Chatziagapi et al. [31] implemented augmentation to rebalance the class labels of speech audio data. The GAN technique is used to generate synthetic spectrograms to increase the counts of the minority emotional classes. The magnitude of augmentation is varied from 0.4, 0.6 and 0.8 up to a fully balanced dataset. With fully balanced class labels, the accuracy improves by 5% and 10%. Data augmentation is popular with images. In a work on facial expression-based ERS [32], “Augmenter”, an open-source library, is used to augment the images by rotating, flipping, blurring, sharpening, embossing, and skewing them. The augmentation enlarges the data size to suit deep learning training. Meanwhile, image augmentation is used in [33] to increase the size of the data 10 times and avoid overfitting.

Data augmentation is commonly used in medical applications. For example, a novel ECG augmentation technique is proposed in [34], where the study tries to solve the problem of an imbalanced dataset for atrial fibrillation (AF) detection. In a clinical setting, it is challenging to obtain ECG signals that contain AF traits from a diverse patient background. Thus, the majority of the ECG signals are from healthy patients, while the pathological signals suffer from data deficiency. First, the ECG signals are duplicated and concatenated. Then, the extrapolated ECG signals are resampled through a randomly selected augmented sequence. As a precautionary measure, the resampling permutations are constrained not to produce an exactly identical sample. The study reports that after balancing the dataset with data augmentation, the training accuracy and the F1-scores increased significantly. This technique is similar to augmentation through geometric transformation in image-based data. The repetition in this technique may cause low variance and high bias in the augmented sample data. Therefore, careful attention is needed, as this technique is not applicable everywhere [29].

Nonaka and Seita [35] tackle the issue of insufficient AF data with RandECG, a mixture of random ECG augmentation techniques. Various signal transformation methods are explored to introduce variations in the ECG signals. The transformations include scaling, flipping, dropping, shifting, cut-out, and noise additions such as square pulses and Gaussian noise. The observations show a relative improvement in detection accuracy of up to 3.51%. The disadvantage of these transformations is that noise addition is considered a non-ideal way of increasing the data count. This technique relies on a traditional signal-based augmentation process that adds more noise to an already noisy ECG signal. When too much noise is added to the raw signal, it pollutes the data and obscures the affective information [36].

The works that used data augmentation for ERS and ECG are tabulated in Table 2.

Table 2 Data augmentation for ERS and ECG

3 Methodology

This work proposes a new application of filters, namely data augmentation. As discussed in the previous section, data augmentation improves system performance. Meanwhile, in the existing studies, filtering ECG signals before classification contributed to better performance. However, those works implemented one filter only, and solely for data pre-processing: given X signals, their filtering generated X filtered signals. In this study, multiple filters are applied to the signals and the outputs of all the filters are combined to increase the number of signals for classification training. Specifically, the ingested data are pre-processed with six different ECG filtration techniques and the filtered outputs are combined with the raw signals, so that X raw signals yield up to 7X signals. To study the effectiveness of the proposed augmentation technique, an ERS is built using several machine learning algorithms.

Figure 1 shows the flowchart of the processes. The proposed ERS has four main phases, namely pre-processing and data augmentation, feature extraction, data cleaning, and classification and performance assessment. During pre-processing and data augmentation, the noise in the ECG is cleaned using the six chosen filters. Besides cleaning the ECG signals, this step also acts as the proposed data augmentation technique. The raw ECG signals and the filtered ones are then combined, and the features are extracted. The extracted features include heart wave detection such as PQST detection and R-peak detection, HR statistical features, and HRV feature derivations.

Fig. 1 Overview of the methodology

Next, the data are cleaned and separated into two parts. The first part consists of HR and HRV features extracted from raw ECG signals only, while the second part combines the raw and filtered ECG features. The purpose of splitting the pipeline is to compare the performance before and after data augmentation. In both pipelines, the cleaned features are scaled and standardised before being split into training and testing sets.

The last process is the classification and performance assessment. An exhaustive classification using grid search upon five machine learning techniques is implemented on the training set. The assessment is done based on testing accuracy comparison and statistical analysis.

3.1 Phase 1: Pre-processing and the Proposed Data Augmentation

Instead of noise addition or falsification, this study proposed a novel data augmentation technique through multiple ECG filtrations. Six types of ECG filters are chosen, and each of them cleans the raw and noisy signals in certain distinct ways.

The first filter is Neurokit’s built-in standard filter proposed by Makowski et al. [37]. The ECG signal and its sampling rate are the two parameters passed to the filtering algorithm. The Neurokit filtering method removes slow drift and DC offset using a 5th-order high-pass Butterworth filter. The Butterworth high-pass filter is shown in Eq. (1), where \(n=5\). The low cut-off frequency, \({f}_{c}\), is set to 0.5 Hz. The input frequency being filtered is represented by \(f\). The Neurokit method also filters out 50 Hz powerline noise by smoothing the signal with a moving average kernel whose width is one period of 50 Hz.

$$H\left(f\right)=\frac{1}{\sqrt{1+{\left(\frac{{f}_{c}}{f}\right)}^{2n}}}$$
(1)
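As a quick numerical illustration of Eq. (1), the following sketch evaluates the Butterworth high-pass magnitude response with \(n=5\) and \({f}_{c}=0.5\) Hz at a few frequencies. The function name and the chosen test frequencies are hypothetical and serve only to show how steeply the response rises around the cut-off.

```python
import math

def butter_highpass_gain(f, fc=0.5, n=5):
    """Magnitude response of Eq. (1) at frequency f (Hz)."""
    return 1.0 / math.sqrt(1.0 + (fc / f) ** (2 * n))

# Illustrative frequencies: deep stopband, cut-off, and passband
for f in (0.05, 0.5, 1.0, 10.0):
    print(f"{f:>5} Hz -> gain {butter_highpass_gain(f):.4f}")
```

At the cut-off frequency the gain equals \(1/\sqrt{2}\), while slow drift well below 0.5 Hz is almost completely suppressed and the frequency content of the heartbeat passes essentially unchanged.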

The second filtering method is BioSPPy, proposed by Carreiras et al. [38]. The filter removes the ECG signal frequencies below 3 Hz and above 45 Hz through a finite impulse response (FIR) bandpass filter. The technique applies a linear digital filter twice, once forward and once backwards. The combined filter has zero phase and a filter order twice that of the original [39]. The order, \(N\), is set to 0.3 multiplied by the sampling rate, with one added if the result is an even number. This enforces the order to be an odd number. Before the coefficients of the FIR filter are computed, the frequency is normalised to the Nyquist frequency. Then, the FIR filter is calculated based on Eq. (2). Here, \(x\left(n-i\right)\) is the input sample at each tap according to the order of the filter. The coefficients of the filter are represented by \({b}_{i}\), where \(0\le i\le N\). The filtered output signal is represented by \(y\left(n\right)\).

$$y\left(n\right)=\sum_{i=0}^{N}{b}_{i}\,x\left(n-i\right)$$
(2)
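The BioSPPy-style filtering described above can be sketched with SciPy as follows. This is a minimal illustration rather than the toolbox's actual code: the helper names `fir_length` and `fir_bandpass_zero_phase`, the 250 Hz sampling rate, and the synthetic two-tone signal are assumptions, and the odd value \(N\) is passed to `firwin` as the number of taps.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def fir_length(fs):
    """0.3 x sampling rate, plus one if the result is even (forces an odd value)."""
    n = int(0.3 * fs)
    return n + 1 if n % 2 == 0 else n

def fir_bandpass_zero_phase(ecg, fs, low=3.0, high=45.0):
    nyq = fs / 2.0                                   # cut-offs normalised to Nyquist
    b = firwin(fir_length(fs), [low / nyq, high / nyq], pass_zero=False)
    return filtfilt(b, [1.0], ecg)                   # forward and backward pass

fs = 250                                             # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 20 * t) + np.sin(2 * np.pi * 100 * t)  # synthetic, not ECG
filtered = fir_bandpass_zero_phase(signal, fs)
```

In this toy signal the 20 Hz tone lies inside the 3–45 Hz passband and is retained, whereas the 100 Hz tone is removed.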

The Pan and Tompkins [40] filtering method for ECG signals has been around for quite some time and is well known for accurate pre-processing for QRS detection. A first-order \((n=1)\) Butterworth bandpass filter is applied with a lower cut-off frequency, \({f}_{cl}\), of 5 Hz and an upper cut-off frequency, \({f}_{ch}\), of 15 Hz, as in Eq. (3). This method applies another derivative filter to highlight the frequency content and remove background noise. The “lfilter” function from the SciPy library supports both FIR and IIR filtering. The mathematical representation of the IIR filter is shown in Eq. (4). The feedforward and feedback filter orders are represented by P and Q. The terms \(x\left(n-i\right)\) and \(y\left(n-j\right)\) are the input and past output samples at each tap according to the order of the filter. The \({b}_{i}\) and \({a}_{j}\) are the feedforward and feedback filter coefficients, while the filtered output signal is represented by \(y\left(n\right)\).

$$B\left(f\right)=\frac{1}{\sqrt{1+{\left(\frac{f}{{f}_{cl}}\right)}^{2n}}}\cdot \frac{1}{\sqrt{1+{\left(\frac{{f}_{ch}}{f}\right)}^{2n}}}$$
(3)
$$y\left(n\right)=\sum_{i=0}^{P}{b}_{i}\,x\left(n-i\right)-\sum_{j=1}^{Q}{a}_{j}\,y\left(n-j\right)$$
(4)
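To make Eq. (4) concrete, the sketch below evaluates the difference equation directly and compares the result with SciPy's `lfilter`, using coefficients from a first-order 5–15 Hz Butterworth bandpass design. The function `iir_direct`, the 250 Hz sampling rate, and the random test input are assumptions for illustration only; the coefficients are assumed to be normalised so that \({a}_{0}=1\), which is what `butter` returns.

```python
import numpy as np
from scipy.signal import butter, lfilter

def iir_direct(b, a, x):
    """Direct evaluation of Eq. (4); assumes a[0] == 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for i in range(len(b)):              # feedforward part
            if n - i >= 0:
                acc += b[i] * x[n - i]
        for j in range(1, len(a)):           # feedback part
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y[n] = acc
    return y

fs = 250                                     # assumed sampling rate (Hz)
b, a = butter(1, [5, 15], btype="bandpass", fs=fs)
x = np.random.default_rng(0).standard_normal(500)   # arbitrary test input
y_direct = iir_direct(b, a, x)
y_scipy = lfilter(b, a, x)
print(np.allclose(y_direct, y_scipy))
```

The agreement between the two outputs indicates that `lfilter` implements the same recursion, only far more efficiently than the explicit loops.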

Hamilton [41] proposed a similar ECG filtering configuration with a slight variation in the cut-off frequencies. A first-order \((n=1)\) Butterworth bandpass filter is set with a lower cut-off frequency, \({f}_{cl}\), of 8 Hz and an upper cut-off frequency, \({f}_{ch}\), of 16 Hz. The output of the filter design is then used as the coefficients for the IIR/FIR filter.

Elgendi, Jonkman, and Deboer [42] configured a second-order \((n=2)\) Butterworth bandpass filter with a lower cut-off frequency, \({f}_{cl}\), of 8 Hz and an upper cut-off frequency, \({f}_{ch}\), of 20 Hz. Upon returning the cleaned signal, another round of IIR/FIR filtration is done based on the output coefficients.

Engelse and Zeelenberg [43] as well as Lourenço et al. [44] proposed a fourth-order \((n=4)\) Butterworth bandstop filter, as shown in Eq. (5). The cut-off frequencies are 48 Hz and 52 Hz for \({f}_{cl}\) and \({f}_{ch}\), respectively. Similarly, a digital IIR/FIR filter is configured afterwards to remove more background noise.

$$B\left(f\right)=\frac{1}{\sqrt{1+{\left(\frac{f}{{f}_{ch}}\right)}^{2n}}}\cdot \frac{1}{\sqrt{1+{\left(\frac{{f}_{cl}}{f}\right)}^{2n}}}$$
(5)

With the proposed technique, it is observed that the multi-filter approach generates smooth signals with slight variations in ECG patterns. Table 3 summarises the filters used to augment the raw ECG signals in this study. Figure 2 visualizes the effects on a raw ECG signal when it is cleaned with the listed filtering techniques. The filters remove noise, smooth, and amplify the signal differently.

Table 3 List of chosen ECG filters for ECG augmentation

Fig. 2 Augmented ECG using various filters
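The overall idea of the multi-filter augmentation can be sketched as follows. This is a simplified approximation built only on SciPy with the filter types, orders and cut-off frequencies listed in this section; it is not the code of the respective toolboxes. The helper names, the zero-phase (forward-backward) application of every filter, the 250 Hz sampling rate, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import butter, firwin, filtfilt, sosfiltfilt

def _butter(ecg, fs, order, cut, btype):
    """Zero-phase Butterworth filtering (zero-phase is an assumption here)."""
    sos = butter(order, cut, btype=btype, fs=fs, output="sos")
    return sosfiltfilt(sos, ecg)

def augment_with_filters(ecg, fs):
    """Return the raw signal plus six differently filtered copies."""
    ecg = np.asarray(ecg, dtype=float)
    width = max(1, int(round(fs / 50)))          # one period of 50 Hz in samples
    n_taps = int(0.3 * fs) | 1                   # odd FIR length
    fir = firwin(n_taps, [3, 45], pass_zero=False, fs=fs)
    return {
        "raw": ecg,
        "neurokit": np.convolve(_butter(ecg, fs, 5, 0.5, "highpass"),
                                np.ones(width) / width, mode="same"),
        "biosppy": filtfilt(fir, [1.0], ecg),
        "pantompkins": _butter(ecg, fs, 1, [5, 15], "bandpass"),
        "hamilton": _butter(ecg, fs, 1, [8, 16], "bandpass"),
        "elgendi": _butter(ecg, fs, 2, [8, 20], "bandpass"),
        "engzeemod": _butter(ecg, fs, 4, [48, 52], "bandstop"),
    }

fs = 250                                         # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.sin(2 * np.pi * 50 * t)  # synthetic
augmented = augment_with_filters(raw, fs)
print(len(augmented))  # 7
```

Each raw recording therefore contributes seven signals to the feature extraction stage: the original and one output per filter.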

3.2 Phase 2: Feature Extraction

Feature extraction is done using the Neurokit and AuBT toolboxes. Before the HR and HRV features are extracted, PQRST wave detection is performed. Locating these heart wave points is the foundation of feature extraction in ECG analysis. R-peak detection is performed using both the Neurokit and AuBT toolboxes, while only AuBT is used for PQST wave detection.

The HR is measured in beats per minute. Normally, a lower HR implies a healthier heart and vice versa. The advantage of HR is that it is easy to measure and does not need extreme signal accuracy to acquire it. One cycle of a heartbeat can be measured between any two peaks. Using RR peaks is the most common way of detecting HR. The use of HR features for affective recognition is seen in various studies [45]. In this study, a total of 66 HR features are extracted using AuBT. Table 4 shows the summary of the statistical features derived from HR with a different type of reference.
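As a small worked example of deriving HR from RR peaks, the sketch below converts R-peak sample indices into instantaneous heart rate. The function name, the peak positions, and the 250 Hz sampling rate are hypothetical values chosen for illustration.

```python
import numpy as np

def heart_rate_bpm(r_peaks, fs):
    """Instantaneous HR (beats per minute) from R-peak sample indices."""
    rr_seconds = np.diff(r_peaks) / fs       # RR intervals in seconds
    return 60.0 / rr_seconds

r_peaks = np.array([100, 300, 500, 710, 900])   # hypothetical R-peak positions
hr = heart_rate_bpm(r_peaks, fs=250)
print(np.round(hr, 2))
```

Statistical descriptors such as the mean, median, or range of these values are the kind of HR features summarised in Table 4.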

Table 4 Statistical features derived from HR

HRV measures the variability, or specific changes in time, between successive heartbeats, known as the RR intervals. Low HRV indicates a physiological state of stress, while high HRV indicates that a person is recovering from that condition [46]. With a proper analysis method, HRV is considered the most precise non-invasive method for detecting ANS activity [47], but it is difficult to measure while moving or during exercise [48]. HRV also contains evidence of ANS activity traits, including emotional changes within an individual [49]. HRV features are the most used indicators for identifying emotions in a physiological-based system [50].

There are three domains from which HRV features are derived, namely time/temporal and geometric, frequency/spectral, and nonlinear domains. Neurokit features include all three domains while AuBT features are only available in the time domain. There are 52 and 14 HRV features extracted using Neurokit and AuBT respectively. The summary of the features is recorded in Table 5.
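To illustrate what time-domain HRV features look like, the sketch below computes four widely used ones (MeanNN, SDNN, RMSSD and pNN50) from a short list of RR intervals. The function name and the RR values are hypothetical, and these four measures are only a small, assumed subset of the features listed in Table 5.

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """A few standard time-domain HRV measures from RR intervals in ms."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)                                   # successive differences
    return {
        "MeanNN": rr.mean(),
        "SDNN": rr.std(ddof=1),                          # sample standard deviation
        "RMSSD": np.sqrt(np.mean(diff ** 2)),
        "pNN50": 100.0 * np.mean(np.abs(diff) > 50),     # % of differences > 50 ms
    }

rr_ms = [800, 810, 790, 860, 800]                        # hypothetical RR intervals
features = hrv_time_domain(rr_ms)
print({name: round(float(value), 2) for name, value in features.items()})
```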

Table 5 List of HRV features extracted

3.3 Phase 3: Data Cleaning

3.3.1 Data Cleaning

Missing and incomplete data are common in real-world studies. However, these may handicap the statistical prediction as well as introduce bias to the results if not handled properly [51]. So, after the features have been extracted, the data cleaning processes that include empty column removal and feature imputation are conducted.

Empty column removal is a straightforward cleaning process in which any feature that does not return any value is discarded. The ULF and VLF features from HRV return an empty column for all ECG signals. This is because both features need a longer ECG recording for the frequency bands to be detected. Thus, these feature columns are discarded.

The second cleaning method is feature imputation. The extracted features are not always clean. An extracted feature cell can be returned in one of three problematic states: empty, ‘#NAME’, or ‘inf’. For the empty cells and the ‘#NAME’ cells, the imputation is done by averaging the column and filling or replacing the cell with that value. Although this technique is vulnerable to bias, it is the most common imputation technique practised in data science [52]. For the ‘inf’ cells, the replacement is done using the largest value in the column.
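The two cleaning steps can be sketched with pandas as shown below. The function name, the column names, and the toy values are hypothetical. The sketch also assumes that the column mean and maximum are computed over finite values only, since the order in which the replacements are applied is not specified.

```python
import numpy as np
import pandas as pd

def clean_features(df):
    # Non-numeric markers such as '#NAME' become NaN
    df = df.apply(pd.to_numeric, errors="coerce")
    # Empty-column removal (e.g. ULF/VLF on short recordings)
    df = df.dropna(axis=1, how="all").copy()
    for col in df.columns:
        finite = df[col].replace([np.inf, -np.inf], np.nan)
        df[col] = df[col].replace(np.inf, finite.max())   # 'inf' -> largest value
        df[col] = df[col].fillna(finite.mean())           # empty -> column mean
    return df

raw = pd.DataFrame({                      # hypothetical extracted features
    "HRV_SDNN": [40.0, np.nan, 60.0],
    "HRV_LFHF": [1.0, np.inf, 3.0],
    "HRV_ULF": [np.nan, np.nan, np.nan],
    "HR_mean": [70.0, "#NAME", 80.0],
})
clean = clean_features(raw)
print(clean)
```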

3.3.2 Feature Scaling and Standardization

Applying feature scaling or standardization is important to reduce inaccuracy in machine learning models. Different scaling and standardization methods have been shown to affect a model’s performance [53]. Scaling data does not change the shape of the distribution, but it changes the range of the values. Meanwhile, standardizing data changes the values so that the distribution has a mean of zero and a standard deviation of one. Machine learning algorithms such as KNN, SVM, and multi-layer perceptron (MLP) are known to converge faster with scaled or standardized data. In this study, two methods adapted from [54] are implemented: Standard Scaler and Min–Max Scaler.
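The difference between the two scalers can be seen on a small example using Scikit-Learn. The feature values below are hypothetical and stand in for two extracted features with very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: mean HR (bpm) and an HRV measure (s)
X = np.array([[60.0, 0.02],
              [75.0, 0.05],
              [90.0, 0.11]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```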

3.3.3 Train/Test Split

Before the scaled and standardized features are fed to the machine learning models, the data are split into training and testing sets. The split is performed using Scikit-Learn [54], where each feature set is divided into training and testing sets at an 80:20 ratio. For every dataset, the proportion of class labels in the training and testing sets is kept identical to that of the full sample. This is achieved by stratifying the train/test split. The random state is set to an integer for reproducible output across multiple function calls. The fixed random state also controls the shuffling applied to the data before the split.
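A stratified 80:20 split of this kind can be sketched with Scikit-Learn as follows. The toy feature matrix, the 60:40 label distribution, and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)        # toy feature matrix
y = np.array([0] * 60 + [1] * 40)         # toy labels: 60% class 0, 40% class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 80:20 ratio
    stratify=y,          # preserve class proportions in both sets
    random_state=42,     # fixed integer seed for reproducibility
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

Both subsets keep the 60:40 class proportion of the full sample, and repeating the call with the same seed returns the same partition.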

3.4 Phase 4: Classification and Performance Assessment

The classification and performance assessment are done using five supervised machine learning classifiers. These classifiers are chosen to evaluate the effectiveness of the multi-filter data augmentation proposed. The classifiers are KNN, SVM, DT, RF and MLP.

3.4.1 K-Nearest Neighbour

KNN is a non-parametric classification algorithm that is known as a lazy learner. KNN keeps all the training data to make future predictions by computing the similarity between an input sample and its training instance. The tuned hyperparameter values in this study are the number of neighbours, weights and distance metrics. There are various distance metrics available for the KNN algorithm, but Manhattan and Hamming are selected because of their ability to learn the data well.
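A grid search over the three KNN hyperparameters could be set up as in the sketch below. The candidate values in `param_grid`, the three-fold cross-validation, and the synthetic data standing in for the scaled HR/HRV features are assumptions, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix and emotion labels
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

param_grid = {                               # illustrative candidate values
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
    "metric": ["manhattan", "hamming"],
}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          cv=3, scoring="accuracy")
knn_search.fit(X, y)
print(knn_search.best_params_, round(knn_search.best_score_, 3))
```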

3.4.2 Support Vector Machine

SVM is a supervised classification algorithm that separates data into classes using hyperplanes. It also uses the kernel trick to transform the data and optimize the decision boundaries. The hyperparameters tuned for SVM are the kernel function, gamma, and C. Since the assessment involves multidimensional classification, only the radial basis function (RBF) kernel is considered. Gamma controls the degree of curvature of the decision boundary, while C controls the degree of tolerance for margin errors.
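The corresponding search for the RBF-kernel SVM can be sketched in the same way. The candidate values for C and gamma, the cross-validation setting, and the synthetic data are assumed for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled feature matrix and emotion labels
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

param_grid = {                       # illustrative candidate values
    "kernel": ["rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
svm_search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
svm_search.fit(X, y)
print(svm_search.best_params_, round(svm_search.best_score_, 3))
```

A larger gamma lets each training sample influence only its close neighbourhood, which produces a more curved boundary, while a larger C penalises misclassified training samples more heavily.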

3.4.3 Decision Tree

DT has a flowchart-like tree structure, and it is non-parametric. The highest node is considered the root with the branches that represent the decision rule with an outcome leaf node. The hyperparameters tuned for DT are the splitting criteria, minimum sample leaf, minimum sample split and maximum depth. The splitting criteria considered are Gini and Entropy.

3.4.4 Random Forest

RF creates decision trees for different samples and selects the final prediction by means of voting. The logic behind RF is that the more trees are sampled, the more the bias is reduced and the better the model generalizes to the data. Thus, many sampled decision trees make up a forest. The hyperparameters tuned for this algorithm are the number of estimators, the maximum number of features, the maximum depth, and the criterion. As with DT, the criterion is the splitting criterion, and the techniques considered are Gini and Entropy.
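The tree-based classifiers can be tuned with the same grid search pattern. In the sketch below, the hyperparameter names follow Scikit-Learn, while the candidate values, the seeds, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled feature matrix and emotion labels
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

dt_grid = {                                   # illustrative candidate values
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "max_depth": [3, 5, None],
}
rf_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [20, 40],
    "max_features": ["sqrt", "log2"],
    "max_depth": [3, None],
}
dt_search = GridSearchCV(DecisionTreeClassifier(random_state=0), dt_grid, cv=3)
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=3)
dt_search.fit(X, y)
rf_search.fit(X, y)
print(dt_search.best_params_)
print(rf_search.best_params_)
```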

3.4.5 Multi-layer Perceptron

MLP is a neural network based supervised learning algorithm that trains using backpropagation. MLP algorithm from Scikit-Learn is considered a basic deep learning model that propagates the error in a backward direction to update the weights of the hidden layers. The tuned hyperparameters are the activation function, the hidden layer size, the solver, the alpha and the learning rate.
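A grid over the five MLP hyperparameters could look like the sketch below, using Scikit-Learn's `MLPClassifier`. The candidate values, the iteration limit, and the synthetic data are assumptions; with such a small iteration limit, Scikit-Learn may emit convergence warnings for some combinations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled feature matrix and emotion labels
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

param_grid = {                                # illustrative candidate values
    "activation": ["relu", "tanh"],
    "hidden_layer_sizes": [(16,), (32, 16)],
    "solver": ["adam", "lbfgs"],
    "alpha": [1e-4, 1e-2],
    "learning_rate": ["constant", "adaptive"],
}
mlp_search = GridSearchCV(MLPClassifier(max_iter=200, random_state=0),
                          param_grid, cv=3, scoring="accuracy")
mlp_search.fit(X, y)
print(mlp_search.best_params_, round(mlp_search.best_score_, 3))
```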

4 Experimental Setting

4.1 Dataset

This research uses raw ECG signals from two open-source datasets as well as our own primary dataset. Our dataset is named Asian Affective and Emotional State (A2ES) dataset which comes with ECG and PPG signals along with DEM-labelled emotions. Only ECG data are considered in this research. The raw signals are recorded from 47 participants of various Asian backgrounds with 25 samples each. The ECG is collected using KardiaMobile by AliveCor. The stimuli used to elicit the emotions are a collection of videos targeting different emotions.

AMIGOS and DREAMER are open-sourced datasets available for research purposes. AMIGOS dataset [13] consists of physiological signals inclusive of ECG, EEG and GSR recorded from 40 participants with 16 samples each. The ECG signals are recorded using a mobile ECG device called Shimmer. The participants labelled their emotions according to DEM and ADM. Around 51 to 150 s of videos are used as stimuli for emotional elicitation.

DREAMER [14] is a popular physiological-based affective dataset. The dataset holds ECG signals, EEG signals as well as emotion class labels in the format of ADM scales. The dataset comprises 23 participants with 18 samples each, and the ECGs are also recorded using Shimmer. The stimuli used for emotion elicitation are film clips of 65 to 393 s. Table 6 summarizes the details of the datasets used in this study.

Table 6 Dataset details

Due to technical errors during data collection, not all the data from these datasets can be used, as some of the recorded ECGs suffer from signal loss or poor signal quality. This leads to ineffective feature extraction on some ECG signals in the A2ES and AMIGOS datasets. The actual data used from A2ES, AMIGOS, and DREAMER are 1163 out of 1175, 1258 out of 1280, and 828, respectively. The same goes for the augmented samples, where the feature extraction had trouble processing some of the badly augmented ECGs. After augmentation, the sample size is expanded to 8068 out of 8225 for A2ES, 8806 out of 8960 for AMIGOS, and 5796 for DREAMER.
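The reported counts imply a consistent expansion factor, which the following sketch checks. The counts are copied from the text above. Treating all DREAMER samples as usable is an assumption made here because no "out of" total is stated for that dataset, and the helper names are hypothetical.

```python
# Counts as reported in the text: (usable, total) before and after augmentation.
# Assumption: every DREAMER sample is usable, since no separate total is given.
reported = {
    "A2ES": {"before": (1163, 1175), "after": (8068, 8225)},
    "AMIGOS": {"before": (1258, 1280), "after": (8806, 8960)},
    "DREAMER": {"before": (828, 828), "after": (5796, 5796)},
}


def expansion_factor(counts):
    """Ratio of the total sample count after augmentation to the total before."""
    return counts["after"][1] / counts["before"][1]


def usable_fraction(usable, total):
    """Share of samples that survived feature extraction."""
    return usable / total


# The A2ES total follows from 47 participants with 25 samples each.
a2es_total = 47 * 25

for name, counts in reported.items():
    factor = expansion_factor(counts)
    kept_before = usable_fraction(*counts["before"])
    kept_after = usable_fraction(*counts["after"])
    print(f"{name}: x{factor:.0f}, usable {kept_before:.1%} -> {kept_after:.1%}")
```

Under these figures, each dataset grows sevenfold after augmentation, and about 98% or more of the samples remain usable in every case.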

The distribution of class labels after augmentation in the A2ES, AMIGOS, and DREAMER datasets is shown in Fig. 3. For BEM, the class labels are either positive or negative, where negative emotions are those that contribute towards stress. For ADM, scales of high, neutral, and low are considered for both valence and arousal. The class labels for DEM in this study are happy, sad, anger, fear, disgust, surprise and neutral.

Fig. 3 Distribution of the datasets' emotional class labels

Out of 8086 samples, 58% are labelled positive and 42% are labelled negative in A2ES BEM. For ADM-valence, 42% are labelled as low, 32% as high, and the remainder as neutral. For ADM-arousal, slightly above half of the samples are labelled high, while the rest are divided almost equally between neutral (26%) and low (23%). Lastly, for DEM, the most frequent label is neutral at 26%, followed by happy at 20%. The least frequent label is anger, with 657 signals out of 8086 (8%).

From the AMIGOS dataset, the ratio of positive to negative samples for BEM is 31:19. For ADM-valence, 41% of the data are labelled as high, while low and neutral are divided approximately equally. Similarly, for ADM-arousal, 43% of the data are labelled high, only 19% neutral, and the remainder low. The class distribution for DEM is the worst, as the data are not evenly distributed across the 7 classes. The largest share, 2648 out of 8068 signals, is labelled as neutral. The smallest portion is sadness, with only 212 samples. The imbalance between emotion class labels in this dataset is large.

Finally, from the DREAMER dataset, the ratio of positive to negative samples for BEM is 61:39. For ADM-valence, high and low are equally represented at 39% each, while the remaining 22% are neutral. For ADM-arousal, 44% of the samples are labelled as high, 29% neutral, and 27% low. Since the DREAMER dataset does not come with DEM labels, no pie chart is presented for the DEM.

4.2 Hyperparameters

Since the performance of machine learning models depends on their hyperparameter settings, tuning them is necessary for the best results. Table 7 summarizes the hyperparameter settings and the number of variations explored for every classifier.

Table 7 Classifiers with the hyperparameter values for grid search
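A grid search evaluates every combination of the candidate values, so the number of variations per classifier is the product of the number of options for each hyperparameter. The sketch below enumerates a grid for DT. The values are placeholders chosen for illustration, not the settings listed in Table 7, and the key names follow scikit-learn's decision tree parameters as an assumption.

```python
from itertools import product

# Hypothetical grid for a decision tree; the values are placeholders,
# not the settings listed in Table 7.
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


combinations = list(expand_grid(dt_grid))
print(len(combinations))  # 24 = 2 * 3 * 2 * 2
print(combinations[0])
```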

4.3 Evaluation Metrics

The most common evaluation metric for assessing ERS performance is accuracy. It is defined as the ratio of correctly classified data items to the total number of observations. As shown in Equation (6), accuracy is calculated by dividing the sum of the true positives (TP) and true negatives (TN) by the total number of data items, which is the sum of TP, TN, false positives (FP), and false negatives (FN).

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$
(6)
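Equation (6) translates directly into code. The following is a minimal sketch, and the confusion-matrix counts in the example are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as in Eq. (6): correct predictions over all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts for a binary (BEM) classifier.
print(accuracy(tp=50, tn=40, fp=6, fn=4))  # 0.9
```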

5 Results and Discussion

5.1 A2ES Dataset

The results obtained for BEM, ADM and DEM are presented in Table 8. For all emotional models, the testing accuracy after augmentation shows a significant improvement, especially with the KNN classifier.

Table 8 Results from A2ES Dataset for BEM, ADM, and DEM

Based on the observation, the best classifiers before data augmentation for BEM are RF with the standard scaler and MLP with the minmax scaler, both at 61%. The testing accuracy of the remaining classifiers is lower, in the range of 48–56%. After data augmentation, a large jump in accuracy is seen across all classifiers. The best is KNN using the standard scaler with 94% testing accuracy. Using the minmax scaler, KNN achieved 1% less. An increment of more than 40% in testing accuracy is observed for KNN with the introduction of augmented features. The second-best performing classifier is SVM with 81% testing accuracy, followed by MLP, RF, and DT. The latter classifiers only manage around 64–79%.

Next, the testing accuracy for ADM-valence before augmentation is in the range of 38–45%. KNN and SVM using the standard scaler performed the best with 45% testing accuracy. The worst-performing algorithms are DT and MLP with the standard scaler, which recorded merely 38% testing accuracy. With the introduction of data augmentation, the accuracy of KNN doubles. In classifying the high, neutral, and low classes of valence, KNN achieves 91% and 90% testing accuracy using the minmax and standard scalers, respectively. SVM and MLP did not perform as well as KNN but still obtain 73% and 72%, followed by RF and DT. However, MLP using the minmax scaler performed the worst of all, as only a 9% increment is observed in its testing accuracy.

The next part shows the performance comparison for classifying the ADM-arousal scale. Before augmentation, the testing accuracy achieved for all classifiers is within the range of 42–54%. The best classifier is RF for both the standard scaler and minmax scaler, while the worst-performing algorithm is DT with the standard scaler. After augmentation, KNN achieves 91% testing accuracy with both the standard and minmax scalers. The remaining classifiers also improve with the introduction of augmentation, with SVM, MLP, RF, and DT in decreasing order of performance.

Finally, the performance of classifying DEM before data augmentation is very poor. Since the class labels are imbalanced, the machine learning algorithms struggled to recognize and generalize each distinct emotion. Additionally, DEM has 7 classes, which increases the complexity of the classification process. The testing accuracies are only in the ~20% range. At 27%, SVM and MLP with the standard scaler reported the highest accuracy. After data augmentation, the DT algorithm shows the smallest increase in performance, whereas KNN, SVM, RF, and MLP achieve more than a twofold improvement. The best is KNN with 88% testing accuracy for both the standard and minmax scalers. KNN also has the largest performance gain of all classifiers, with an increment of around 65%. This shows that data augmentation is necessary to increase the count of small class samples and to help the machine learning algorithms improve their overall performance. SVM using the minmax scaler achieves 64% testing accuracy, which is the second-best performance reported. The worst classifier is DT, where the outcomes for both scalers are around 38%. The results for MLP and RF fall between those of KNN and DT.

Based on the results from BEM, ADM and DEM, data augmentation, which increases the data size, improves the classifiers' performance and changes their ranking, with KNN reported as the best for all emotional models.

5.2 AMIGOS Dataset

For the AMIGOS dataset, the results obtained for BEM, ADM and DEM are presented in Table 9. As with the previous dataset, testing accuracy is presented. Overall, the effects of augmentation are significant, and the best-performing classifier after augmentation is again KNN. As a comparison, the result from the original AMIGOS work [13] is also presented. However, only ADM results are available.

Table 9 Results from AMIGOS Dataset for BEM, ADM, and DEM

Firstly, the results for positive and negative emotional classification are observed. Before data augmentation, the RF classifier performed the best with over 70% testing accuracy for both the standard scaler and the minmax scaler. SVM with the standard scaler also achieves more than 70%, but SVM with the minmax scaler only achieves 69%. For the other three classifiers, the classification performance is within the 60% range. After data augmentation, the accuracy of all classifiers increased. As with the A2ES dataset, the KNN algorithm performed best, attaining around 95% testing accuracy for both scalers. MLP and SVM with the minmax scaler also show a significant leap, with more than 80% testing accuracy after augmentation. The rest of the classifiers recognize BEM with lower accuracy, but no less than 70%.

The next part covers the results achieved when the model is trained and tested using the ADM valence scale. For all five classifiers, the testing accuracy does not exceed 60% before data augmentation. RF with both scalers reported 55% testing accuracy. SVM also achieves 55% testing accuracy using the standard scaler, but not with the minmax scaler. This accuracy is similar to that of the original study in [13]. The maximum accuracy reported is 56%, by KNN with the standard scaler. MLP scores the lowest with only 52% on both scalers. After augmentation, the accuracy of all classifiers improved. DT improved the least, with accuracy for both the standard scaler and the minmax scaler still below 60%. The accuracy for SVM, RF and MLP is in the range of 60–80%. Meanwhile, the best classifier is KNN with 93% and 94% accuracy for the standard scaler and minmax scaler, respectively.

The performance of ADM arousal classification is discussed next. Before augmentation, the testing accuracy for KNN, SVM, DT, and MLP is within the 50% range, while for RF the testing accuracy is 60% using both the standard scaler and minmax scaler. The accuracy obtained in [13] is also within this range. A significant accuracy improvement is observed when the multi-filter augmentation is utilised to increase the data size. For SVM, DT, and MLP, the testing results improve to between 68 and 79%, while for RF the accuracy is over 80% on both scalers. The leading algorithm is KNN with 93% accuracy on both the standard and minmax scalers.

The last part covers the results for DEM, where seven basic emotions are classified together. Here, the overall recognition performance before augmentation is comparatively better than for A2ES. The worst testing accuracy achieved is 33% by DT. The highest reported testing accuracy is by KNN, with above 50% accuracy using both scalers. Augmentation helps to improve the classifiers' performance in classifying unbalanced and multiclass datasets, as also observed in the A2ES dataset. The least improved classifier is SVM, which reported only 55% and 68% accuracy. For DT, the testing accuracy reported is 66% and 67%. For RF and MLP, around 70–77% accuracy is reported. KNN is once again the best-performing classifier, with 92% accuracy for both the standard scaler and minmax scaler.

5.3 DREAMER Dataset

For the DREAMER dataset, only BEM and ADM results are available. Therefore, no classification performance for DEM is reported for this dataset. The testing accuracies of each classifier for BEM and ADM are shown in Table 10. As a comparison, the ADM result from [14] is also presented.

Table 10 Results from DREAMER Dataset for BEM and ADM

For BEM before augmentation, DT is the worst classifier with 65% and 69% testing accuracy for the standard scaler and minmax scaler, respectively. MLP with the standard scaler also yields a low accuracy of 67%, while with the minmax scaler the result is better at 70%, which is similar to KNN with the standard scaler and SVM with the minmax scaler. The highest classification accuracy is achieved by RF, with 75% for both the standard scaler and minmax scaler. After applying the proposed augmentation method, a staggering 99% testing accuracy is recorded by KNN. MLP and RF also passed 90% accuracy for both scalers. For SVM with the minmax scaler, the accuracy is 93%, while with the standard scaler it is 84%, which is the lowest among all. DT reported 86% testing accuracy for both scalers.

The results for the classification of valence show that before augmentation the highest testing accuracy achieved is 70%, using SVM with the standard scaler. The lowest accuracy reported is 61%, by DT with both the standard scaler and minmax scaler. Using the minmax scaler, KNN and SVM achieved 64%, while the rest are reported above that but no more than 70%, including the original study from [14]. With a 10% increase, DT is the worst-performing classifier after data augmentation. Next is SVM with the standard scaler, which gains a 12% increase in accuracy, while the rest of the classifiers achieve 90% and above in testing accuracy. SVM with the minmax scaler and RF with both scalers achieved 90%. MLP, on the other hand, achieves 91% and 92%. Ultimately, KNN reported 98% for both the standard and minmax scalers.

Lastly, the results for the three-scale arousal are discussed. Before augmentation, the best testing accuracies are 77% and 76% from KNN. The second-best algorithm is RF with 74% and 71%. For SVM and MLP, the testing accuracy observed is within 60–70%, including the results from [14]. The lowest accuracy achieved is 52% and 53% by DT. After augmentation, the best classifier for classifying arousal is KNN. Both scalers give 99% testing accuracy, which is an almost perfect performance. With slightly lower accuracy, RF using both scalers and SVM with the minmax scaler reported 94% and 90% testing accuracy, respectively. However, SVM with the standard scaler did not perform as well, with a reported accuracy of 76%. MLP with the minmax scaler shows a 2% higher accuracy compared to the one with the standard scaler. Although DT only achieved 72% testing accuracy, the improvement gained from augmentation is more than 20%.

5.4 Statistical Analysis

Non-parametric statistical analysis is applied to examine the significant difference in the overall results recorded. The Wilcoxon signed ranks test [55] and Friedman test [56] with Holm’s post hoc test [57] were conducted as suggested by [58, 59]. The Wilcoxon signed ranks test is done to find the significant difference in the results before and after augmentation as well as between the Standard scaler and MinMax scaler. Next, the Friedman statistical test is applied to find the best classifier before augmentation and the best classifier after augmentation. Lastly, the Holm post hoc test is done if the Friedman test found a significant difference between the classifiers. The analysis is done using Knowledge Extraction based on Evolutionary Learning (KEEL) software [60].

5.4.1 Comparison of Before and After Augmentation

A Wilcoxon signed-rank test is applied to test for a significant difference between the results collected before and after augmentation, regardless of the classifier or the scaler used.

The value of W obtained is 0 and the distribution is approximately normal. The z-value is −9.1035 and the p-value is p < 0.00001. The null hypothesis is rejected, as the results are significantly different at a significance level of 5%. Thus, the proposed augmentation technique has significantly improved the performance of the ERS.
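The reported z-value can be reproduced with the usual normal approximation to the signed-rank statistic, without continuity or tie correction. The sketch below rests on an assumption that is not stated in the text: that the test pairs 110 results, obtained as 11 dataset and emotion-model settings times 5 classifiers times 2 scalers. The function name is hypothetical.

```python
import math


def wilcoxon_z(w, n):
    """Normal approximation to the Wilcoxon signed-rank statistic.

    w: the smaller of the positive and negative rank sums.
    n: number of pairs with a non-zero difference.
    """
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w - mean) / std


# Assumption: 110 pairs (11 settings x 5 classifiers x 2 scalers).
z = wilcoxon_z(w=0, n=110)
print(round(z, 4))  # approximately -9.1035
```

A value of W = 0 means that every paired difference has the same sign, which is consistent with the accuracy improving in every configuration after augmentation.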

5.4.2 Comparison of Classifiers Before Augmentation

A multiple-classifier comparison using the Friedman test is conducted to find the significant difference and the ranking of the classifiers before data augmentation is implemented. Table 11 shows the average ranking of the algorithms, where the best-ranked classifier is RF. SVM, KNN, and MLP are ranked second, third, and fourth, while DT is last. Friedman's statistic considering reduction performance (distributed according to chi-square with 4 degrees of freedom) is 44.5545. The p-value computed by the Friedman test is p < 0.00001. Therefore, the null hypothesis that all the classifiers are on par with each other is rejected. The result is significantly different at p < 0.05.
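The Friedman statistic is computed from the average rank of each classifier across all experimental settings. The following is a minimal sketch of the standard formula without tie correction. The three rows of accuracies are hypothetical and are not taken from the result tables, and the column order KNN, SVM, DT, RF, MLP is chosen only for illustration.

```python
def average_ranks(rows):
    """Rank classifiers within each row (1 = highest accuracy), then average per column."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            # Group tied scores and give each the mean of the ranks they span.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for idx in order[i:j + 1]:
                totals[idx] += mean_rank
            i = j + 1
    return [total / len(rows) for total in totals]


def friedman_statistic(rows):
    """Chi-square form of the Friedman statistic with k - 1 degrees of freedom."""
    n, k = len(rows), len(rows[0])
    ranks = average_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


# Hypothetical accuracies; columns are KNN, SVM, DT, RF, MLP.
rows = [
    [0.56, 0.55, 0.50, 0.61, 0.52],
    [0.45, 0.45, 0.38, 0.44, 0.40],
    [0.70, 0.69, 0.65, 0.75, 0.67],
]
ranks = average_ranks(rows)
statistic = friedman_statistic(rows)
print([round(r, 2) for r in ranks])
print(round(statistic, 4))
```

With five classifiers the statistic is compared against a chi-square distribution with 4 degrees of freedom, which matches the degrees of freedom reported above.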

Table 11 Average ranking of Friedman test for classifiers comparison before and after augmentation

Table 12 shows the algorithms compared using Holm’s post hoc test. The z-value, p-value, and Holm value are tabulated. Holm’s procedure rejects those hypotheses that have an unadjusted p-value ≤ 0.008333. This means that for i1-i6 the pairs are statistically on par with each other while for i7-i10 there is a significant difference between the classifiers’ performance.

Table 12 Holm post hoc test p-values table for classifiers comparison before and after augmentation

5.4.3 Comparison of Classifiers After Augmentation

Multiple comparisons using the Friedman test are conducted to find the significant difference and the ranking of the classifiers after data augmentation is implemented. Table 11 shows the average ranking of the algorithms, where the best-ranked classifier is KNN, followed by RF, MLP and SVM, while DT is last. Friedman's statistic considering reduction performance (distributed according to chi-square with 4 degrees of freedom) is 61.1818. The p-value computed by the Friedman test is p < 0.00001. Thus, the null hypothesis is rejected, as the result is significantly different at p < 0.05.

Table 12 shows the algorithms compared using Holm's post hoc test. The z-value, p-value, and Holm value are tabulated. Holm's procedure rejects those hypotheses that have an unadjusted p-value ≤ 0.016667. This means that for i1-i3 there is no significant difference between the classifiers being compared, while for i4-i10 there is a significant difference. This shows that SVM, RF, and MLP are statistically on par with each other. Most importantly, classification of augmented data using KNN is significantly better than with the other classifiers.
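The two rejection thresholds quoted for Holm's procedure, 0.008333 and 0.016667, correspond to 0.05/6 and 0.05/3. The sketch below shows the step-down rule under two assumptions: that the significance level is 0.05, and that the index i counts down from 10 for the smallest p-value, so that the hypothesis with index i is compared against 0.05/i. The function names and the example p-values are hypothetical and are not taken from Table 12.

```python
def holm_thresholds(num_hypotheses, alpha=0.05):
    """Map each index i (i = m for the smallest p-value, down to 1) to alpha / i."""
    return {i: alpha / i for i in range(num_hypotheses, 0, -1)}


def holm_reject(p_values, alpha=0.05):
    """Step-down Holm procedure; returns a rejection flag per input p-value."""
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    reject = [False] * len(p_values)
    for step, j in enumerate(order):
        if p_values[j] > alpha / (len(p_values) - step):
            break  # first retained hypothesis: all larger p-values are retained too
        reject[j] = True
    return reject


# Ten pairwise comparisons among five classifiers.
thresholds = holm_thresholds(10)
print(round(thresholds[6], 6))  # 0.008333
print(round(thresholds[3], 6))  # 0.016667

# Hypothetical p-values, not taken from Table 12.
print(holm_reject([0.001, 0.2, 0.004, 0.03]))  # [True, False, True, False]
```

Under this reading, a threshold of 0.05/6 means the procedure stops at i6, so i7 to i10 are rejected, and a threshold of 0.05/3 means it stops at i3, so i4 to i10 are rejected.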

5.4.4 Comparison of Scalers

A Wilcoxon signed-rank test is applied to test for a significant difference between the results collected using the standard scaler and the minmax scaler, regardless of the classifier and augmentation.

The value of W is 1597 and the distribution is approximately normal. The z-value is −0.8384 and the p-value is 0.4009. The null hypothesis is therefore retained, as the result is not significantly different at p < 0.05. Thus, the performance of the ERS is not determined by the type of scaler used in this study.
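The signed-rank statistic itself is obtained by ranking the absolute paired differences, discarding pairs with no difference, and summing the ranks by sign. The sketch below illustrates this on hypothetical accuracies given as whole percentages. The last line checks that the reported W and z-value are consistent with 84 non-zero differences under the normal approximation; that count is an inference made here, not a figure stated in the text.

```python
import math


def z_from_w(w, n):
    """Normal approximation for the signed-rank statistic (no continuity correction)."""
    return (w - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)


def wilcoxon_signed_rank(first, second):
    """Return (W, n, z) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied absolute differences share the mean of the ranks they span.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    return w, n, z_from_w(w, n)


# Hypothetical accuracies (in %) for the same models under each scaler.
standard = [94, 81, 64, 79, 72, 91, 88]
minmax = [93, 81, 66, 79, 75, 91, 89]
w, n, z = wilcoxon_signed_rank(standard, minmax)
print(w, n)  # 1.5 4

# Consistency check, assuming 84 non-zero differences.
print(round(z_from_w(1597, 84), 4))  # approximately -0.8384
```

The toy example has too few pairs for the normal approximation to be reliable and is only meant to show the ranking and tie handling.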

6 Conclusion and Future Works

This study is dedicated to tackling the problem of insufficient sample data in designing an ECG-based ERS, which causes low accuracy. The application of multiple filters for data augmentation is proposed here. The proposed augmentation technique increases the number of ECG data and thus the training samples available for the classifiers to learn from. This method is simple, and the selected filters have been proven to be good for ECG signal filtering. ERS models are built to evaluate the effectiveness of the proposed augmentation method. The results from the three selected datasets, A2ES, AMIGOS, and DREAMER, show that the classification accuracy increases after data augmentation is introduced. This is validated by the Wilcoxon signed-rank test. The KNN classifier benefits the most from the introduced augmentation, as observed from the statistical tests conducted. For the scaler used, the study shows that either the standard or the minmax scaler can be used without any significant effect on the accuracy of the ECG-based ERS.

The suggestions for future work include using the same augmentation method to balance the class labels in the dataset, so that the classification bias caused by class imbalance is minimized. Furthermore, the multi-filter data augmentation can be extended to other modalities such as EEG and PPG to observe its effectiveness there. Finally, the number of filters relative to the original data size and the best filter combination for augmentation can be investigated.