1 Introduction

Attention is an important feature reflecting the mental state of the brain and can be measured using electroencephalography (EEG). The degree of attention is mainly associated with α and β waves [1]. In particular, α waves between 8 and 13 Hz, with amplitudes from 30 to 50 μV, are evident in the EEG of a relaxed participant with closed eyes, whereas β oscillations between 14 and 30 Hz, with amplitudes from 5 to 20 μV, are evident during active attention. Quantifying these frequency-specific features of the EEG therefore makes it possible to probe the level of attentiveness [2,3,4].

Previous studies have shown that, for EEG attentiveness recognition, a k-nearest neighbor (KNN) classifier based on the self-assessment manikin model can yield an average accuracy of 57.03% [5], and a support vector machine (SVM) model of power spectral density has achieved an average accuracy of 76.82% [6]. The accuracy can be increased to 81% when approximate entropy is estimated using fuzzy entropy [7]. For identifying attention during the learning process, KNN combined with correlation-based feature selection (CFS) yields a classification rate of 80.84% [8]. At the single-subject level, the accuracy reaches 89.4% when common spatial pattern filtering is integrated with a nonlinear mutual information method [9]. Taken together, frequency-specific and nonlinear features extracted from the EEG are essential for attentiveness recognition. Therefore, in this study, we propose a method for characterizing levels of attentiveness based on the Hilbert–Huang transform (HHT) and an SVM. HHT and empirical mode decomposition (EMD) have been used to process nonlinear and nonstationary brainwave signals [10, 11] in EEG analysis and clinical applications, such as emotion recognition [12, 13], motor imagery [14], seizure detection [15,16,17], anesthesia monitoring [18, 19], and arousal detection [20]. For attentiveness recognition, HHT combined with an extreme learning machine (ELM) has been proposed and yielded a highest accuracy of 85.5% (average accuracy of 72.1%) [21]. The SVM, a machine learning technique, has gradually become a popular method for high-accuracy classification [22,23,24]. Given a set of training samples for supervised learning, an SVM can build a predictive model from specific EEG features to serve as a classifier for attentiveness recognition.

In this study, we measured brain activity from the frontal area with a one-channel EEG device while participants either solved puzzles shown on a screen or rested. The EEG signals were first decomposed into intrinsic mode functions (IMFs) by EMD, after which the instantaneous frequencies of the IMFs were obtained by the HHT. The resulting marginal spectra (MS) of specific frequency bands and the spectral entropy (SE) were entered into an SVM as feature attributes for the characterization of attentiveness.

2 Materials and Methods

2.1 Data Collection

EEG data were measured by using a commercial mobile EEG monitor (MindWave, NeuroSky) at a sampling rate of 512 Hz [25]. The unipolar recording device has a fixed channel position on the scalp surface of the forehead (Fp1), according to the International 10/20 system [26]. An ear clip (A1) of the device was used to provide a ground reference to filter out the electrical noise.

This study, numbered 201812EM027, was approved by the Research Ethics Committee of National Taiwan University, Taiwan. Twenty participants were recruited: eight males and twelve females, aged 20 to 26 (average age = 21.9). The participants were instructed to sit still in a quiet room while wearing the EEG monitor. Each participant conducted two tasks: (1) paying attention to solving designated spot-the-difference puzzles shown on a screen for 5 min; and (2) relaxing with eyes open and fixated on a blank screen for 5 min. Figure 1 shows the EEG of a representative participant during the two tasks. The dashed line indicates the onset of the rest task. The continuous EEG signal was epoched into a collection of time-locked trials with a length of 1 s. The middle 200 epochs of each task were analyzed to build the classification model. As shown in Fig. 1, the red block contains the attention data (200 epochs), and the green block contains the relaxation data (200 epochs). Thus, 200 attention and 200 relaxation epochs of each of the twenty participants were analyzed to build individualized classifiers. In addition, the last 50 epochs of each task were used as test data to validate the models. The flowchart of this study is shown in Fig. 2. The selected EEG features extracted by Hilbert–Huang analysis (EMD, HT, and MS) were entered into the SVM for attentiveness recognition.
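The epoching scheme above can be sketched in a few lines of NumPy. This is a minimal illustration with simulated data: the array sizes follow the protocol described (512 Hz, 5 min per task, middle 200 epochs for training, last 50 for testing), while the signal itself is random rather than real EEG.

```python
import numpy as np

fs = 512                                   # sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal(fs * 600)        # 10 min of simulated one-channel EEG

epochs = eeg.reshape(-1, fs)               # 1-s epochs, shape (600, 512)
attention, relaxation = epochs[:300], epochs[300:]   # 5 min per task

# Middle 200 epochs per task for training; last 50 per task for testing
att_train, att_test = attention[50:250], attention[250:]
rel_train, rel_test = relaxation[50:250], relaxation[250:]
```

The first 50 epochs of each task are discarded, which also skips the transition period right after the task switch.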

Fig. 1
figure 1

A continuous raw EEG signal of a representative participant conducting the two tasks. The participant paid attention for 5 min (until the dashed line) and then took a break with eyes open for another 5 min. The epochs in the red block were collected as attention data, and those in the green block as relaxation data for further analysis

Fig. 2
figure 2

The flowchart of the proposed method. The EEG data were processed with Hilbert–Huang analysis to obtain the time–frequency information. The extracted features were the input of the support vector machine to build a classifier for attentiveness recognition

2.2 Hilbert–Huang Transform

HHT [27] is a time–frequency–energy method for the analysis of nonlinear or non-stationary data sets; its process can be divided into two parts: EMD and the Hilbert transform (HT) [28]. To extract a basis for the HT, EMD empirically generates a finite set of components from the original data. The repetitive extraction of EMD is based on the oscillatory modes and waveforms of the signal in the time domain. An IMF of a signal is a function with (1) the same number of zero-crossings and extrema, and (2) symmetric envelopes defined by the local maxima and minima. Meeting these conditions, the IMFs form an orthogonal basis for the original signal. Owing to its simple oscillatory form, an IMF is guaranteed to have a well-behaved HT, from which a meaningful instantaneous frequency can be obtained. The sifting process of EMD is repeated as many times as necessary to convert the extracted signal into an IMF. Thus, a signal, \(x(t)\), can be represented as

$$x(t) = \sum\limits_{i = 1}^{n} {c_{i} (t) + r_{n} (t)} ,$$
(1)

where \(c_{i} (t)\) and \(r_{n} (t)\) are the ith IMF and the residue, respectively [14, 27]. Each raw EEG epoch in this study was decomposed into 8 IMFs and a residue; the number of IMFs was determined by the sample length and the stopping criteria of the sifting process.
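A toy sifting loop illustrates the decomposition of Eq. (1). This is a deliberately simplified sketch (a fixed number of sifting passes, cubic-spline envelopes, and a crude extrema-count stopping rule), not the production EMD used in the study; by construction, the extracted IMFs and residue always sum back to the input signal.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(h, t):
    """One sifting pass: subtract the mean of the upper and lower envelopes."""
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) < 4 or len(minima) < 4:
        return None                        # too few extrema: treat h as the residue
    upper = CubicSpline(t[maxima], h[maxima])(t)
    lower = CubicSpline(t[minima], h[minima])(t)
    return h - (upper + lower) / 2.0

def emd(x, t, max_imfs=8, n_sifts=10):
    """Toy EMD with a fixed number of sifting passes per IMF."""
    imfs, residue = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        h = residue.copy()
        stopped = False
        for _ in range(n_sifts):
            h_next = sift_once(h, t)
            if h_next is None:
                stopped = True
                break
            h = h_next
        if stopped:
            break
        imfs.append(h)
        residue = residue - h
    return imfs, residue

# Demo: a two-tone signal separates into fast and slow oscillatory modes
t = np.linspace(0.0, 1.0, 512)
x = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 40 * t)
imfs, residue = emd(x, t)
```

Each extracted IMF is subtracted from the running residue, so summing all IMFs with the final residue recovers \(x(t)\) exactly, mirroring Eq. (1).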

After the EMD process, the HT of the ith IMF can be calculated as [29]

$$y_{i} (t) = \frac{1}{\pi }P\mathop \smallint \limits_{ - \infty }^{\infty } \frac{{c_{i} (\tau )}}{t - \tau }d\tau ,$$
(2)

where \(P\) is the Cauchy principal value. By arranging \(c_{i} (t)\) and \(y_{i} (t)\) into a complex pair, an analytic signal \(z_{i} (t)\) can be formed as

$$z_{i} (t) = a_{i} (t){\text{e}}^{{j\emptyset_{i} (t)}} = c_{i} (t) + jy_{i} (t) ,$$
(3)

where \(a_{i} (t)\) is defined as the instantaneous amplitude, and \(\emptyset_{i} (t)\) is defined as the instantaneous phase. Hence, the HTs of all the IMFs constitute the HHT spectrum \(H(\omega ,t)\) of the whole signal \(x(t)\), presenting the time–frequency–energy information as a 3D spectrum:

$$H(\omega ,t) = HHT\{ x(t)\} = \mathop \sum \limits_{i = 1}^{n} a_{i} (t)e^{{j\smallint \omega_{i} (t)dt}} ,$$
(4)

where \(\omega_{i} \left( t \right)\) is defined as the instantaneous angular frequency \(d\emptyset_{i} \left( t \right)/dt\), and the residue \(r_{n} \left( t \right)\) is omitted. Finally, the MS, representing the accumulated energy over the entire data span from the contribution of each frequency value, can be defined as:

$$h\left( \omega \right) = \mathop \smallint \limits_{0}^{T} H\left( {\omega ,t} \right)dt .$$
(5)

Moreover, the SE was further used to quantify the degree of signal disorder in the frequency domain [30]; the SE can be evaluated from the normalized powers of the frequency components:

$${\text{SE}} = - \frac{{\mathop \sum \nolimits_{f} (\hat{h}(f)\log_{2} \hat{h}(f))}}{{\log_{2} m}},$$
(6)

where \(\hat{h}(f) = \frac{h(f)}{\sum h(f)}\) is the normalized frequency component, and \(m\) is the number of frequency components. The normalized entropy values lie between 0 (complete regularity) and 1 (maximum irregularity), reflecting the concentration of the frequency distribution [31].
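Eq. (6) translates to a short NumPy routine; the only detail not fixed by the equation is the usual convention that \(0 \cdot \log_2 0 = 0\), handled here by dropping zero components before the sum.

```python
import numpy as np

def spectral_entropy(h):
    """Normalized spectral entropy (Eq. 6) of a non-negative spectrum h(f)."""
    h = np.asarray(h, dtype=float)
    m = len(h)                       # number of frequency components
    p = h / h.sum()                  # normalized components h_hat(f)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p)) / np.log2(m)
```

A flat spectrum gives SE = 1 (maximum irregularity) and a single-line spectrum gives SE = 0 (complete regularity), matching the bounds stated above.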

2.3 Feature Selection and Support Vector Machine

Having extracted the frequency-specific power from HHT and computed the SE, we employed the linear forward selection method to reduce the number of attributes that enter SVM.

The SVM was developed from statistical learning theory for multi-class classification of data sets [32, 33]. A data set is trained to acquire a mathematical model, which is then used to discriminate a test data set. For binary classification, an SVM constructs a hyperplane that optimally separates the data into one of two classes, such that the distance from the hyperplane to the nearest data points on each side is maximized.

Assuming the training data set is linearly separable, a general form of the hyperplane can be defined by \({\mathbf{w}}^{\text{T}} {\mathbf{x}} + {\text{b}} = 0\), where \({\mathbf{w}}\) is the normal vector to the hyperplane and \({\text{b}}\) is the bias term, and a classifier, \(d = {\text{sgn}}({\mathbf{w}}^{\text{T}} {\mathbf{x}} + {\text{b}})\), can be selected. For each data point \({\mathbf{x}}_{i}\), the following condition must be satisfied:

$$d_{i} ({\mathbf{w}}^{\text{T}} {\mathbf{x}}_{i} + {\text{b}}) \ge 1,\quad {\text{for }}\quad 1 \le i \le n .$$
(7)

\({\mathbf{w}}\) and \({\text{b}}\) are then optimized to obtain the optimal separating hyperplane, which maximizes the margin between the two classes [32].

If the training data set is not linearly separable, slack variables \(\xi_{i}\) are introduced to measure the degree of misclassification, and the primal problem is modified to [33]

$$\begin{aligned} & {\text{minimize}}:\quad \left\| {\mathbf{w}} \right\|^{2} /2 + C\sum \xi_{i} \\ & {\text{subject}}\,{\text{to}}:\quad d_{i} \left( {{\mathbf{w}}^{\text{T}} {\varvec{\Phi}}({\mathbf{x}}_{i} ) + {\text{b}}} \right) \ge 1 - \xi_{i} ,\quad {\text{for}}\quad 1 \le i \le n, \\ \end{aligned}$$
(8)

where \(C\) is the regularization parameter, which controls the punishment for misclassified data points and \({\varvec{\Phi}}({\mathbf{x}}_{i} )\) maps \({\mathbf{x}}_{i}\) into a higher dimensional space to make the separation in that space easier. To reduce the computational load, the Representer Theorem [34] shows that \({\mathbf{w}}\) with large dimensionality can be written as a linear combination of the training data, \({\mathbf{w}} = \sum \alpha_{i} d_{i} {\varvec{\Phi}}({\mathbf{x}}_{i} )\). Therefore, we can optimize \(\alpha_{i}\) instead of \({\mathbf{w}}\), and the decision function becomes

$$f\left( {\mathbf{x}} \right) = \sum \alpha_{i} d_{i} K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}} \right) + {\text{b}},$$
(9)

where \(K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}} \right) = {\varvec{\Phi}}({\mathbf{x}}_{i} )^{\text{T}} {\varvec{\Phi}}\left( {\mathbf{x}} \right)\) is the kernel function. The new dual problem is modified to [35]:

$$\begin{aligned} & {\text{maximize}}:\quad \mathop \sum \limits_{i} \alpha_{i} - 1/2 \times \mathop \sum \limits_{jk} \alpha_{j} \alpha_{k} d_{j} d_{k} K\left( {{\mathbf{x}}_{j} ,{\mathbf{x}}_{k} } \right) \\ & {\text{subject}}\,{\text{to}}:\quad 0 \le \alpha_{i} \le C,\quad {\text{and}}\quad \mathop \sum \limits_{i} \alpha_{i} d_{i} = 0. \\ \end{aligned}$$
(10)

In this study, a Gaussian radial basis function (RBF) kernel, \(K({\mathbf{x}},{\mathbf{x}}') = { \exp }( - \gamma \left\| {{\mathbf{x}} - {\mathbf{x}}'} \right\|^{2} )\), was used. Both \(C\) and \(\gamma\) were carefully chosen to obtain optimal results.

A typical LIBSVM [33] procedure involves several steps: (1) inputting the attributes of a data set with pre-assigned class labels, (2) training on the data to build a model, and (3) predicting the classification of a test data set from the model. In the C-support vector classification used in this study, the attribute vectors of attention epochs were labeled as class 1, while those of relaxation epochs were labeled as class − 1. These attribute vectors with the two task labels comprised the input matrix for SVM training. After the models were built, the classification of new epochs could be predicted with them.
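The study used LIBSVM; an equivalent sketch with scikit-learn's `SVC` (whose RBF C-SVC wraps the same formulation) illustrates the label convention and the C/γ grid scan. The feature matrices here are synthetic stand-ins for the 10 attribute-#6 values per epoch, not the real EEG attributes.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the 10 attribute values per 1-s epoch
X_att = rng.normal(1.0, 0.3, size=(200, 10))   # attention epochs -> class +1
X_rel = rng.normal(0.0, 0.3, size=(200, 10))   # relaxation epochs -> class -1
X = np.vstack([X_att, X_rel])
y = np.r_[np.ones(200), -np.ones(200)]

# Scan C and gamma over a grid with cross-validation
# (cf. the accuracy contour of Fig. 6)
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=10)
grid.fit(X, y)
model = grid.best_estimator_       # used to predict the class of new epochs
```

`model.predict` then assigns new epochs to class +1 (attention) or −1 (relaxation), mirroring step (3) of the LIBSVM procedure.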

3 Results and Discussion

Figure 3 presents the full 3D HHT spectrum of the signal in Fig. 1, providing the time–frequency–energy distribution of the continuous data. In this trial, the participant paid attention for 300 s and then relaxed for another 300 s. The high energy located in the low-frequency band may include artifacts such as blinks. Notably, the energy at high frequencies in the red block is significantly higher than that in the green block, indicating that the α and β waves indeed differ between the attention and relaxation tasks (T = 4.4049, p < 0.05).

Fig. 3
figure 3

The HHT spectrum of the representative participant while conducting two tasks. The colorbar shows the magnitude of the energy distribution. The conducted task was switched at the 300th second. The time–frequency-energy information of attentive and relaxed tasks are illustrated in red and green blocks, respectively

The raw signals and IMFs of an attention epoch and a relaxation epoch are presented in Fig. 4a, b, respectively. Eight IMFs and a residual trend were extracted from each epoch by EMD. After performing the Hilbert transform, we found that IMF2, IMF3, IMF4, and IMF5 contained the power within the desired frequency range (8–30 Hz), whereas IMF1 and IMF6–IMF8 contained no statistically significant power in the frequencies of interest (p < 0.05). The marginal spectrum of an IMF is the time integration of its 3D spectrum and describes the distribution of the power contained in the IMF as a function of frequency. Figure 5 shows the marginal spectra of IMF2 to IMF5, which were used to derive the frequency features.

Fig. 4
figure 4

a IMFs of an EEG epoch when the representative participant was paying attention to solve a puzzle. b IMFs of an EEG epoch when the representative participant had been instructed to relax and take a rest

Fig. 5
figure 5

Marginal spectra of IMF2–5 when the representative participant was a paying attention and b taking a rest

Table 1 lists the attention assessment results using different attribute vectors of the representative participant. Regarding the impact of the attributes on accuracy, feature set #6, comprising the \(\alpha\) and \(\beta\) powers estimated from IMF2 to IMF5 together with the SE, obtained the best classification result, with an accuracy of 93.25%. Figure 6 presents an exemplary SVM parameter selection using attribute set #6. The best pair of \(C\) and \(\gamma\) was chosen for building the SVM predictive model according to the scanned contour of the accuracy rate. The frequency-specific power attributes of the original EEG signal, {\(\alpha\)-original, \(\beta\)-original} (#1), resulted in 85.25% accuracy, suggesting an essential role in reflecting attentiveness. The finding that the {SE} feature (#4) alone achieved a prediction accuracy of 86.25% suggests that the nonlinear characteristics of brain dynamics are important indicators of mental state. Table 1 also indicates that using more features as attributes did not yield better accuracy. The feature selection results of machine learning further confirmed that IMF1 and IMF6–IMF8 were not important features for classification. In other words, the most important EEG features reflecting the state of attentiveness in this study were those of attribute set #6: α-IMF2, α-IMF3, α-IMF4, α-IMF5, β-IMF2, β-IMF3, β-IMF4, β-IMF5, α-SE, and β-SE.

Table 1 The impact of different attribute vectors on the accuracy of SVM models of a representative participant
Fig. 6
figure 6

The contour plot of SVM parameter selection using attribute #6

Having established the individualized SVM predictive models, we tested each model's ability to classify the attentiveness state of the corresponding participant, as listed in Table 2. The mean self-accuracy of 84.80% with tenfold cross-validation shows that the selected features offer efficient differentiation for the assessment of attentiveness. To validate the individualized models, an additional 50 attention and 50 relaxation epochs of each participant were treated as independent test data. As listed in Table 3, the model predictions for the training data and the test data are in good agreement.

Table 2 The accuracy of SVM models of 20 participants using attributes #6
Table 3 The accuracy of predictions with independent test data

To benchmark the proposed method, which generates time–frequency-domain and nonlinear features, we also implemented two other analyses: the approximate entropy (ApEn) of the original EEG signals as a time-domain, nonlinear feature, and the power spectral density of the original EEG signals estimated by the fast Fourier transform (FFT) as a frequency-domain, linear feature. These features were respectively input into the SVM as attributes to build predictive models for comparison. Figure 7 presents boxplots for the three methods, and a one-way ANOVA indicates statistically significant differences among their results (p < 0.01). The SVM model of ApEn had a mean accuracy of 73.94%, with the smallest standard deviation of the three. The SVM model of FFT had a mean accuracy of 77.80%; however, its large standard deviation indicates considerable between-subject variability in the FFT features. According to these results, the proposed method of SVM combined with HHT best discriminates between the attentive and relaxed states.
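For reference, the ApEn comparison feature can be computed with a short NumPy routine. The sketch below follows the standard Pincus formulation with the common parameter choices m = 2 and r = 0.2·SD; the exact implementation and parameters used in the study may differ.

```python
import numpy as np

def approx_entropy(x, m=2, r=None):
    """Approximate entropy ApEn(m, r), standard Pincus formulation."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()                  # common tolerance: 20% of the SD

    def phi(k):
        n = len(x) - k + 1
        emb = np.array([x[i:i + k] for i in range(n)])      # embedded vectors
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        counts = (dist <= r).mean(axis=1)                   # self-matches included
        return np.mean(np.log(counts))

    return phi(m) - phi(m + 1)

# A regular signal should score lower than an irregular one
t = np.linspace(0.0, 4 * np.pi, 200)
apen_sine = approx_entropy(np.sin(t))
apen_noise = approx_entropy(np.random.default_rng(1).standard_normal(200))
```

Low ApEn values indicate regularity (e.g., a sine wave), while higher values indicate irregular dynamics (e.g., white noise), which is what makes ApEn a useful nonlinear descriptor of EEG epochs.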

Fig. 7
figure 7

The boxplots of the accuracy of SVM models using HHT marginal power density, approximate entropy, and Fourier power density as the attributes

Moreover, this method used one-channel data to build the predictive models and achieved an accuracy of up to 93.50%, with an average accuracy of 84.80% across the twenty participants. EEG recording with more channels may help improve the system; nevertheless, a single-channel EEG monitor is inexpensive, convenient, and portable for the public. Compared with similar studies of attentiveness recognition, including FFT + SVM [6] (highest accuracy of 76.82%, average accuracy of 75.87%) and HHT + ELM [21] (highest accuracy of 85.50%, average accuracy of 72.10%), our method performs better (highest accuracy of 93.50%, average accuracy of 84.80%). Our results suggest that using the nonlinear HHT method instead of the FFT and selecting appropriate features can improve the accuracy of attention recognition. We believe that this convenient method has the potential to be used in clinical settings for the detection of attention deficit hyperactivity disorder (ADHD), or even to serve as part of a biofeedback training system for people who have difficulty paying attention.

4 Conclusion

A method for the feature extraction and characterization of EEG signals using HHT frequency analysis and an SVM has been presented. Raw EEG data were analyzed by HHT to obtain marginal spectra carrying nonlinear and nonstationary frequency information. The α and β band powers of IMF2–IMF5 and their SEs were selected as SVM attributes, yielding a mean accuracy of 84.80%. We conclude that the proposed method offers efficient differentiation for the assessment of attentiveness, showing promise in applications such as attention deficit detection and biofeedback training.