1 Introduction

The Brain–Computer Interface (BCI) is an innovative technology which facilitates direct communication between the brain and computers or external devices. This technology has immense potential to improve the interaction capabilities of individuals with disabilities, as well as applications in industrial production and aerospace engineering [1, 2]. Traditional BCI systems use scalp electrodes based on the international 10–20 standard for recording brain signals. However, this on-scalp EEG recording approach requires stable attachment using caps, headsets, or adhesives, leading to discomfort and obtrusiveness for users. To address these issues, ‘ear-EEG’ was proposed, which involves placing electrodes around the ear, significantly enhancing invisibility, mobility, and comfort for the wearer, and offering a less intrusive experience than full-scalp EEG. In BCI research, there has been an increasing focus on steady-state visual evoked potentials (SSVEP) due to their superior signal-to-noise ratio (SNR) and the elimination of extensive training requirements [3,4,5,6,7,8,9].

In SSVEP-based BCI research, filter bank canonical correlation analysis (FBCCA) is considered more effective than the minimum energy combination (MEC) [10, 11] and power spectrum density analysis (PSDA) [12], especially in online BCI research [13,14,15,16]. However, the lower SNR of ear-EEG could potentially undermine the efficacy of CCA-based techniques [17, 18]. While CCA has several advantages, its linear nature limits its ability to capture complex, nonlinear relationships between input and output variables. This limitation has spurred the development of deep learning-based techniques for EEG signal processing, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers [19,20,21,22,23]. Although Transformer architectures are compelling, their heavy computational cost and need for large datasets limit their adoption. As an alternative, the RandOm Convolutional KErnel Transform (ROCKET) was proposed [24]. ROCKET, which employs random convolutional kernels for feature expansion, has demonstrated high classification accuracy and robustness on the univariate time series classification datasets of the UCR archive.

However, due to the low signal-to-noise ratio of ear-EEG signals, these methods struggle to identify feature components in ear-EEG effectively. In this study, we propose a novel feature extraction technique based on the ROCKET method integrated with the Morlet wavelet transform (Morlet-ROCKET) for ear-EEG analysis. Compared to the traditional ROCKET method, Morlet-ROCKET shows a significant improvement in recognizing evoked potentials in ear-EEG processing. We also compare the Morlet-ROCKET method with existing methods such as FBCCA and Transformers to demonstrate its effectiveness. Because the Morlet-ROCKET feature extractor is fixed after initialization, it is more suitable for real-time processing and has lower computational complexity than deep learning-based models while maintaining high classification accuracy.

2 Experiments and data preprocessing

This study was approved by the Toyama Prefectural University Ethics Committee for Research Involving Human Subjects (IRB number H31-9). Fifteen participants (eleven males and four females; average age \(21.9 \pm 0.81\) years) took part in the ear-EEG-based SSVEP experiments. Data were acquired with a DC digital EEG system (BIO-NVX 52) using passive Ag/AgCl electrodes. Two trials were conducted for each participant. The experiments were conducted in an EMC-shielded space to minimize noise.

Ear-EEG data were recorded from electrodes placed around the right ear (R1-R8), with AFz on the forehead as the ground reference (Fig. 1). The sampling rate was 2000 Hz. Participants sat 50 cm from the LCD screen [25] and fixated on a 17 cm square stimulus, which subtended a visual angle of 19.3\(^{\circ }\) and flashed in alternating black and white.
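As a quick consistency check of the stated geometry, the visual angle subtended by a stimulus of side s viewed from distance d can be computed as follows. The symbols \(\theta\), s, and d are introduced here for illustration only:

$$\begin{aligned} \theta = 2\arctan \left( \frac{s}{2d}\right) = 2\arctan \left( \frac{17\ \text {cm}}{2\times 50\ \text {cm}}\right) \approx 19.3^{\circ } \end{aligned}$$

This agrees with the reported 17 cm stimulus, 50 cm viewing distance, and 19.3\(^{\circ }\) visual angle.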

Fig. 1

Left: Placement of ear-EEG electrodes. Right: The experimental arrangement

Stimulus frequencies (5 Hz, 7 Hz, 9 Hz, and 11 Hz) were presented sequentially, and each stimulus lasted 12 s. Rest intervals of 60 s were included to prevent fatigue (Fig. 2). Participants were advised to minimize blinking and were allowed sufficient eye rest.

Fig. 2

Process of the SSVEP experiment

Ear-EEG signals were band-pass filtered (4 Hz to 60 Hz) for artifact reduction. The initial 2 s and final 1 s of each recording were discarded to ensure signal stability. The analysis was conducted on segmented data. The data were standardized across trials for all four stimulus frequencies to prepare for 4-class classification using ROCKET models. Each data window was standardized before further processing with FBCCA or the Morlet wavelet transform.
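The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the filter type (a 4th-order zero-phase Butterworth), the function name `preprocess`, and the per-channel z-scoring are assumptions, since the text specifies only the pass band, the trimmed durations, and that the data were standardized.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 2000  # sampling rate in Hz, as stated in the text


def preprocess(raw, fs=FS, band=(4.0, 60.0), trim_start=2.0, trim_end=1.0, order=4):
    """Band-pass filter, trim the unstable edges, and z-score each channel.

    raw: array of shape (channels, samples) for one 12 s stimulus.
    The Butterworth design and its order are assumptions.
    """
    sos = butter(order, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, raw, axis=-1)  # zero-phase filtering

    # Discard the first 2 s and the last 1 s of the recording.
    start = int(trim_start * fs)
    stop = filtered.shape[-1] - int(trim_end * fs)
    trimmed = filtered[:, start:stop]

    # Standardize each channel to zero mean and unit variance.
    mean = trimmed.mean(axis=-1, keepdims=True)
    std = trimmed.std(axis=-1, keepdims=True)
    return (trimmed - mean) / std


# Synthetic stand-in for one 12 s, eight-channel recording.
rng = np.random.default_rng(0)
raw = rng.standard_normal((8, 12 * FS))
clean = preprocess(raw)
```

With a 12 s stimulus, trimming leaves 9 s of data per stimulus, which is the material later cut into time windows.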

In this study, Leave-One-Out Cross-Validation was employed to evaluate the performance of the proposed model. This method uses a single observation from the original sample as the validation data and the remaining observations as the training data, and repeats the process so that each observation is used exactly once for validation. The dataset was partitioned into training and testing sets prior to further analysis. After the split, each participant’s data was segmented into non-overlapping time windows of 1 s, 2 s, 3 s, and 4 s to analyze the temporal dynamics of the data and assess the model’s performance across time scales. For the statistical comparison of classification accuracies across these time windows, we used the Games-Howell post hoc test, which is suited to situations where the assumption of homogeneity of variances is violated because it requires neither equal variances nor equal sample sizes.
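The windowing and the leave-one-out split can be sketched as below. The helper names `segment` and `leave_one_out` are hypothetical, and treating one participant as the held-out unit is an assumption; the text does not state explicitly which unit is left out.

```python
import numpy as np


def segment(trial, fs, win_s):
    """Cut a (channels, samples) array into non-overlapping windows.

    Returns an array of shape (n_windows, channels, window_length);
    samples that do not fill a whole window are dropped.
    """
    win_len = int(win_s * fs)
    n_windows = trial.shape[-1] // win_len
    cut = trial[:, : n_windows * win_len]
    return cut.reshape(trial.shape[0], n_windows, win_len).transpose(1, 0, 2)


def leave_one_out(group_ids):
    """Yield (train_idx, test_idx), holding out one group per fold."""
    group_ids = np.asarray(group_ids)
    for held_out in np.unique(group_ids):
        yield np.flatnonzero(group_ids != held_out), np.flatnonzero(group_ids == held_out)


# Toy example: 9 s of eight-channel data at a reduced rate of 100 Hz.
trial = np.arange(8 * 900, dtype=float).reshape(8, 900)
windows = segment(trial, fs=100, win_s=2)

# Five windows belonging to three (hypothetical) participants.
folds = list(leave_one_out([0, 0, 1, 1, 2]))
```

Splitting before segmenting, as the text describes, keeps windows cut from the same recording from appearing in both the training and the testing set.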

3 Methodology

3.1 FBCCA

The standard CCA method identifies the target frequency from the canonical correlation values. Consider \(\textbf{X}\in \mathbb {R}^{N\times d_1}\) and \(\textbf{Y}\in \mathbb {R}^{N\times d_2}\), where N is the number of observations and \(d_1\) and \(d_2\) are the dimensions of each observation. CCA operates on two multidimensional variables: it finds a pair of linear combinations \({\omega }_x\in \mathbb {R}^{1\times d_1}, {\omega }_y\in \mathbb {R}^{1\times d_2}\) that maximize the correlation between \({\omega }_x \textbf{X}^T\) and \({\omega }_y \textbf{Y}^T\), called canonical variates. Mathematically, this relationship can be expressed as follows:

$$\begin{aligned} \rho ({\omega }_x \textbf{X}^T, {\omega }_y \textbf{Y}^T) =\max _{{\omega }_x, {\omega }_y} \frac{E\left[ {\omega }_x\textbf{X}^T\textbf{Y} {\omega }_y^T\right] }{\sqrt{E\left[ {\omega }_x \textbf{X}^T\textbf{X}{\omega }_x^T\right] E \left[ {\omega }_y\textbf{Y}^T\textbf{Y} {\omega }_y^T\right] }} \end{aligned}$$
(1)

In this context, EEG data typically constitute \(\textbf{X}\), while \(\textbf{Y}\) is represented by reference sinusoidal signals and their harmonics. For example, for a stimulus frequency of 4 Hz, the reference signals would include frequencies of 4 Hz and its harmonics (8 Hz, 12 Hz,...) up to the Nyquist frequency.
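A minimal sketch of CCA-based frequency recognition is given below. It computes the largest canonical correlation through a QR decomposition followed by an SVD, which is one standard way to solve Eq. (1) when both matrices have full column rank. The function names and the choice of three harmonics are assumptions made for illustration (the text describes harmonics up to the Nyquist frequency), and the data are synthetic.

```python
import numpy as np


def reference_signals(freq, n_harmonics, fs, n_samples):
    """Sine/cosine references at freq and its harmonics, shape (N, 2 * n_harmonics)."""
    t = np.arange(n_samples) / fs
    columns = []
    for h in range(1, n_harmonics + 1):
        columns.append(np.sin(2 * np.pi * h * freq * t))
        columns.append(np.cos(2 * np.pi * h * freq * t))
    return np.column_stack(columns)


def max_canonical_corr(X, Y):
    """Largest canonical correlation between X (N, d1) and Y (N, d2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(X)  # orthonormal basis of the column space of X
    qy, _ = np.linalg.qr(Y)
    singular_values = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(singular_values[0], 0.0, 1.0))


def classify_cca(X, freqs, fs, n_harmonics=3):
    """Return the candidate frequency with the highest canonical correlation."""
    scores = [
        max_canonical_corr(X, reference_signals(f, n_harmonics, fs, X.shape[0]))
        for f in freqs
    ]
    return freqs[int(np.argmax(scores))], scores


# Synthetic 2 s, eight-channel window containing a noisy 9 Hz response.
rng = np.random.default_rng(0)
fs = 250
t = np.arange(2 * fs) / fs
phases = rng.uniform(0, 2 * np.pi, size=8)
X = np.column_stack([np.sin(2 * np.pi * 9 * t + p) for p in phases])
X = X + 0.5 * rng.standard_normal(X.shape)

best_freq, scores = classify_cca(X, [5, 7, 9, 11], fs)
```

In this toy case the 9 Hz references obtain the highest score, so `best_freq` is 9.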

FBCCA is a CCA-based method designed for SSVEP classification tasks. It uses a series of band-pass filters to decompose the EEG signal into several sub-band components and applies CCA to each sub-band component [26]. This enhances the extraction and use of harmonics within the EEG signal, thereby improving performance over standard CCA. In this study, 4 filter banks were implemented, each defined by a distinct band-pass filter with a fixed upper frequency limit of 80 Hz and with lower limits evenly distributed to cover the full range of SSVEP frequency bands.
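The text does not state how the sub-band correlations are combined. As a hedged illustration, a rule commonly used in the FBCCA literature forms, for each candidate frequency \(f_k\), a weighted sum of the squared sub-band correlations. The symbols \(\rho _k^{(n)}\) (the canonical correlation of sub-band n with the references for \(f_k\)), \(w(n)\), a, and b are assumptions introduced here, not values reported in this study:

$$\begin{aligned} \tilde{\rho }_k = \sum _{n=1}^{4} w(n)\left( \rho _k^{(n)}\right) ^2, \qquad w(n) = n^{-a} + b \end{aligned}$$

Under this rule, the predicted frequency is the \(f_k\) with the largest \(\tilde{\rho }_k\), and the weights decrease with n because higher sub-bands carry weaker harmonic components.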

3.2 Morlet wavelet

The Morlet wavelet is a time-frequency analysis tool that originated in the early 1980s. It is designed for the analysis of non-stationary signals and is a pivotal tool in signal processing [27]. The Morlet wavelet is mathematically defined as the element-wise product of a complex sine wave and a Gaussian function. The key parameter is the full width at half maximum (FWHM) of the Gaussian function. The following formulation can be used to attain optimal smoothing performance in both the temporal and spectral domains [28]:

$$\begin{aligned} \omega = \exp \left( 2i\pi ft-\frac{4\ln (2)t^2}{h^2}\right) \end{aligned}$$
(2)

where t corresponds to time, f denotes frequency, and h represents the FWHM in seconds.

We used 2 Hz to 60 Hz as the frequency range for the Morlet wavelet. The Morlet wavelet transform was applied to each time window of eight-channel ear-EEG data, converting the shape of the data from (samples, channels, times) to (samples, frequencies, times). Finally, the output of the Morlet wavelet transform was used as the input to the ROCKET model.
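Eq. (2) and its use as a time-frequency transform can be sketched for a single channel as follows. The function names, the wavelet support of four FWHMs, and the use of squared magnitude (power) as the output are assumptions. How the eight channels are combined into one frequency axis is not specified in the text and is left out of this sketch.

```python
import numpy as np


def morlet_wavelet(freq, fwhm, fs, duration=None):
    """Complex Morlet wavelet of Eq. (2): exp(2i*pi*f*t - 4*ln(2)*t^2 / h^2)."""
    if duration is None:
        duration = 4 * fwhm  # assumption: long enough for the Gaussian to decay
    half = int(round(duration * fs / 2))
    t = np.arange(-half, half + 1) / fs  # symmetric time axis centred on 0
    return np.exp(2j * np.pi * freq * t - 4 * np.log(2) * t**2 / fwhm**2)


def morlet_power(signal, freqs, fwhm, fs):
    """Map a 1-D signal of length T to a (frequencies, T) power map."""
    n = len(signal)
    power = np.empty((len(freqs), n))
    for i, f in enumerate(freqs):
        wavelet = morlet_wavelet(f, fwhm, fs)
        half = (len(wavelet) - 1) // 2
        # Full convolution, then keep the part aligned with the input samples.
        analytic = np.convolve(signal, wavelet, mode="full")[half : half + n]
        power[i] = np.abs(analytic) ** 2
    return power


# Toy example: a 2 s, 9 Hz sinusoid sampled at 200 Hz.
fs = 200
t = np.arange(2 * fs) / fs
sig = np.sin(2 * np.pi * 9 * t)
freqs = [5, 7, 9, 11]
tf_map = morlet_power(sig, freqs, fwhm=0.5, fs=fs)
peak_freq = freqs[int(np.argmax(tf_map.mean(axis=1)))]
```

The Gaussian envelope equals 0.5 at \(t = \pm h/2\), which is what makes h the full width at half maximum in the time domain.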

3.3 Transformer architecture

The Transformer model is known for its effectiveness in various sequence-to-sequence tasks and is built on an architecture that integrates an encoder with a decoder. However, only the encoder component is used in time-series classification tasks such as SSVEP signal categorization. Each encoder block in this architecture is designed to transform input data through a series of operations that capture the intricacies of sequential information.

The core of the Transformer encoder is the multi-head attention mechanism, which allows the model to focus on different parts of the input sequence when processing a particular element. Mathematically, the attention computed in each head can be described by the following equation:

$$\begin{aligned} \text {attention}\left( \textbf{Q}, \textbf{K}, \textbf{V} \right) =\text {softmax}\left( \frac{\textbf{Q}\textbf{K}^T}{\sqrt{D_k}}\right) \textbf{V}, \end{aligned}$$
(3)

where the queries are \(\textbf{Q}\in \mathbb {R}^{N\times D_k}\), the keys are \(\textbf{K}\in \mathbb {R}^{M\times D_k}\), and the values are \(\textbf{V}\in \mathbb {R}^{M\times D_v}\). N and M denote the lengths of the queries and keys, and \(D_k\) and \(D_v\) the dimensions of the keys and values, respectively. Multi-head attention applies this operation in parallel across H heads.
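Eq. (3) can be written out for a single head in a few lines. This is an illustrative sketch only; the function names are hypothetical, and the learned projections and the combination of multiple heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q: (N, D_k), K: (M, D_k), V: (M, D_v) -> output of shape (N, D_v).
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N, M), rows sum to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 2))
out, weights = attention(Q, K, V)
```

Each output row is a weighted average of the rows of V, with the weights determined by how closely the corresponding query matches each key.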

Our Transformer encoder stack comprises three such encoders in series. The output from the final encoder is passed through a fully connected layer with 256 units, followed by a dropout layer with a 0.5 rate to prevent overfitting. The dimensionality of all model components was set to 256 to maintain consistency. Additional batch normalization and a dropout layer at a rate of 0.2 were implemented for regularization (Fig. 3).

Fig. 3

Detailed structure of the Transformer encoder used in the study

3.4 Multi-channel ROCKET

ROCKET is a feature extraction method that employs random convolutional kernels to transform time-series data and derives features such as the maximum value and the proportion of positive values (PPV) from the output of each kernel. Unlike typical deep neural networks (DNNs), which require back-propagation to adjust the layer weights, ROCKET demonstrated the efficacy of using a vast number of random convolutional kernels to capture features pertinent to time-series classification, leading to a substantial reduction in training time. An additional benefit of ROCKET is its significantly smaller number of hyper-parameters compared to a DNN, which not only minimizes the time-consuming and laborious task of fine-tuning but also renders the model more accessible and robust. Originally designed for single-channel time series data, ROCKET faces limitations when applied to multichannel datasets. To address this, we re-engineered the generation of random CNN kernels. The first step generates random CNN kernels for a single channel. These kernels are then broadcast to all channels. Finally, the channel axis is averaged to accommodate multichannel data. The detailed algorithmic structure of this Multi-channel ROCKET model is outlined in Algorithm 1.

Algorithm 1

Multi-Channels RandOm Convolutional KErnel Transform
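The three steps described above (single-channel random kernels, broadcasting to all channels, averaging over the channel axis) can be sketched as follows. This is an interpretation, not a reproduction of Algorithm 1: the kernel sampling follows the original ROCKET design (lengths 7, 9, or 11, mean-centred Gaussian weights, uniform bias, exponentially sampled dilation), padding is omitted for brevity, and averaging the convolution outputs before pooling is an assumption about where the channel average is taken. All function names are hypothetical.

```python
import numpy as np


def make_kernels(n_kernels, series_len, rng):
    """Sample random single-channel kernels as (weights, bias, dilation) tuples."""
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice([7, 9, 11]))
        weights = rng.standard_normal(length)
        weights -= weights.mean()  # mean-centre the weights
        bias = rng.uniform(-1.0, 1.0)
        max_exponent = np.log2((series_len - 1) / (length - 1))
        dilation = int(2 ** rng.uniform(0.0, max_exponent))
        kernels.append((weights, bias, dilation))
    return kernels


def apply_kernel(x, weights, bias, dilation):
    """Dilated 1-D convolution without padding."""
    span = (len(weights) - 1) * dilation
    n_out = len(x) - span
    out = np.full(n_out, bias)
    for j, w in enumerate(weights):
        out += w * x[j * dilation : j * dilation + n_out]
    return out


def transform(X, kernels):
    """X: (samples, channels, times) -> features of shape (samples, 2 * n_kernels)."""
    features = np.empty((X.shape[0], 2 * len(kernels)))
    for i, sample in enumerate(X):
        for k, (weights, bias, dilation) in enumerate(kernels):
            # The same kernel is applied to every channel (broadcast step).
            maps = np.stack([apply_kernel(ch, weights, bias, dilation) for ch in sample])
            averaged = maps.mean(axis=0)  # average over the channel axis
            features[i, 2 * k] = averaged.max()  # maximum value
            features[i, 2 * k + 1] = (averaged > 0).mean()  # PPV
    return features


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8, 64))  # toy data: 5 samples, 8 channels, 64 time points
kernels = make_kernels(20, series_len=64, rng=rng)
features = transform(X, kernels)
```

Because each kernel contributes two features, the 10,000 kernels used in this study correspond to 20,000 features per sample. The kernels are never trained, so the transform is fixed once they have been sampled.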

We employed the ridge classifier, an extension of ridge regression, because of its efficacy in handling multicollinearity and high-dimensional feature spaces. The classifier was trained on ROCKET-transformed SSVEP signals to perform 4-class classification of ear-EEG data. Ridge regression is a widely adopted approach to parameter estimation in multiple linear regression because it mitigates the collinearity that often affects this type of analysis: it minimizes the sum of the squared regression error and a penalty on the magnitude of the weights (see Eq. (4)) [29]:

$$\begin{aligned} \min _{\textbf{w}} || \textbf{X} \textbf{w} - y||_2^2 + \alpha ||\textbf{w}||_2^2, \end{aligned}$$
(4)

where \(\textbf{X}\) is the design matrix, y is the target, \(\textbf{w}\) is the coefficient vector, and \(\alpha\) is the regularization strength. For binary classification, the classifier follows a two-step process. First, it maps the binary targets to the set \(\{-1, 1\}\). Then, it treats the classification problem as a regression task with the optimization objective in Eq. (4). The predicted class is determined by the sign of the regressor’s prediction. Multi-class classification is treated as a multi-output regression task, and the classifier selects the class whose output has the highest value.
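The multi-output formulation can be sketched with the closed-form ridge solution. This is a minimal illustration under stated assumptions: the function names are hypothetical, an unpenalized intercept is added by centring the data (Eq. (4) itself has no intercept term), and the data are synthetic clusters rather than ROCKET features.

```python
import numpy as np


def fit_ridge_classifier(X, labels, alpha=1.0):
    """Fit one ridge regressor per class on targets in {-1, +1}."""
    classes = np.unique(labels)
    Y = np.where(labels[:, None] == classes[None, :], 1.0, -1.0)  # (n, n_classes)

    # Centre the data so that the intercept is not penalized (an assumption).
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean

    # Closed-form minimizer of ||Xw - y||^2 + alpha * ||w||^2 for every output.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    intercept = y_mean - x_mean @ W
    return classes, W, intercept


def predict(model, X):
    """Select the class whose regression output is largest."""
    classes, W, intercept = model
    return classes[np.argmax(X @ W + intercept, axis=1)]


# Toy 4-class problem with labels named after the stimulus frequencies.
rng = np.random.default_rng(0)
class_index = np.repeat(np.arange(4), 25)
labels = np.array([5, 7, 9, 11])[class_index]
centers = 5.0 * np.eye(4, 6)
X = centers[class_index] + 0.3 * rng.standard_normal((100, 6))

model = fit_ridge_classifier(X, labels, alpha=1.0)
predictions = predict(model, X)
accuracy = (predictions == labels).mean()
```

When the number of features exceeds the number of samples, as with 20,000 ROCKET features, the penalty term keeps the linear system well conditioned.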

This study transformed SSVEP signals using ROCKET with 10,000 convolutional kernels to extract 20,000 features. Ridge classifier [30] is well suited for situations in which the number of features is larger than the number of samples. It was used to classify the transformed SSVEP signals of ear-EEG.

In summary, this study conducted a comparative analysis of three distinct computational approaches applied to the same ear-EEG-based SSVEP data. The methods evaluated were the established CCA-based technique, the deep learning-based Transformer, and the ‘semi-machine’ learning method known as ROCKET. Each of these methods was applied to preprocessed ear-EEG recordings that had been filtered and standardized. Specifically, the FBCCA method used 8-channel SSVEP data subjected to these preprocessing steps. In contrast, the data for the Transformer and ROCKET methods were further transformed through Morlet wavelet decomposition to enhance feature extraction (Fig. 4).

Fig. 4

Data processing workflow of this research

4 Results

4.1 SSVEP based on ear-EEG

Figure 5 shows the ear-EEG SSVEP components elicited by stimuli at different frequencies and analyzed via the Morlet wavelet transform. Notably, for the 5 Hz stimulus, the harmonic response at 15 Hz is more prominent than the fundamental 5 Hz response. For the 7 Hz stimulus, the fundamental and the 21 Hz harmonic exhibit comparable intensity. For the 9 Hz and 11 Hz stimuli, the fundamental components at 9 Hz and 11 Hz are more prominent than the harmonics.

Fig. 5

Morlet wavelet-transformed ear-EEG data under 5 Hz, 7 Hz, 9 Hz, and 11 Hz stimuli

4.2 Classification results of FBCCA, Transformer and Morlet-ROCKET

When ROCKET is used to analyze ear-EEG, the wavelet transform is an indispensable processing step. The value of the FWHM (denoted as h in Eq. (2)) strongly affects the result of the Morlet wavelet transform, and its appropriate value is highly task-dependent. Figure 6 shows the distribution of accuracy of the ROCKET model with different FWHMs.

Fig. 6

Influence of FWHM on Morlet wavelet

We found that the model performed best when \(\text {FWHM} = 0.75 \times \text {(time-window length)}\). Across the different time windows, the classification accuracy for the target frequencies did not increase further once h exceeded 0.75 of the window length. The time-frequency analysis of ear-EEG revealed both the target frequency components and their harmonic components. Nonetheless, noticeable noise components were present in the data.

To verify the necessity of the Morlet wavelet transform, we applied the original data to the ROCKET model (Fig. 7). In Fig. 7, EEG data without the Morlet wavelet transform yield very low classification accuracy (orange) in the ROCKET model. The substantial discrepancy between the original and transformed data suggests that the Morlet wavelet transform plays a critical role as the preprocessing step of the Morlet-ROCKET model. The accuracies of target frequency classification based on the transformed data were higher than those based on the original data in the different time windows, and statistical analysis demonstrated significant differences between them.

Fig. 7

Performance difference between original and transformed data

This study used ear-EEG-based time-frequency data as input to the Morlet-ROCKET models, which were evaluated with Leave-One-Out cross-validation (see Fig. 8). Fig. 8 also presents the results of a Games-Howell test comparing the methods within each time window. These results indicate that Morlet-ROCKET outperforms FBCCA and Transformer in specific time windows, with significant accuracy improvements in the 1 s, 3 s, and 4 s windows.

Fig. 8

Accuracy comparison between FBCCA, Transformer, and Morlet-ROCKET methods. Significance levels: ns (not significant), *(\(p<0.05\)), **(\(p<0.01\)), ***(\(p<0.001\)), ****(\(p<0.0001\)), respectively at the 95% confidence level (\(\alpha = 0.05\))

4.3 Performance of Morlet-ROCKET model

The confusion matrices in Fig. 9 reveal a direct correlation between the length of the time window and classification accuracy, with the latter increasing as the former extends. This trend is further substantiated by the ROC curves in Fig. 10, comparing the true positive rate against the false positive rate for the Morlet-ROCKET and Transformer models across different time windows. The Morlet-ROCKET model consistently demonstrates superior performance, as evidenced by the area under the curve (AUC) values, reinforcing its efficacy in SSVEP signal classification.

Fig. 9

Confusion matrices for the Morlet-ROCKET model across varied time windows

Fig. 10

ROC analysis for the Morlet-ROCKET model compared to the Transformer across different time windows

The comprehensive evaluation presented herein not only corroborates the superiority of the Morlet-ROCKET model over traditional FBCCA and emerging Transformer methods, but also highlights the crucial role of wavelet preprocessing in SSVEP signal classification. The methodological advancements introduced by this study could pave the way for more robust and accurate BCI systems.

5 Discussion

Based on the results in Fig. 5, visual stimulation frequencies such as 5 Hz, 7 Hz, 9 Hz, and 11 Hz, which are commonly used in traditional SSVEP experiments with scalp electrodes, can still be detected from electrodes around the ears. This evidence supports the feasibility of simplifying BCI systems by relocating electrodes to the periphery of the ears, which could potentially enhance user comfort and system practicality.

As indicated in Fig. 6, model accuracy enhancement with increased FWHM highlights the SSVEP signal’s dependency on frequency resolution over temporal precision. This observation aligns with the frequency-domain characteristics of SSVEP signals and merits further investigation into the optimal balance between frequency resolution and temporal accuracy in SSVEP-based BCIs.
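The trade-off between frequency resolution and temporal precision can be made concrete with a short derivation, offered here as a sketch. The Fourier transform of the Gaussian envelope in Eq. (2) is again a Gaussian, centred on f, whose width is inversely proportional to h. The symbols \(\nu\) (frequency variable) and \(\Delta f\) (FWHM of the amplitude spectrum) are introduced here for illustration:

$$\begin{aligned} \exp \left( -\frac{4\ln (2)t^2}{h^2}\right) \;\longrightarrow \; \exp \left( -\frac{\pi ^2h^2(\nu -f)^2}{4\ln (2)}\right) , \qquad \Delta f = \frac{4\ln (2)}{\pi h}\approx \frac{0.88}{h} \end{aligned}$$

With the FWHM set to 0.75 of the window length, a 1 s window (h = 0.75 s) gives \(\Delta f\approx 1.18\) Hz, whereas a 4 s window (h = 3 s) gives \(\Delta f\approx 0.29\) Hz. A larger FWHM therefore separates the 5 Hz, 7 Hz, 9 Hz, and 11 Hz components more sharply, at the cost of temporal precision.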

The comparative analysis depicted in Fig. 7 clearly demonstrates a significant enhancement in performance when the ROCKET framework utilizes data transformed via Morlet wavelet, compared to its use of original, untransformed data. EEG data are inherently characterized by their high levels of noise and non-stationary behavior, which lead to a decrease in ROCKET’s performance as the length of the time window extends. The adoption of wavelet transforms acts as an effective strategy for diminishing noise and spotlighting signal characteristics that are more stable over time. This technique helps in preserving crucial information that might otherwise be lost when analysis is conducted solely with raw data. Such information loss is particularly evident in longer time windows due to an increase in signal variability. By integrating wavelet transforms, we can mitigate these issues, thereby enhancing the ROCKET model’s ability to analyze EEG data across various time spans.

Figure 8 demonstrates the Morlet-ROCKET model’s superior accuracy over conventional FBCCA and Transformer approaches, particularly in the 1 s, 3 s, and 4 s time windows. The prerequisite for such enhanced accuracy is the preprocessing of EEG data via Morlet wavelet transform, suggesting that raw EEG data may be ill-suited for ROCKET-based classification without such preprocessing.

The classification accuracies for the different stimulus frequencies, detailed by the confusion matrices in Fig. 9, indicate a notable discrepancy in the ROCKET model’s performance across time windows. The improved accuracy for the 5 Hz stimulus in a 4 s window suggests that extending the time window can benefit the classification of lower-frequency stimuli, potentially because more data are available for pattern recognition.

Lastly, the ROC curves in Fig. 10 further substantiate the efficacy of the Morlet-ROCKET model. The AUC values for both ROCKET and Transformer models display an upward trend with increasing time window lengths, reinforcing the notion that longer time windows may facilitate more accurate SSVEP signal classification.

These results emphasize the importance of preprocessing in SSVEP-based BCIs and suggest that the Morlet-ROCKET model, with its ability to accommodate different time windows and stimuli frequencies, could provide a robust framework for future BCI applications. Future work should aim to validate these findings in larger participant cohorts and explore the integration of this approach into real-world BCI systems.

6 Conclusion

This study introduced a hybrid model combining the Morlet wavelet transform and the ROCKET algorithm for classifying SSVEP signals from ear-EEG data. The model outperformed FBCCA and Transformer in detecting frequencies of 5 Hz, 7 Hz, 9 Hz, and 11 Hz. With an accuracy of \(75.5\pm 6.7\%\), it exceeds the \(40\sim 70\%\) range reported in prior ear-EEG SSVEP studies [31]. These findings not only validate the model’s efficacy but also highlight its potential for practical BCI applications, pointing towards non-invasive and user-friendly BCI systems. However, the ROCKET algorithm’s tendency to saturate with large datasets leaves scope for further optimization. Future research should focus on enhancing the model’s efficiency across a broader frequency range and in diverse settings, paving the way for more versatile and effective BCI applications.