1 Introduction

The electrical activity of the brain is measured as an electroencephalogram (EEG) by electrodes placed on the scalp. In the late 1980s and early 1990s, Michael Falkenstein’s group [1] and William J. Gehring’s group [2] respectively observed the error-related negativity (ERN) in EEG. Twenty-five years later, Gehring et al. looked back on the discovery of the ERN in [3]. ERN is a component of the event-related potential in EEG, characterized by a negative deflection (Ne) at approximately 50–200 ms and a positive deflection (Pe) at approximately 200–500 ms after an error, over the frontocentral and centroparietal areas of the brain [1,2,3,4]. A correct-related negativity (CRN) following correct responses has also been observed in EEG; it shows a morphology similar to the ERN but with a smaller amplitude. Generally, the difference waveform (error minus correct) between the ERN and CRN is considered to be the error-related potential (ErrP) [4, 5]. Researchers suggest that ErrP is elicited under certain task conditions [3, 4]. Depending on the task conditions, ErrP is called response ErrP, feedback ErrP, interaction ErrP, observation ErrP, outcome ErrP, execution ErrP, and so on. For example, response ErrP occurs when a subject responds to external events as fast as possible, and feedback ErrP occurs when a subject perceives that the feedback given on a task is incorrect.

The reinforcement learning theory of the ERN suggests that error signals originate in the basal ganglia, spread to the anterior cingulate cortex and then to the cortex, and that the ERN is essential to reinforcement learning in the brain [6]. Errors that are immediately correctable during control are believed to be represented by positive deflections originating from the posterior parietal cortex, which many researchers hypothesize to be involved in action conflict monitoring and movement correction [5, 6]. However, although we know that people produce an ERN when they make mistakes, a consensus view of ErrP is still elusive [3]. Many variations of ErrP with more than two peaks and dissimilar latencies have been observed [4]. Additionally, some researchers have reported that ErrP can also be observed over the motor, somatosensory, parietal, temporal, and prefrontal areas [6].

The brain–computer interface (BCI) is a promising technology that provides a direct information pathway from the brain to external environments without relying on the peripheral nervous system and muscles. It is believed that BCI can be used to replace, restore, enhance, supplement or improve the natural output of the central nervous system. In recent years, advancements in the field have stimulated interest in BCI, and more and more researchers have devoted themselves to its study. Various methods have been introduced to improve BCI; one of them is based on ErrP.

In 2010, Dal Seno et al. studied the use of ErrP in a P300 BCI but found hardly any gain due to the low accuracy of recognizing ErrP [7]. In 2012, a series of studies on using ErrP in P300 BCI were reported. Combaz et al. studied nine subjects’ EEG responses to correct and incorrect BCI feedback and explored the possibility of detecting ErrP in a single trial and integrating the detector into the P300 Speller system [8]. Schmidt et al. carried out a performance assessment of incorporating ErrP into a P300 BCI. Margaux et al. implemented, for the first time, a P300 BCI including an automatic correction function based on online ErrP detection [9]. This P300 BCI corrected the selected character with the second-best guess of a probabilistic classifier whenever ErrP was detected and was evaluated on 16 healthy subjects. Spüler et al. studied the recognition of ErrP through an offline analysis of EEG data from six patients with amyotrophic lateral sclerosis [10]. Thereafter, they developed an error correction system (ECS) that could recognize ErrP online and delete the letter that their P300 BCI had selected incorrectly. They concluded that ErrP can be utilized as a secondary information source to improve the performance of a P300 BCI. In 2015, Mainsah et al. compared ErrP-based correction with rectification based on language models in a P300 BCI [11]. In 2016, Zeyl et al. designed a two-step P300 speller including an ECS and achieved an increase in the selection accuracy of the P300 BCI [12]. In 2018, Cruz et al. implemented a P300 BCI including an ECS [13]. Their P300 BCI selected the character with the second-highest probability as a candidate if ErrP was detected. If the candidate character elicited ErrP again when it was presented to the subject, the P300 BCI reverted to the first selected character; otherwise, it accepted the candidate.

Besides P300 BCI, other kinds of BCI have also been improved by means of recognizing ErrP [14]. In 2019, a review [4] summarized various studies applying ErrP in BCIs, including robot control, wheelchair control, prosthetics, exoskeletons, gesture-enabled BCI, and so on. Since then, similar studies have continued to be reported. Yokota et al. studied the ERN of players engaged in competitive video games and found that ERN could predict failures [15]. To facilitate the use of ErrP in BCI, Keyl et al. systematically examined ErrP in subjects with spinal cord injury (SCI) and compared its characteristics with those of healthy subjects [16]. Chou et al. used ErrP to evaluate the neural activities underlying action-monitoring dysfunction in patients with obstructive sleep apnea [17]. Ehrlich et al. studied the decoding of ErrP in persons interacting with a robot and the feasibility of validating robot actions by online detection of ErrP [18]. Lopes-Dias et al. investigated the feasibility of online asynchronous ErrP detection [19]. Kim et al. used ErrP in their investigation of how subjects responded to delayed cursor control [20].

One of the key problems in detecting ErrP is feature extraction. Many approaches have been tried in the field. Some researchers extracted ErrP features by computing the power of EEG signals in certain frequency bands [21, 22]. Time-frequency analysis using the wavelet transform, a common method in signal processing, has also been utilized to extract ErrP features [21, 23]. When the EEG amplitudes of multiple channels at a given moment are viewed as a sample vector, principal component analysis (PCA) and independent component analysis (ICA), classical methods in pattern recognition, can be applied to ErrP feature extraction [22, 24, 25]. In the early 1990s, Koles et al. proposed the common spatial pattern (CSP) to strengthen the difference between the EEG signals of two kinds of trials [26, 27]. Since then, CSP has been widely applied to the feature extraction of EEG signals, including ErrP [22]. In 2009, an unsupervised algorithm, xDAWN, was proposed to enhance evoked potentials in P300 BCI by estimating spatial filters that project EEG signals into a subspace [28]. In ErrP detection, xDAWN has often been used to intensify ErrP features [9, 22]. Additionally, the windowed mean (WM), which simply averages the EEG signal within each time window, has also been widely used [29,30,31].

As for ErrP, we know that ErrPs occur when people make mistakes. However, we still lack a thorough understanding of ErrP [3]. Many variations of ErrP with dissimilar latencies or over different brain areas have been observed [4, 6]. It is therefore necessary to develop an ErrP feature extraction method that can adapt to the variety of ErrP across subjects. This paper presents our work towards achieving this goal. We developed an approach based on the coefficient of determination and CSP. The proposed approach searches for several effective time windows for a subject, constructs a spatial filter for each time window, and extracts ErrP features using these time windows and spatial filters.

2 Methods

The EEG differences between correct and incorrect responses occur only in certain time segments that vary across subjects [4], and these differences are likely to appear over different brain areas for different subjects [6]. Our approach is therefore based on the idea of tailoring the ErrP feature extraction process to the given subject.

2.1 Difference Measure

To build our method, we need a good measure of the difference between the EEG signals of correct and incorrect responses. In this study, we used the coefficient of determination. Here, a segment of duration T beginning at a feedback onset is called a trial. A trial is labelled +1 if the feedback is correct, or \(-1\) if the feedback is incorrect. Let \(X^i\in \mathbb {R}^{N\times T},\, i\in \{1,\ldots ,n\}\) represent the EEG signals of one of n trials, where N is the number of channels and T is the number of sampling points in a trial. For any \(k\in \{1,\ldots ,N\},\, l\in \{1,\ldots ,T\}\), we view \(X^i_{k,l}\) as an independent variable value and the label (i.e. +1 or \(-1\)) of the i th trial as the dependent variable value, construct a linear regression model, and then compute a coefficient of determination and assign it to \(R^2_{k,l}\), where \(R^2\in \mathbb {P}^{N\times T}\) collects the coefficients of determination of the T sampling points on the N channels.
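For illustration, the following minimal numpy sketch computes the \(R^2\) matrix. It uses the fact that, in a simple linear regression with a single predictor, the coefficient of determination equals the squared Pearson correlation between predictor and label. The array layout (trials, channels, samples) and the name r2_matrix are our own choices, not part of the original implementation.

```python
import numpy as np

def r2_matrix(eeg: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Coefficient-of-determination matrix (Sect. 2.1).

    eeg    : (n_trials, N, T) array of single-trial EEG segments
    labels : (n_trials,) array of +1 (correct) / -1 (incorrect)

    For one predictor, R^2 equals the squared Pearson correlation between
    the predictor and the label, computed here for every (channel, time) pair.
    """
    n, N, T = eeg.shape
    x = eeg - eeg.mean(axis=0)                  # centre each (k, l) series
    y = labels - labels.mean()                  # centre the labels
    num = np.einsum('i,ikl->kl', y, x)          # covariance numerator
    den = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    r = np.divide(num, den, out=np.zeros((N, T)), where=den > 0)
    return r ** 2                               # shape (N, T)
```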

2.2 Adjusted Time Window

Obviously, \(R^2_{k,l}\) measures the EEG difference at the l th time point on the k th channel, and the matrix \(R^2\) is an objective basis for the subsequent steps. We need to search for several time windows for a subject in accordance with \(R^2\). Our method for this problem is formally depicted as Algorithm 1, called GetRange.

Algorithm 1: GetRange

GetRange aims at getting M time windows in which the EEG differences between correct and incorrect responses are more significant than in other time ranges. Intuitively, the purpose of GetRange is to find the time windows around the M biggest peaks among the N channel curves of \(R^2\) for a subject. The input of GetRange, \(R^2\in \mathbb {P}^{N\times T}\), is the matrix of coefficients of determination for a subject, obtained in the way described in Sect. 2.1. The output of GetRange, \(\mathcal {R}_r, r\in \{1,\ldots ,M\}\), represents the M time windows of concern. Algorithm 1 includes three procedures: Search, Merge and Select.

The task of Search is to produce a group of time windows, denoted \(W_k\), for the k th channel according to \(R^2_k\). Search first partitions the time range 1–T into several initial time windows of a fixed length and derives a time window from each initial one through a series of iterative updates that re-centre the new window on the peak of the previous one. It then removes a time window if its peak coefficient of determination is less than that of one of its neighbouring time windows and the time gap between the two peaks (TGBP) is less than a given threshold. Finally, it denotes the group of remaining time windows as \(W_k\). For N channels, the Search procedure produces N groups of time windows, \(W_k, k\in \{1,\ldots ,N\}\).

The Merge procedure merges the N groups of time windows of a subject into one group, in which, among adjacent or overlapping time windows whose TGBP is less than a given threshold, only the time window with the biggest peak coefficient remains. The Select procedure selects the time windows with the M biggest peak coefficients from the merged result and denotes them as \(\mathcal {R}_r, r\in \{1,\ldots ,M\}\).
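Algorithm 1 is given as a figure in the article; the following is only a simplified numpy sketch of its three procedures, using the parameters reported in Sect. 2.3 (window length 101, TGBP threshold 50, M = 4). The number of re-centring iterations and the treatment of any window whose peak lies within the TGBP threshold as a "neighbour" are our own simplifications.

```python
import numpy as np

def get_range(r2, win_len=101, tgbp=50, m=4, n_iter=10):
    """Simplified GetRange (Algorithm 1): Search, Merge, Select."""
    N, T = r2.shape
    windows = []                                   # (peak value, peak time, start, end)

    # --- Search: per-channel windows re-centred on their local peaks ---
    for k in range(N):
        chan = []
        for start in range(0, T, win_len):
            lo, hi = start, min(start + win_len, T)
            for _ in range(n_iter):                # re-centre window on its peak
                peak = lo + int(np.argmax(r2[k, lo:hi]))
                lo = max(0, peak - win_len // 2)
                hi = min(T, lo + win_len)
            chan.append((r2[k, peak], peak, lo, hi))
        # drop a window whose peak is lower than a close neighbour's peak
        kept = [w for i, w in enumerate(chan)
                if not any(abs(w[1] - v[1]) < tgbp and v[0] > w[0]
                           for j, v in enumerate(chan) if j != i)]
        windows.extend(kept)

    # --- Merge: among close windows from all channels keep the biggest peak ---
    windows.sort(key=lambda w: w[0], reverse=True)
    merged = []
    for w in windows:
        if all(abs(w[1] - v[1]) >= tgbp for v in merged):
            merged.append(w)

    # --- Select: the M windows with the largest peak coefficients ---
    return [(lo, hi) for _, _, lo, hi in merged[:m]]
```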

2.3 Window-Adjusted Common Spatial Pattern

The EEG differences may appear over different brain areas for different subjects [6]. To extract ErrP features effectively, we apply the common spatial pattern (CSP) [26, 27] within each of the selected time windows and combine the results of the several CSPs for feature extraction. We call this method the window-adjusted common spatial pattern (WACSP), as presented in Algorithm 2.

Algorithm 2: Window-Adjusted Common Spatial Pattern

The input of Algorithm 2, \(X^i\in \mathbb {R}^{N\times T},\, i\in \{1,\ldots ,n\}\), is the EEG signals of n trials in the training dataset of a subject. Its outputs \(\mathcal {R}_r, P_r, r\in \{1,\ldots ,M\}\) are respectively M selected time windows and their corresponding transformation matrices for the subject. For feature extraction, \(\mathcal {R}_r, P_r, r\in \{1,\ldots ,M\}\) can be used to cut out the EEG signals and transform them to feature vectors. Algorithm 2 contains three procedures: GetMatrix, GetRange and GetTrans.

The GetMatrix procedure obtains the matrix of coefficients of determination, \(R^2\), for a subject by processing the EEG data of the n trials in the subject's training dataset in the way described in Sect. 2.1. The GetRange procedure has been depicted in Sect. 2.2. The GetTrans procedure combines the adjusted time windows with the common spatial pattern. For each \(\mathcal {R}_r\), GetTrans first cuts the signal segments of \(X^i\) within the time window into \(Y^i\in \mathbb {R}^{N\times T\!r}, i\in \{1,\ldots ,n\}\), where \(T\!r\) is the duration of \(\mathcal {R}_r\); then it conducts a series of computations on \(Y^i\) to obtain \(P_r\).

$$\begin{aligned} \Sigma _i=\frac{Y^i(Y^i)^T}{trace(Y^i(Y^i)^T)},\quad i\in \{1,\ldots ,n\} \end{aligned}$$
(1)

The first step of the computations is to compute the covariance matrices of the n trials according to Eq. 1, and then to obtain \(\Sigma _c\), the mean covariance matrix of the correct responses; \(\Sigma _e\), the mean covariance matrix of the incorrect responses; and \(\Sigma _s\), the sum of \(\Sigma _c\) and \(\Sigma _e\).

$$\begin{aligned} \Sigma _s=B\lambda B^T \end{aligned}$$
(2)

The second step is to perform the eigendecomposition of \(\Sigma _s\) as depicted in Eq. 2, where \(\lambda \) is the diagonal matrix of eigenvalues of \(\Sigma _s\) and B is a matrix composed of the normalized eigenvectors of \(\Sigma _s\); to obtain a matrix W by \(W=\lambda ^{-1/2}B^T\); and to transform \(\Sigma _c\) and \(\Sigma _e\) to \(S_c\) and \(S_e\) by \(S_c=W \Sigma _c W^T\) and \(S_e=W \Sigma _e W^T\).

$$\begin{aligned} S_c=U\psi _c U^T \end{aligned}$$
(3)
$$\begin{aligned} S_e=U\psi _e U^T \end{aligned}$$
(4)
$$\begin{aligned} I=\psi _c +\psi _e \end{aligned}$$
(5)

According to [26, 27], it can be inferred that \(S_c\) and \(S_e\) have the same eigenvectors. This is described by Eqs. 3 and 4, where U is the common eigenvector matrix of \(S_c\) and \(S_e\), and \(\psi _c\) and \(\psi _e\) are respectively the diagonal matrices of eigenvalues of \(S_c\) and \(S_e\). Additionally, \(\psi _c\) and \(\psi _e\) satisfy the relation expressed by Eq. 5, where I is the identity matrix.

$$\begin{aligned} P_r=(\hbar (U))^T W \end{aligned}$$
(6)

The final step of the computations is to obtain the matrix \(P_r\) by Eq. 6, where \(\hbar (\cdot )\) denotes selecting the first and last few columns of U after sorting the eigenvectors in ascending order of their eigenvalues.
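A numpy sketch of the GetTrans computations (Eqs. 1–6) is shown below. Keeping four columns from each end of U follows the setting reported in Sect. 2.3; the function and variable names are ours, and the sketch assumes \(\Sigma _s\) has full rank.

```python
import numpy as np

def get_trans(Y, labels, n_keep=4):
    """GetTrans: CSP transformation matrix P_r for one time window (Eqs. 1-6).

    Y      : (n_trials, N, Tr) EEG segments cut out by the window R_r
    labels : (n_trials,) array, +1 for correct and -1 for incorrect feedback
    """
    # Eq. 1: trace-normalised covariance of every trial
    covs = np.array([y @ y.T / np.trace(y @ y.T) for y in Y])
    sigma_c = covs[labels == 1].mean(axis=0)      # mean covariance, correct
    sigma_e = covs[labels == -1].mean(axis=0)     # mean covariance, incorrect
    sigma_s = sigma_c + sigma_e

    # Eq. 2: whitening matrix from the eigendecomposition of sigma_s
    eigval, B = np.linalg.eigh(sigma_s)
    W = np.diag(eigval ** -0.5) @ B.T
    S_c = W @ sigma_c @ W.T
    S_e = W @ sigma_e @ W.T                       # S_c + S_e = I (Eqs. 3-5)

    # Eqs. 3-6: common eigenvectors; keep the columns with the most extreme
    # eigenvalues (the first and last n_keep after the ascending sort of eigh)
    psi_c, U = np.linalg.eigh(S_c)
    sel = np.concatenate([U[:, :n_keep], U[:, -n_keep:]], axis=1)
    return sel.T @ W                              # P_r, shape (2*n_keep, N)
```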

For feature extraction, the EEG signals of a trial are cut into \(Y_r\) according to \(\mathcal {R}_r, r\in \{1,\ldots ,M\}\). The feature vector of the trial can then be obtained by Eq. 7,

$$\begin{aligned} v_f=\mathcal {F}(P_1 Y_1,\ldots , P_M Y_M) \end{aligned}$$
(7)

where \(\mathcal {F}(\cdot )\) represents the computation that transforms \(P_1 Y_1,\ldots , P_M Y_M\) into a feature vector. \(\mathcal {F}(\cdot )\) may be implemented in various ways. One common implementation is to concatenate all rows of \(P_1 Y_1,\ldots , P_M Y_M\) into a vector, sometimes followed by downsampling. Another is to compute the variance of each row of \(P_rY_r, r\in \{1,\ldots ,M\}\) and concatenate all variances into a vector.

In this study, the sample of a trial consisted of 1000 ms of EEG signals on 32 channels, that is, \(X^i\in \mathbb {R}^{32\times 1000}\); the fixed length of the time window was 101 ms (adjusted if the window extended beyond the range 1–T); and the threshold of TGBP was set to 50 ms. The implementation of WACSP produced four time windows, that is, \(M=4\). \(\hbar (\cdot )\) was set to select the first and last four columns of U, meaning that every time window produced a \(P_r\in \mathbb {R}^{8\times 32}\), which projected the 32-channel EEG signals in the time window onto eight virtual channels by \(P_rY_r\). Finally, it computed the variance on each virtual channel in each of the four time windows to obtain the 32-dimensional feature vectors.
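To tie the pieces together, the following hedged sketch shows the end-to-end use of Eq. 7 with the variance-based \(\mathcal {F}(\cdot )\) and the settings above (M = 4 windows, eight virtual channels, hence 32 features). It reuses the hypothetical helpers r2_matrix, get_range and get_trans from the earlier sketches; the function names wacsp_fit and wacsp_features are ours.

```python
import numpy as np

def wacsp_fit(eeg, labels, m=4):
    """Training phase of Algorithm 2: find M windows and their CSP matrices."""
    r2 = r2_matrix(eeg, labels)                    # Sect. 2.1 sketch
    ranges = get_range(r2, m=m)                    # Algorithm 1 sketch
    filters = [get_trans(eeg[:, :, lo:hi], labels) for lo, hi in ranges]
    return ranges, filters

def wacsp_features(trial, ranges, filters):
    """Eq. 7 with a variance-based F(.): one 32-dimensional vector per trial."""
    feats = []
    for (lo, hi), P in zip(ranges, filters):
        Z = P @ trial[:, lo:hi]                    # eight virtual channels
        feats.append(Z.var(axis=1))                # variance per virtual channel
    return np.concatenate(feats)                   # 4 windows x 8 channels = 32 features
```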

2.4 Other Methods

Besides WACSP, we also implemented the commonly used approaches in the field and compared them with WACSP. All implemented methods transformed a sample into a 32-dimensional vector in their respective ways. Here, we briefly introduce these methods: windowed means (WM), band power (BP), time-frequency (TF), principal component analysis (PCA), independent component analysis (ICA), xDAWN and common spatial pattern (CSP).

WM, a simple but widely used technique, obtains feature vectors by averaging the EEG signal within every 250-ms time window [29,30,31]. BP uses the EEG power in a few frequency bands of interest as features [21, 22]. To carry out BP in this study, we calculated the power in the 1–30 Hz band within every 250-ms time window and concatenated the band powers into a feature vector. To obtain 32-dimensional feature vectors, we confined the calculations of WM and BP to the channels F3, Fz, F4, FC3, FCz, FC4, Cz and CPz, which are most closely related to ErrP. The TF approach usually uses the wavelet transform to process EEG signals and, according to the Fisher criterion, selects as features the time-frequency points that differ most under the different conditions. The details are described in Algorithm 1 of [23]. Our implementation of TF selected, as the feature vector, the 32 time-frequency points with the largest differences between conditions.
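A small numpy/scipy sketch of the WM and BP baselines as described above follows (eight channels, 250-ms windows, 1–30 Hz band for BP). Taking band power as the mean squared amplitude of the band-passed segment and using a fourth-order Butterworth filter are our own assumptions, chosen only for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000                                   # sampling rate (Hz)
WIN = 250                                   # window length in samples (250 ms)

def wm_features(trial):
    """Windowed means: average each 250-ms window on each channel.

    trial : (8, 1000) array of the eight channels listed above.
    """
    n_ch, T = trial.shape
    wins = trial[:, :T - T % WIN].reshape(n_ch, -1, WIN)
    return wins.mean(axis=2).ravel()        # 8 channels x 4 windows = 32 features

def bp_features(trial):
    """Band power: 1-30 Hz power in each 250-ms window on each channel."""
    b, a = butter(4, [1, 30], btype='bandpass', fs=FS)   # filter order assumed
    filt = filtfilt(b, a, trial, axis=1)
    n_ch, T = filt.shape
    wins = filt[:, :T - T % WIN].reshape(n_ch, -1, WIN)
    return (wins ** 2).mean(axis=2).ravel() # mean squared amplitude per window
```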

PCA and ICA are both classical methods in pattern recognition. When applied to the feature extraction of EEG signals, they treat the EEG amplitudes of all channels at a time point as one sample vector and, by a series of computations on the collection of such vectors, obtain matrices that transform the raw EEG signals to signals on a few virtual channels [22]. Naturally, the principles and results of the two methods differ. xDAWN was originally proposed to enhance P300 evoked potentials in BCI [28] and has since been extensively exploited in similar tasks [9, 22]. By an unsupervised algorithm, xDAWN estimates a spatial filter that projects the raw EEG signals onto a signal subspace [28]. Similarly, CSP, which was proposed specifically for processing EEG signals [26, 27], also constructs a spatial filter, but its core is a supervised algorithm. A considerable body of research shows that CSP is a very good method for processing EEG signals [22]. In this study, PCA, ICA, xDAWN and CSP all projected the EEG signals of a sample onto eight virtual channels in their respective ways, computed a variance in every 250-ms window on each virtual channel, and concatenated all variances into a feature vector.

3 Experiment and Data

In this study, we recruited 20 right-handed, BCI-naive subjects (10 males, 10 females) aged 19 to 28 years (mean 23 years, standard deviation 2.35) to participate in the experiment. Subjects with a history of visual or neurological disorders, head trauma or any drug use that would affect nervous system function were excluded, and the subjects were asked to wash their hair before the experiment. The experiment was approved by the Institutional Review Board at Fuzhou University. In accordance with the Declaration of Helsinki, informed consent was obtained from all subjects after a detailed explanation of the study.

Fig. 1: The flow chart of the experiment and EEG data processing

As shown in Fig. 1, all subjects carried out two sessions at two different times. The first session used a pseudo-detector to detect the symbols: the subjects were provided with outcomes generated by the BCI platform according to configured error rates, while being told that a real detector was working. In the second session, a real classifier, trained on the data of the first session by stepwise linear discriminant analysis (SWLDA) [32], was adopted to recognize the P300, and the recognition results corresponding to a symbol were synthesized to detect the symbol. Each session included 14 runs. In a run, the subjects selected 18 symbols of Chinese pinyin through brain–computer interaction on the platform. For each symbol, six sequences of flashes were presented. Every time a symbol was selected by the platform, it was presented to the subject. The selected symbol could be correct or incorrect. For a subject, the ratio of the number of incorrect detections to the total number of symbol detections is called the error rate. The mean error rate of the 20 subjects was 24.9% and the standard deviation was 5.6%.

Fig. 2: The course of selecting a symbol on the BCI platform

Based on BCI2000 [33], we developed a BCI platform for Chinese pinyin. In essence, the BCI platform is a P300 BCI system [34]. It uses a standard \(6\times 6\) matrix of symbols [34] to present stimuli in the way proposed by Townsend et al. [35], known as the checkerboard stimulus paradigm. Under this paradigm, the 36 symbols are randomly rearranged in an inner \(6\times 6\) matrix. In a sequence, the six groups of symbols in the presentation matrix corresponding to the six rows of the inner matrix, and the six groups corresponding to its six columns, are each flashed once in random order. Normally, a few sequences are needed to detect a symbol. The main differences between our BCI platform and traditional P300 BCIs are that the symbols in the presentation matrices are not English characters but the initial consonants, vowels or tones of Chinese pinyin, and that the platform includes a formal result-presentation step after P300 detection, allowing subjects to judge whether the result is correct while observing it.
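As a toy illustration of the checkerboard paradigm described above (36 symbols randomly placed in an inner 6×6 matrix, whose rows and columns define the 12 flash groups), one flash sequence could be generated as sketched below. This is only an illustration of the paradigm, not the platform's actual implementation.

```python
import random

def checkerboard_sequence(symbols):
    """One flash sequence: 12 groups of 6 symbols each, presented in random order."""
    assert len(symbols) == 36
    shuffled = random.sample(symbols, k=36)                 # random inner 6x6 matrix
    inner = [shuffled[i * 6:(i + 1) * 6] for i in range(6)]
    groups = [row[:] for row in inner]                      # six groups from the rows
    groups += [[inner[i][j] for i in range(6)] for j in range(6)]  # six groups from the columns
    random.shuffle(groups)                                  # each group flashed once, in random order
    return groups
```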

Figure 2 shows the course of selecting a symbol on the BCI platform. This course includes three steps. In Step 1, the current target is cued for 2.4 s using an ellipse frame, as shown in the first screen of Fig. 2. Step 2 contains six sequences of flashes; in each sequence, each of the 12 groups of symbols is flashed once, with no breaks between sequences. The intensification duration of a flash is 80 ms and the interval between successive intensification onsets is 120 ms. Step 2 is shown from the second to the third screen of Fig. 2. During Step 2, the subjects are instructed to silently count how many times the target has flashed in order to keep their attention. In Step 3, the result of P300 detection is presented, as shown in the fourth screen of Fig. 2, and the BCI platform records the EEG signals while the subjects observe the feedback. In the context of this paper, Step 3 of the course is called a trial.

A 64-channel Neuroscan system, including the EEG cap, the amplifier, and the signal acquisition software, was used to acquire EEG signals. For convenience, the EEG signals of only 32 channels were recorded: FP1, FP2, F7, F3, Fz, F4, F8, FT7, FC3, FCz, FC4, FT8, T7, C3, Cz, C4, T8, TP7, CP3, CPz, CP4, TP8, P7, P3, Pz, P4, P8, PO7, PO8, O1, Oz and O2. The sampling rate was set to 1000 Hz. The procedure for constructing the EEG datasets is included in Fig. 1. The raw EEG signals were first preprocessed with a common average reference and a finite impulse response filter of order 64 with a frequency range of 0.1–30 Hz. Next, the EEG signals of the 1000 ms beginning at each feedback onset were segmented after baseline correction, which subtracted the mean of the 200 ms before the feedback onset. Then, a Savitzky–Golay filter [36] of order 3 with a window length of 101 was applied to smooth the EEG segments. Finally, every EEG segment, as the sample of a trial, was labelled according to the feedback result, +1 if correct or \(-1\) if incorrect, and was added to the dataset of the subject. The dataset of a subject contained 504 samples, including about 399 samples corresponding to correct feedback (labelled +1) and about 125 samples corresponding to incorrect feedback (labelled \(-1\)).
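A scipy-based sketch of this preprocessing chain (common average reference, order-64 FIR band-pass at 0.1–30 Hz, epoching from feedback onset, 200-ms baseline correction, Savitzky–Golay smoothing) is given below. The FIR design method and the use of zero-phase filtering are assumptions on our part; only the filter order, band and smoothing parameters come from the text.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, savgol_filter

FS = 1000                                                    # sampling rate (Hz)

def preprocess(raw, onsets):
    """raw: (32, n_samples) continuous EEG; onsets: feedback-onset samples (each >= 200)."""
    car = raw - raw.mean(axis=0, keepdims=True)              # common average reference
    b = firwin(65, [0.1, 30.0], pass_zero=False, fs=FS)      # order-64 FIR band-pass (design assumed)
    filtered = filtfilt(b, [1.0], car, axis=1)               # zero-phase here; causal filtering is also plausible
    epochs = []
    for t in onsets:
        seg = filtered[:, t:t + 1000]                        # 1000 ms from feedback onset
        seg = seg - filtered[:, t - 200:t].mean(axis=1, keepdims=True)   # 200-ms baseline correction
        epochs.append(savgol_filter(seg, window_length=101, polyorder=3, axis=1))
    return np.stack(epochs)                                  # (n_trials, 32, 1000)
```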

Fig. 3: Comparison of classification performances. The Y axis represents the Acc, AUC and F1 values of the proposed WACSP method in figures (a), (b) and (c) respectively, and the X axis represents the corresponding Acc, AUC and F1 values of CSP, PCA, BP, ICA, xDAWN, TF or WM. Each point represents a comparison between WACSP and one of the methods CSP, PCA, BP, ICA, xDAWN, TF and WM for one subject; in total, there are 20 subjects. For example, point P in figure (a) represents the Acc values for subject S11, 0.68 by WACSP and 0.57 by TF. Points above the dashed line indicate that WACSP outperforms the compared method (the Y value is greater than the X value). The figure shows that for all three indicators, Acc, AUC and F1, WACSP achieves better results for most subjects

Table 1 Ranking the eight methods
Table 2 Comparing the accuracies of the various methods. The means and standard deviations of the accuracies are shown in % for each method. The comparison was performed using repeated measures ANOVA (p value < 0.05) with LSD adjustment. The legend \(\uparrow \) represents significantly higher and \(\downarrow \) significantly lower. The legend entries are interpreted row-wise. For example, \(\uparrow \) in (1,2) means that the accuracies of WACSP are significantly higher than those of CSP
Table 3 Comparing the AUCs of the various methods. The means and standard deviations of the AUCs are shown in % for each method. Likewise, the comparison was performed using repeated measures ANOVA (p value < 0.05) with LSD adjustment. The legends \(\uparrow \) and \(\downarrow \) respectively represent significantly higher and significantly lower. The legend entries are interpreted row-wise. For example, \(\uparrow \) in (2,4) means that the AUCs of CSP are significantly higher than those of BP
Table 4 Comparing the F1 of the various methods. The means and standard deviations of the F-measures are shown in % for each method. Similarly, the comparison was performed using repeated measures ANOVA (p value < 0.05) with LSD adjustment. The legends \(\uparrow \) and \(\downarrow \) respectively indicate significantly higher and significantly lower. The legend entries are interpreted row-wise. For example, \(\downarrow \) in (4,6) means that the F-measures of BP are significantly lower than those of xDAWN

4 Results

In this study, every method transformed the sample set of each subject into a set of feature vectors. The feature vector set of each subject was randomly split into five subsets. Each subset was in turn selected as a test set, with the other four subsets merged as the corresponding training set. Each test set contained the feature vectors of 100 or 101 trials, so the chance level of classification accuracy was 0.58. Since our focus was on feature extraction, only shrinkage linear discriminant analysis (sLDA) [29, 37], which has performed very well in EEG classification [22, 29], was used to train the classifiers on the training sets and classify the feature vectors in the corresponding test sets. Following [38], three performance indexes were used to compare our approach with the commonly used methods: accuracy (Acc), area under the receiver operating characteristic curve (AUC) and F-measure (F1). Due to the 5-fold procedure, every method obtained the three performance values five times for each subject. In Fig. 3, the mean values are presented for each performance index.
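A sketch of this evaluation loop is shown below, using scikit-learn's LDA with automatic shrinkage as a stand-in for the sLDA classifier. The 5-fold split and the three metrics follow the description above; treating the error class as the positive class for F1 and the unstratified shuffled split are our assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

def evaluate(features, labels, seed=0):
    """5-fold evaluation of one feature-extraction method for one subject."""
    acc, auc, f1 = [], [], []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=seed).split(features):
        clf = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')  # shrinkage LDA
        clf.fit(features[tr], labels[tr])
        pred = clf.predict(features[te])
        score = clf.decision_function(features[te])
        acc.append(accuracy_score(labels[te], pred))
        auc.append(roc_auc_score(labels[te], score))
        f1.append(f1_score(labels[te], pred, pos_label=-1))  # error class as positive (assumption)
    return np.mean(acc), np.mean(auc), np.mean(f1)
```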

Figure 3 graphically compares the Acc, AUC and F1 of WACSP with those of each of the other methods on all subjects. In Fig. 3, (a) shows the comparison of accuracy, (b) the comparison of AUC and (c) the comparison of F1. The Y value of each point is the performance value of WACSP for a subject, and its X value is the performance value of one of the other methods (CSP, PCA, BP, ICA, xDAWN, TF or WM) for that subject; each legend entry denotes a specific method. A point lying above the diagonal means that WACSP outperforms that method for the subject. According to Fig. 3, we can roughly conclude that WACSP performed better than the other methods. We ranked the eight methods according to the means of Acc, AUC and F1; the ranking is shown in Table 1. A Friedman test (p value < 0.01) shows that a significant performance difference exists among the eight methods.

Furthermore, repeated measures analysis of variance (ANOVA) with least significant difference (LSD) adjustment (p value < 0.05) was used to test the statistical significance of the differences in Acc, AUC and F1 among these methods. Table 2 shows the means and standard deviations of the accuracies of WACSP, CSP, PCA, BP, ICA, xDAWN, TF and WM and the results of their pair-wise comparisons. The accuracies of WACSP are significantly higher than those of the other methods; this statistical inference is consistent with the intuition from the left subgraph of Fig. 3. No significant difference in accuracy exists among CSP, ICA, PCA, BP, xDAWN and WM, and the accuracies of these six methods are significantly higher than those of TF.

Table 3 presents the means and standard deviations of the AUCs of the eight methods and the results of their pairwise comparisons. The AUCs of WACSP significantly exceed those of the other methods, and the rough conclusion from the middle subgraph of Fig. 3 is thus further verified by the statistical inference. No significant difference in AUC is found among CSP, PCA, xDAWN and WM. The AUCs of CSP, PCA, xDAWN and WM are significantly higher than those of BP, ICA and TF. The AUCs of BP are significantly higher than those of ICA and TF, and the AUCs of ICA are significantly higher than those of TF.

Likewise, Table 4 exhibits the means and standard deviations of the F-measures of the eight methods and the results of their pair-wise comparisons. The F-measures of WACSP are significantly higher than those of the other methods, which further verifies the intuitive conclusion from the right subgraph of Fig. 3. No significant differences in F-measure appear among CSP, PCA, xDAWN and WM. The F-measures of CSP, PCA, xDAWN and WM are significantly higher than those of BP, ICA and TF, and no significant differences in F-measure are found among BP, ICA and TF.

Fig. 4: ErrP waveforms (error minus correct) on the channels Fz, FCz and Cz, and the average waveform of the three, for one subject

Fig. 5: The changes of the coefficients of determination over time and their distributions on the scalp

The ErrP waveforms of one subject at Fz, FCz and Cz are presented in Fig. 4. Compared with the ErrP waveforms in [10], the negative peak at 670 ms and the positive peak at 840 ms in Fig. 4 are extra. However, each peak in these ErrP waveforms has a counterpart in Fig. 1 of [39], except that the latencies in Fig. 4 are about 100 ms later. As for the latency difference, it is known that latencies vary across subjects; moreover, the onsets of stimuli in different tasks are specified in different ways, which probably also contributes to the difference. In summary, the comparisons show that both similarities and dissimilarities to the ErrP waveforms in the related literature exist.

Figure 5 shows the \(R^2\) of the subject graphically. The first row of Fig. 5 presents the curves representing the changes of the coefficients of determination over time; the two peak regions of the curves are marked in red. The second row presents the brain maps drawn using the coefficients of determination of all channels at the first peak moments, and the brain maps in the third row correspond to the second peak moments. Before drawing the pictures, we divided the training set of the subject into five parts, randomly and evenly, and obtained five training subsets by deleting one part from the training set each time. The five columns of Fig. 5 correspond to the five training subsets. In the first row, the two peak regions of all five training subsets, marked in red, are very similar. Likewise, only very small differences exist among the brain maps in the second row, and among those in the third row as well. This reveals that the spatio-temporal differences in EEG between correct and incorrect responses are stable for a subject. Additionally, in Fig. 5, the spatial distribution patterns presented by the brain maps in the third row differ markedly from their counterparts in the second row, indicating that the spatial pattern of ErrP changes over time. To adapt to this situation, our method searches for several time windows for a subject and builds a separate spatial filter in each time window.

5 Discussion

For the feature extraction of ErrP, many methods, such as WM, CSP, PCA, BP, ICA, xDAWN and TF, have been investigated [9, 21,22,23, 26,27,28,29,30,31]. However, ErrPs with dissimilar latencies over different brain areas have been discovered [4, 6], and ErrP is still not fully understood [3]. As a result, methods developed for other pattern-recognition tasks may miss the point of ErrP feature extraction, even though they seemingly work.

Based on extensive observations, we think that, for a subject, the spatial pattern of ErrP is stable within a time window although it changes from one time window to another. The point of ErrP feature extraction is therefore to find the time windows in which the EEG differences between correct and incorrect responses are significant, and to capture the spatial pattern of the EEG difference in each time window. Our method, WACSP, was developed on the basis of this idea: we introduced the coefficient of determination to measure the EEG differences and guide the search for time windows, and further used CSP [26, 27] in each time window to obtain a matrix projecting the EEG signals of that window onto a few virtual channels.

Among the methods WM, CSP, PCA, BP, ICA, xDAWN and TF, CSP and xDAWN are the most similar to WACSP, so the difference between WACSP, CSP and xDAWN deserves further attention. CSP and xDAWN estimate their spatial filters directly from the given EEG signals, while the proposed WACSP estimates spatial filters in several time windows, which are mined under the guidance of the coefficients of determination that reflect the EEG dissimilarities under different conditions.

We tested WACSP and the commonly used methods on data sets built from the EEG signals acquired during P300 BCI experiments with feedback. The results presented graphically in Fig. 3 show that WACSP clearly outperforms the commonly used methods WM, CSP, PCA, BP, ICA, xDAWN and TF. The series of statistical inferences shown in Tables 1, 2, 3 and 4 further confirms the advantage of WACSP over the alternative methods. In turn, the superiority of WACSP supports our idea that, for a subject, the spatial pattern of the EEG difference between correct and incorrect responses within a time window remains stable even though it varies from one time window to another.

Additionally, we also compared the performances of WM, CSP, PCA, BP, ICA, xDAWN and TF. In Table 1, the average ranks of WM, CSP and PCA (2.67, 3 and 3.33) are close, followed by that of xDAWN (5). According to Tables 2, 3 and 4, no significant differences in Acc, AUC and F1 among WM, CSP, PCA and xDAWN are observed, while the Acc, AUC and F1 values of WM, CSP, PCA and xDAWN are significantly higher than their counterparts for BP and TF. A commonality of CSP, PCA and xDAWN is that they all consider the spatial distribution of the EEG differences between correct and incorrect responses during ErrP feature extraction. Thus, we infer that the spatial pattern of the EEG differences is very important for detecting ErrP. On the other hand, the cognitive fundamentals implied by these feature extraction methods also remain of concern to us.

The improvement of P300 BCI involves many aspects, and using ErrP detection is one direction of exploration. We acquired data in the P300 BCI experiments shown in Fig. 2 and studied the performance of ErrP detection on the resulting data sets. The results show that the proposed method, WACSP, is a good option for feature extraction in single-trial ErrP detection and that it is feasible to integrate ErrP detection into online P300 BCIs to improve the information transfer rate. We will continue along this road in the future.

6 Conclusions

In summary, for a subject there are several time windows in which the spatial difference patterns of EEG between correct and incorrect responses are significant and stable, while the patterns vary from one time window to another. Based on this idea, WACSP for detecting ErrP in P300 BCI was developed by combining the coefficient of determination and CSP. We tested WACSP and compared it with the commonly used methods on data sets constructed from the EEG signals acquired during P300 BCI experiments with feedback. Both the intuitive comparisons and the statistical analyses of accuracy, AUC and F-measure show that WACSP significantly outperforms the commonly used methods. These results demonstrate the superiority of WACSP and verify the idea underlying it.