1 Introduction

The motor imagery (MI) paradigm is a form of brain–computer interface (BCI) that performs the imagination of a motor action without real execution, relying on the similarities between imagined and executed actions at the neural level. MI is usually measured with electroencephalography (EEG) to register brain activity on the scalp surface. Thus, assessment and interpretation of MI brain dynamics in the sensorimotor cortex may contribute to applications ranging from evaluation of pathological conditions and rehabilitation of motor functions [1, 2], motor learning and performance [3], improving the learning of different abilities [4], among others. In education scenarios, the Media and Information Literacy methodology proposed by the UNESCO covers several competencies that are vital for people to be effectively engaged in all aspects of human development [5]. Nevertheless, one of the main challenges in implementing MI practice is recognizing and identifying the imagined actions since EEG signals have substantial intra- and inter-subject variability [6].

Currently, there is an increasing interest in deep learning models that are composed of multiple processing layers of inference using data representations with multiple levels of abstraction. In discriminating physiological signals, Convolutional Neural Networks (CNN) become the leading deep learning architectures due to their regularization structure and degree of translation invariance [7], yielding an outstanding ability in transferring knowledge between apparently different tasks of classification [8, 9]. Thus, CNN models are useful in learning features related to brain imaging and neuroscience discovery [10]. Nevertheless, for applications in MI tasks, designing an available end-to-end CNN architecture remains a challenge due to several restrictions: their large number of hyperparameters to be learned increase the computational burden (being unsuitable for online processing [11]), and complicated multilayer integration to encode relevant features at every abstraction level of the input EEG data [12].

Another unresolved issue is the interpretability of results provided by CNN models [13]. That is, along with the improved accuracy, the learned features can be hard to understand within the context of the original MI paradigm. The value of neural activity interpretation becomes evident in purposes like a medical diagnosis, monitoring, and computer-aided learning [14]. As a tool in image processing, CNN architecture has been discussed for enhancing the physiological explanation of MI paradigms represented by multiple time-series (time dimension), which reflect the brain responses across the sensorimotor cortex (spatial dimension), and commonly related to \(\mu \) and \(\beta \) rhythms (spectral dimension). For representing local and global structures in CNN models, therefore, the extraction of time-series features is increasingly realized as a multi-dimensional tensor that retains the EEG data structure throughout the learning process, by adequately encoding the spatio/spectro-temporal relationships of the measured MI responses [15]. Nevertheless, CNN models should extract the structure of multi-dimensional images properly to preserve the domain information of interest. Intending to make the learned features more interpretable in MI tasks, two main aspects are to be considered to retain the spatial locality of CNN models: (i) improving the 2D-feature extraction from EEG data for feeding the CNN models, and (ii) enhancing the image-based EEG representation to integrate spatial domain knowledge with the extracted 2D spectro-temporal features.

For building 2D-maps in discrimination of MI tasks, several algorithms of feature extraction are employed in CNN models, including the following: common spatial patterns due to the high recognition rate and computational simplicity [16]; event-related synchronization to capture the channel-wise temporal dynamics of the power signal [17]; empirical mode decomposition to deal with EEG nonstationarity [18, 19]; and time–frequency planes using the Fourier and wavelet transforms are frequently extracted because they allow a more straightforward interpretation [20,21,22,23], the latter decomposition being better suited to deal with sudden changes in EEG signals. Nonetheless, the extracted 2D images tend to have substantial variability in patterns across trials due to inherent nonstationarity, artifacts, a poor signal-to-noise ratio of EEG signals, individual differences in cortical functioning (like subjects exhibiting activity in different frequency bands).

Concerning the integration of electrode montages with the extracted 2D features, topographical representations are applied, involving either local or global spline techniques to interpolate the spatial distribution of the potential field on the scalp from distributed electrode arrays. For low electrodes distributions, adequate mapping is the spherical spline interpolation [24]. One strategy of integration is to incorporate prior knowledge to optimize the neural network structure for handling the lack of significant samples in smaller datasets. For instance, pre-trained networks are used, but assuming a substantial similarity between pre-training and target sets [25,26,27]. Otherwise, some ambiguity may remain in the foolproof nature of the pre-trained network methodology [28]. In the case of MI tasks, there are very few accessible datasets having some differences in implementing the paradigm. Another integration approach is to have some form of spatial dropping algorithm to remove candidate localities known to be not engaged with the elicited brain response. Relying on the fact that motor imagery responses are directly related to electrocortical activity over the sensorimotor area, the spatial dropping can be performed either subject-independent by excluding all electrodes out of the motor cortex before training and validation [29,30,31], or by thresholding the electrode contribution after training and validation for each subject.

Here, we develop a CNN architecture with an enhanced interpretation of the spatial activity of brain neural patterns that mainly contribute to the classification of MI tasks (left, right hand, and foot). Following the approach developed by [32], the CNN framework is designed, for which we validate two commonly used techniques of feature extraction from EEG data: power spectral density and continuous wavelet transform. For preserving the spatial interpretation of extracting EEG patterns, we project the multi-channel data using a topographic interpolation. Besides, we include a spatial dropping algorithm to remove the learned weights that reflect the localities not engaged with the elicited brain response. Obtained results in a MI database show that the thresholding strategy is desirable since the highest contribution clusters over the sensorimotor area with differentiated behavior between \(\mu \) and \(\beta \) bands. The present paper’s agenda is as follows: Section 2 describes the collection of MI data used for validation. Besides, it presents the fundamentals of feature extraction of time–frequency (t-f) EEG patterns and describes the design of Convolutional Neural Networks, including the spatial dropping strategies for motor imagery classification. Further, Section 3 provides a summary of the classifier accuracy performed by the extracted t-f vectors and evaluates the interpretability of learning weights for distinguishing between MI tasks. Lastly, Section 5 gives critical insights into the performed interpretation and accuracy, and address some limitations and possibilities of the presented CNN-based framework.

2 Materials and methods

Description of MI database and preprocessing. We perform experimental validation with nine subjects (\(N_S = 9\)) of Dataset 2aFootnote 1, holding EEG signals acquired from the scalp by a C-channels montage (\(C = 22\)). Each raw EEG channel \({\varvec{x}}^c{\in }\mathbb {R}^{T}\) was sampled at 250 Hz (i.e., at sample rate \(\varDelta t{=}0.004\) s) and passed through a five-order bandpass Butterworth filter within \(\varOmega = [8,30]\) Hz. Since earlier works have shown that electrical brain activities prompted by motor tasks are frequently related to \(\mu \) and \(\beta \) rhythms [33], the spectral range is split into the following bandwidths of interest: \(\varDelta f{\in }\,\)\( \{ \mu {\in }\, [8{-}12], \)\(\beta _{{\text {low}}}{\in }\,[16{-}20], \)\(\beta _{{\text {med}}}{\in }\,[20{-}24],\)\( \beta _{{\text {high}}}{\in }\,[24{-}28] \}\) Hz.

For performing an MI task, each trial began with an acoustic cue “beep” (at 0 s), and along with a fixation cross appeared on the black screen. After 2 s (at 2 s), an arrow cue appeared for 1.25 s on the screen, pointing in one direction according to the evaluated MI task: the left (left hand), right (right hand), or down (foot). The subjects were then instructed to image the corresponding imaginary movement between 3 s and 6 s. At 6 s, the screen was black again, allowing the subjects to relax. Then, each subject performed a run of each MI task while the cross re-appeared within the time interval, starting from 3.25 to the recording end, T s. The recordings were collected in six runs separated by short breaks, performing \(N_\lambda = 72\) trials per class and each one lasting \(T=7\) s. We validated two labeled scenarios: bi-class (left hand and right hand), and three-class (left hand, right hand, and foot). Testing is carried out using only the labeled trials with the removed artifacts.

Feature extraction of t-f EEG patterns. In the first case, the feature set is extracted from the Fourier decomposition method. So, provided the EEG sample frequency \(F_s{\in }\mathbb {R}^{+}\), the power spectral density (PSD) vector \({\varvec{s}} = \{s_f{\in }\mathbb {R}^{+}:f{\in }N_B\}\), with \(N_B = l\lfloor F_s/2\rfloor \), is estimated through the nonparametric Welch’s method that calculates the fast Fourier transform (FFT) algorithm on a set of \(M{\in }\mathbb {N}\) overlapping segments, which are split from the preprocessed EEG data vector \({{\varvec{x}}^c}\). Due to the non-stationary nature of EEG data, the piecewise stationary analysis is carried out over the set of the extracted overlapping segments that are windowed by a smooth-time weighting window \({\varvec{\alpha }}{\in }\mathbb {R}^\tau \) that lasts \(\tau {\in }\mathbb {N}\) (\(\tau <T\)), yielding a set of the time segments \(\{{{\varvec{v}}}^{m}{\in }\mathbb {R}^\tau : m{\in }M\},\) where \(v^m_t{\in }\mathbb {R}\) (\(t{\in }\tau \)) is tth element of \({\varvec{v}}^{m}\). So, the t-f patterns are extracted from EEG signals through the modified periodogram vector, \({\varvec{u}} = \{u_f{\in }\mathbb {R}^+\}\), \({\varvec{u}}{\in }\mathbb {R}^{N_B}\), computed as follows:

$$\begin{aligned} u_f = \mathbb {E}\left\{ \big |\sum \limits _{t\in \tau }{{{v}^{m}_t} \exp {( {-j{2\pi t f }} )}}\big |^2:\forall m\in M\right\} . \end{aligned}$$
(1)

Thus, the resulting PSD vector is computed with spectral components defined as \({s }_f = {u_f}/({M\nu }),\) being \(\nu = \mathbb {E}\left\{ |\alpha _t|^2:\forall t{\in }\tau \right\} ,\) and \(\mathbb {E}\left\{ \cdot \right\} \)—the expectation operator.

In the second case, the feature set is extracted from Continuous Wavelet Transform (CWT) that quantifies similarity between a given equally sampled time-series at time spacing \(\delta _{t}{\in }\mathbb {R}\) and a previously fixed base function \(\psi \left( \eta \right) \), termed mother wavelet ruled by a dimensionless parameter vector \(\eta {\in }\mathbb {R}\). Namely, each time element of the CWT vector \({\varvec{{\varsigma }}}^g{\in }\mathbb {C}^T\) is extracted from the preprocessed EEG time-series \({\varvec{z}}{\in }\mathbb {R}^{c}\) at scale \(g{\in }\mathbb {R}\) by accomplishing their convolution with the scaled and shifted mother wavelet in the form:

$$\begin{aligned} \varsigma _t^f= \sum \limits _{\tau \in T}{z_{\tau } \psi ^*\left( {(\tau -t)}\delta _t,{f}\right) }, \end{aligned}$$
(2)

where notation \((^{*})\) stands for the complex conjugate.

To build a picture showing amplitude variations through time in Eq. (2), both procedures of wavelet scaling g and translating through the localized time index \(t{\in }T\) are used. As a result, the extracted wavelet coefficients provide a compact representation pinpointing EEG data’s energy distribution in time and frequency domains. Therefore, the resulting CWT vector is computed with spectral components defined as \({s}_f = \mathbb {E}\left\{ \varsigma _t^f:\forall t{\in }\tau \right\} \).

Having extracted the feature set, we further compute a real-valued representative vector, \({\varvec{\rho }}^{r,\varDelta f} {\in }\mathbb {R}^{C}\) for each trial \(r{\in }R\), with electrode elements that accumulate the spectral contribution as follows:

$$\begin{aligned} {\rho }^{r,c} = {\sum \limits _{\eta _{\min } {\le } f {\le }\eta _{\max }}{ {s}^{r,c}_{f} } } ,\, f\in \varDelta f, \end{aligned}$$
(3)

where the frequencies \(\eta _{\min }\) and \(\eta _{\max }\) determine each one of the bandwidths of interest \(f{\in }\varDelta f\), respectively, within the most discriminating MI information is assumed to be concentrated.

Then, we map the multi-channel data per patient on a 2D surface, aiming to preserve the spatial interpretation of the extracted t-f patterns. In order to preserve the distance between electrodes in the 3D plane, we compute the topographic interpolation matrix across all trials, \( \{ {\varvec{S}} ({\varvec{\rho }}^{r,\varDelta f}){\in }\mathbb {R}^{S\,{\times }\, S'}:\forall r{\in }R\}\), through the projecting matrix that maps each EEG trial field, \({\varvec{\rho }}^{r,\varDelta f}\), as a 2-D circular view (looking down at the head top) using spherical splines that sizes \((S\,{\times }\, S')\)Footnote 2, as detailed in [34].

Motor imagery classification using Convolutional Neural Networks. The proposed CNN architecture contains three learning stages: (i) convolutional layer that holds a set of kernel filters, \(\{{\varvec{K}}_i{\in }\mathbb {R}^{K\,{\times }\, K}:i{\in }I\}\) (I is the number of used kernel filters), together with the corresponding bias vectors \(\{{\varvec{b}}_i{\in }\mathbb {R}^{SS'}\}\), which are applied by a sliding window across each topographic map \({\varvec{S}}({\varvec{\rho }}^{r,\varDelta f})\), yielding the convolution feature map as below:

$$\begin{aligned} {\varvec{\varXi }}^{r,i,\varDelta f}=\gamma _1\left( {\varvec{K}}_i\otimes {\varvec{S}}({\varvec{\rho }}^{r,\varDelta f}) +{\varvec{b}}_i\right) ,\, {\varvec{\varXi }}^{r,i,\varDelta f}{\in }\mathbb {R}^{S\,{\times }\, S'}, \end{aligned}$$
(4)

where \(\gamma _1(\cdot )\) is a non-linear activation function, and \(\otimes \) denotes the convolution operator. Of note, a zero-padding method is adopted to prevent losing the feature dimension, so that the output and input sizes of convolution mapping can be the same after the zero-padding procedure.

(ii) Pooling layer that is a down-sampling stage to reduce the dimension of output neurons in \({\varvec{\varXi }}^{r,i,\varDelta f}\) through a pool operator matrix \({\varvec{\bar{K} }}{\in }\mathbb {R}^{K'{\times }K'}\), with \(K'\le K,\) aiming at decreasing the computational burden and the over-fitting issue. Then, each down-sampled map \({\varvec{\bar{\varXi }}}^{r,i,\varDelta f}\) is rearranged into a vector form \({\varvec{\bar{\xi }}}^{r }{\in }\mathbb {R}^{G G'IN_f}\) (with \(G\le S,G'\le S'\)) by concatenating all matrix rows across \(\varDelta f\) and i domains.

(iii) A fully connected stage that includes a neural network with all neurons \({\varvec{h}}^r(q){\in }\mathbb {R}^{N_h(q)}\) connected directly to the outputs of preceding layer \(q{-}1\) as follows:

$$\begin{aligned} {\varvec{h}}^{r}(q)=\gamma _2\left( {\varvec{W}}(q){\varvec{h}}^{r}(q-1)+{\varvec{\beta }}(l)\right) ,\, q=\overline{2,Q}, \end{aligned}$$
(5)

where \({\varvec{h}}^{r}(1)\, =\, {\varvec{\bar{\xi }}}^{r}\), \({\varvec{W}}\), sizing \({G G'IN_f{\times }N_h(q)}\), is the weighting matrix that contains the connection weights between the preceding neurons and the hidden units \(N_h\) of layer q, \({\varvec{\beta }} (q){\in }\mathbb {R}^{N_h(q)}\) is the bias neuron, and \(\gamma _2(\cdot )\) is an activation function.

As a result, we obtain the output vector set \(\{{\varvec{y}}^{r} = {\varvec{h}}^{r}(Q)\},\) with \({\varvec{y}}^{r}{\in }[0, 1]^{N_\lambda }\), representing \(N_\lambda \) mutually exclusive classes, so that the last layer is tied to the output dimension (\(N_h = N_\lambda \)).

Due to the CNN-model training back-propagates the discriminating information, through the tied weights, from the hidden spaces in the input data, we propose to assess the relevance of input feature mappings, employing the matrix \({\varvec{W}}(q){\in }\mathbb {R}^{D{\times }N_h}\) that holds the row vectors \({\varvec{w}}_d^{q}{\in }\mathbb {R}^{N_h}\) with \(D{=}G G'IN_f\). Based on the fact that each \({\varvec{w}}_d^{q}\) measures the contribution of input features to build the hidden space \({\varvec{h}}^{r}(q)\), the relevance of d-th feature is assessed as the generalized mean of its corresponding reverse projection vector, that is, \(\varrho _d^{q} = \Vert {\varvec{w}}_d^{q}\Vert _{p}\), yielding the vector \({\varvec{\varrho }}^{q} = \{\varrho _{d}^{q}{\in }\mathbb {R}^{+}; \forall d{\in }D\}\), where notation \(\Vert \cdot \Vert _p\) stands for \({l}_p\)-norm. The obtained relevance vector \({\varvec{\varrho }}^{q}\) is reshaped into an estimated feature mapping matrix \(\widetilde{{{{\varvec{\varTheta }}}}} {\in }\mathbb {R}^{S{\times }S'}\) that is computed for each \(\varDelta f\) as follows:

$$\begin{aligned} \widetilde{{{{\varvec{\varTheta }}}}} = \phi \left( \mathbb {E}\left\{ \widetilde{{\bar{{\varvec{\varXi }}}}}_i:\forall i{\in }I\right\} \right) , \end{aligned}$$
(6)

where \(\widetilde{{\bar{{\varvec{\varXi }}}}}_i{\in }\mathbb {R}^{G{\times }G'}\) is the reconstructed feature mapping for i-th kernel filter, and \(\phi (\cdot )\) is an extrapolation operator that maps from \(G{\times }G'\rightarrow S{\times }S'\). In this way, the obtained \(\widetilde{{\varvec{\varTheta }}}\) highlights the spatial discriminative information projected from topographic maps.

3 Experiments

We validate the proposed CNN-based MI classification framework by appraising the following procedures: (i) preprocessing and extraction of t-f planes, evaluating the extraction methods of power spectral density and continuous wavelet transform, for which the corresponding parameter tuning is carried out; (ii) tuning of CNN architecture for MI discrimination, evaluating the spatial dropping algorithm proposed for preserving the interpretation of the extracted 2-D features. Two approaches for dropping are appraised: removing all electrodes out of the sensorimotor area before training and validation, and thresholding the electrode contribution after training and validation.

Extraction of t-f feature patterns. Each channel recording, \({\varvec{x}}^{c}{\in }\mathbb {R}^{T},\) is split into \(N_\tau = 5\) segments, \(\{{\varvec{x}}^{c}{\in }\mathbb {R}^{\tau },\, \tau <T\}\), using a sliding window approach with a segment length \(\tau = 2\) s with overlap \(\delta \tau = 1\) s. Within each segment \({\varvec{x}}^{c}\), PSD estimates are computed, fixing the following parameters: \(\tau = 256, {\delta _\tau = 0.9\tau }\). Likewise, we compute the CWT vector \({\varvec{{\varsigma }}}^g\), selecting the Morlet wavelet as \(\psi \) that is frequently used in spectral analysis of EEG signals [35]. So, we extract the continuous wavelet coefficients within each time segment using a complex Morlet wavelet, adjusting the scaling value to \(g = 16\) and the sampling period to \(1/\varDelta t\).

For either method of feature extraction, we perform validation in four different scenarios for spectral bandwidths of interest \(f{\in }\varDelta _f\): A) \(\mu \), B) \(\beta \), C) \(\mu \cup \beta \), and D) \(\mu \cup \beta _{{\text {low}}}\cup \beta _{{\text {med}}}\cup \beta _{{\text {high}}}\).

Proposed CNN architecture for MI discrimination. The adopted multiple input CNN model is based on the non-sequential Wide&Deep neural network [36] that performs learning of deep patterns (using the deep path) under simple rules (through the short path), having the following units (Fig. 1):

Fig. 1
figure 1

The proposed CNN structure that is based on Wide&Deep neural network handling multiple inputs. The first layer (IN1) is the input, the second (CN2) and third layers (MP3) are hidden and accomplish the feature mapping generation, while the next block (ranging from the output of layer CT4 to the OU9 layer) comprises the classification stage

  • IN1: Input layer that holds an image set sizing \(42\, {\times }\, 56\).

  • CN2: Convolutional layer (first hidden layer). We use two spatial filters that perform two resulting feature maps, sizing \(42 \,{\times }\, 56\). Each convolution kernel has a size of \(3 \,{\times }\, 3\), using a stride of one sample. In addition, this layer incorporates a rectified linear unit ReLU through the activation function \(\gamma _1(\cdot )\) [37].

  • MP3: Max-pooling layer (second hidden layer). This layer sub-samples the resulting mapping that picks up the maximum value of each feature map to reduce the number of output neurons, also using a stride of one sample. Thus, each feature mapping in CN2 is down-sampled using a pool size of \(2 \,{\times }\, 2\), resulting in a matrix of size \(21 \,{\times }\, 28\).

  • CT4: Concatenate layer, linking together of all resulting MP3 feature maps into a single block.

  • Fl5: Flatten layer that arranges the set of concatenated feature maps from CT4 into a single 1D array. So, the map is vectorized into a one-dimensional array of size \((21)(28)(2)(4) = 1176\) points, resulting from 2 spatial filters, and 4 bandwidths of interest.

  • Batch normalization (BN) layers (BN6 and BN8) that address the vanishing and exploding gradient problems presented in fully connected networks. To cope with this issue, all inputs of the previous layer at each batch are zero-scored, holding the mean activation close to 0 and the activation standard deviation close to 1.

  • FC7: Fully connected layer (third hidden layer) that is linked to each neuron of OU9, holding \(h_u\) neurons for which the weight values are regularized through the parameters (\(l_1\), \(l_2\)) using the Elastic Net regularization. According to [38], Elastic Net is used for preventing over-fitting by penalizing a model having large weights, and can be used more naively, e.g., when little prior knowledge is available about the dataset. This layer uses a rectified linear unit ReLU as the activation function \(\gamma _2(\cdot )\). The following parameter setting of FC7 is fixed:

  • Number of neurons are fixed through an exhaustive grid search within \(h_u = [50,100,\ldots ,550]\).

  • The learning rate is fixed at \(lr = \exp ({-3})\).

  • The optimizer used is the Adam algorithm and the loss function used is the mean squared error (MSE).

  • The regularization parameters \(l_1\) and \(l_2\) are tuned by a grid search around [0.001, 0.01, 0.1].

  • OU9: Output layer having two or three neurons, each one representing either task label (left hand, right hand or foot). This layer is fully connected to FC7 and uses the softmax procedure as the activation function \(\gamma _2(\cdot )\).

Evaluating metrics of classifier performance. As a measure of performance, the classifier accuracy \(a_c{\in }\mathbb {R}[0,1]\) is computed as follows:

$$\begin{aligned} a_c =\frac{T_P + T_N}{T_P + T_N + F_P + F_N}, \end{aligned}$$
(7)

where \(T_P\), \(T_N\), \(F_P\) and \(F_N\) are true-positives, true-negatives, false-positives, and false-negatives, respectively.

Besides, the kappa value, \(\kappa {\in }\mathbb {R}[0,1]\), is computed to evaluate the accuracy performance when removing the impact of random classification as follows [39]:

$$\begin{aligned} \kappa = ({a_c-p_e})/({1-p_e}),\, \end{aligned}$$
(8)

where \(p_e = 0.5\) for bi-label problems.

A cross-validation scheme is performed to evaluate CNN-based classifier performance. Thus, the set of training trials per subject is randomly partitioned using a stratified tenfold cross-validation to generate the set of validation trials. This procedure is repeated ten times by shifting the test and training dataset.

4 Results

Performed bi-class accuracy of extracted t-f planes. Initially, we discuss the classifier performance of the computed PSD vectors of contribution, \({\varvec{\rho }}^{r,\varDelta _f}\). In each one of the tested scenarios for spectral bandwidths of interest, parameter tuning is carried out to reach the maximum accuracy within the MI interval \([3{-}5]\) s. As seen in Table 1, the use of only one rhythm (\( \mu \) or \( \beta \)) is not sufficient to reach the best values of accuracy. Moreover, the \(\beta \) waveform drops to \(80\%\). Their combination \(\mu \cup \beta \) barely helps the classifier rule. Thus, the last validating scenario (i.e., D) reaches the best performance on average across all subjects, meaning that the inclusion of more detailed information of \(\beta \) sub-bands allows improving the accuracy of PSD vectors. Concerning the individual performance, the subjects A02T, A01T, A04T, and A05T achieve the lowest values, while A08T, A09T, and A03T accomplish the best results. Regarding the CWT-based contribution vectors, the bottom part of Table 1 shows that the use of every spectral bandwidth scenario allows enhancing the performed results, but without statistical difference between them when averaging across the subject set. Furthermore, the bi-class accuracy of CWT-based vectors is comparable to that obtained by the best case of PSD-based extraction vectors, having a very similar ranking of individual performance.

Table 1 Performed bi-class accuracy within the MI segment using the whole electrode montage (\(C{=}22\))
Fig. 2
figure 2

Dependence of CNN hidden units and the individual bi-class accuracy. Label “optimal” is the individually tuned CNN model

In terms of the tuned CNN parameters, their values averaged across the subject set show that the training scenario, achieving the best accuracy (\(\mu {\cup } \beta _{{\text {low}}} {\cup } \beta _{{\text {med}}} {\cup } \beta _{{\text {high}}}\)), demands from the PSD-based vectors more hidden units \(h_u\) than in the case of CWT planes. A similar situation holds in the scenario \(\mu {\cup }\beta \) that also performs high accuracy. When extracting the t-f vectors from a single rhythm (\(\mu \) or \(\beta \)), the PDS-based representation demands less hidden units but achieves lower accuracy.

Figure 2 displays the dependency of CNN hidden units on the obtained bi-class accuracy. Compared to the best score achieved by the individually tuned value of \(h_u\), the deterioration in performance is noticeable (nearly \(5\%\)) when decreasing the number of units in every trained CNN model. At the same time, the computational burden can reduce, on average, about a quarter time. Moreover, the variations in accuracy by changing the amount of \(h_u\) indicate a similar complexity for both measured extraction approaches.

Fig. 3
figure 3

Reconstructed topographic maps for the bi-class experiment D including all bandwidths across the time domain in the best (A08T) and the worst (A02T) subjects, left and right column, respectively

Interpretability of brain areas activated by MI tasks. Intending to give the interpretability of the extracted input t-f vectors, we represent the feature mapping graphically (topoplot) to highlight the spatial distribution of the assessed discriminative ability. Each topoplot depicts the proposed assessment \({\varvec{\tilde{\varTheta }}} ({\varvec{\rho }}^{r,\varDelta f})\) computed in Eq. (6) in which we reconstruct the input feature image from the trained CNN weights to estimate the contribution of the electrodes, under the assumption that the higher the reconstructed weight, the more critical the discriminating strength between the electrodes. Of note, the interpolated values falling out of the electrode space are assumed as meaningless. This situation may arise because the network initializes the weight set with random values, including the background pixels. Therefore, the variability and reduced signal-to-noise rate result in false augmentation of background localities, as subjects reach low discrimination ability.

The top row of Fig. 3 displays the PSD-based spatial distribution reconstructed for the best training scenario (D) within each time segment. As seen, the topoplots of A02T (the worst individual) present the spectral bandwidths contributing much alike with values mainly spread all over space, including places outside of the electrode space. In addition, the contribution estimates are low and tend to be noisy. Another fact to mention is that brain activity notably increases within the last time segment, for which the MI activity is thought to have already vanished. By contrast, the best-achieving subject A08T has some relevant localities, which gather in places of either brain hemisphere and within the MI interval, fading at the time window \( [4{-}6]\) s.

In turn, the bottom row depicts the CWT-based topoplots assessed by the same training scenario (i.e., D), showing that the obtained spatial distribution of A02T still presents the spectral bandwidths that contribute similarly. However, several spatial clusters appear, and the amount of meaningless estimates decreases. Nevertheless, a notably enhanced topographic representation is performed by A08T, for which the CWT-based vectors result in values adequately accommodated within the electrode space, regardless of the window time. Furthermore, the contribution concentrates on the electrode neighborhoods clearly defined, changing over time. Thus, the \(\mu \) rhythm shows that the sensorimotor electrodes contribute the most, being more evident their importance at the window \([3{-}5]\) s, right at the MI period.

Fig. 4
figure 4

Individual relevance weights performed by the tuned CWT feature extraction for the bi-class scenario C: Without spatial dropping (left column), spatial dropping by removing all sensorimotor electrodes out (middle column), spatial dropping by excluding nonrelevant electrodes (right column)

Figure 4 (left column) displays the topoplots individually computed for the CWT-based feature extraction under the scenario C (i.e., \(\mu {\cup }\beta \)), showing that the brain activity tends to gather over some electrodes in most of the subjects. Also, the brain activity between neighboring time windows changes smoothly, at least in subjects performing high accuracy (i.e., A03T and A08T). As the discrimination ability of individuals decreases, the topographic representations become more blurred, meaning that the learned weights are still severely affected by the variability captured by EEG data. This situation is more visible in A05T (performing the worst) with much learning weights out of the scalp area, evidencing that the CNN model is likely to be overtrained.

Performance of spatial dropping strategies. Two approaches are evaluated—(i) removing all electrodes out of the sensorimotor area before training and validation, and (ii) thresholding the electrode contribution after training and validation.

The first spatial dropping strategy is implemented by simply including all electrodes belonging to the motor cortex region (that is, C3,9,10,11,C4,14,15,16,17,18), following the spatial electrode distribution reported by [40]. Figure 4 (middle column) depicts the estimated topoplots of the two best and worst-performing subjects, showing that the brain activity gathers more prominently over some lateral sensorimotor electrodes in most of the subjects. Moreover, the brain activity between neighboring time-windows changes smoothly with the highest contribution within the segments of MI ([\(2{-}4\)] and \([3{-}5]\) s). In the first couple of subjects (A08T and A03T), the contribution of either rhythm (\(\mu \) or \(\beta \)) differs. Besides, the number of learning values out of the scalp is considerably smaller than in the previous case. Still, the topographic representations of the subjects with the worst accuracy (A02T and A05T) remain blurred.

Table 2 Bi-class accuracy of evaluated CNN training strategies, using the CWT-extracted vectors and either dropping strategy: CWT* with sensorimotor electrodes and CWT** with thresholding. In all compared cases, both sub-bands (\(\mu \) and \(\beta \)) are included and the CNN parameters are tuned individually

Concerning the second dropping strategy, Fig. 4 (right column) represents the thresholded values, showing the presence of several electrodes with a relevant contribution. Thus, the top pair of subjects holds the learned weights located on the lateral zones, having the highest contribution near the sensorimotor area with differentiated behavior between \(\mu \) and \(\beta \) rhythms. As expected, the central localities near the longitudinal fissure have zero-valued weights. However, as the individual performance decreases, the number of relevant electrodes increases due to the increased variability. Moreover, the variance of the captured EEG data for the worst-performing subjects is so strong that they have a distorted topoplot with values out of the scalp. Still, these subjects present relevant electrodes, unlike the previous approaches achieved.

Table 2 summarizes the bi-class performance achieved by each evaluated CNN-based framework, showing that every subject reaches a performance above \(\sim 75\%\). All achieved accuracy scores are competitive with other values performed by CNN-based approaches recently presented for motor imagery classification (left and right hand). It is worth noting that the use of either spatial dropping strategy results in small degradation of classifier accuracy or \(\kappa \) value.

Performance of three-class MI tasks. Further, we evaluate the proposal in a more complicated classification scenario, conducting testing for the following three-class discrimination framework of motor imagery tasks: left hand, right hand, and foot. Table 3 summarizes the classifier performance reported by two state-of-the-art approaches to the three-class discrimination, showing that the proposed approach provides very competitive outcomes (above 71%) and enhancing the accuracy of the low-performing subjects. One aspect to remark is that the values of multi-class accuracy and \(\kappa \) tend to fall, compared to the bi-class scenario, partially because of the small database evaluated.

Table 3 Three-class accuracy of evaluated CNN training strategies using the CWT-extracted vectors for the considered strategies: without spatial dropping (CWT), spatial dropping by removing all sensorimotor electrodes out (CWT*), spatial dropping by excluding nonrelevant electrodes (CWT**)
Fig. 5
figure 5

Relevance weights computed for representative individuals (best-performing A03T and A08T and worst-performing A02T and A05T) in terms of performed three-class accuracy for scenario C: without spatial dropping (left column), spatial dropping by weighing only all sensorimotor electrodes (middle column), spatial dropping by excluding nonrelevant electrodes (right column)

As in the binary classification task, we analyze the interpretability of brain areas activated by each MI task based on the reconstruction of the learned CNN weights. When assessed by the CWT-based feature extraction, Fig. 5 displays the reconstructed topoplots of scenario C, having a more distinct electrode contribution than in the bi-class case. If not using the spatial dropping, the dyad of the best-performing subjects shows an increase in neural activity within the motor imagery interval (\([3{-}5]\) s). However, this behavior is not evident in the worst-performing pair. Moreover, subject A02T has a response postponed to the segment (\([4{-}6]\) s).

In the next case of sensorimotor dropping, the middle column shows that the better the accuracy, the more compact the electrode contribution. Thus, the method assesses the motor cortex’s regular contribution through the whole record, regardless of the evoked activity. This kind of scattered representation implies high intrasubject variability. Then, the spatial dropping by excluding nonrelevant electrodes (right column) enhances the interpretation of the learned CNN weights, yielding a lower number of contributing electrodes, but more meaningful.

Lastly, we evaluate the significance of learning CNN weights in terms of the disagreement of performing the individual accuracy, using the considered dropping strategies: spatial dropping by weighing only all sensorimotor electrodes (CWT*) and spatial dropping by excluding nonrelevant electrodes (CWT**). To this, we conduct the paired Welch’s t-test, employing the scores achieved on the cross-validation folds and holding a significant level of a p-value\(<0.05\). In this case, the non-rejection of the null hypothesis (identical average scores) is the desired behavior to prove that our relevance approaches (CWT* and CWT**) do not differ from CWT (without spatial dropping). Table 2 shows that only a couple of subjects (namely, A08T with \(p = 0.159\) and A07T with \(p = 0.133\)) that are underlined are close to \(p<0.15\). In turn, Table 3 presents a confident difference in performance for subject A02T (underlined subject) with a \(p = 0.039\). This result may be explained since A02T reports a low-performance with high variability along the folds. Hence, the sensorimotor region is not sufficient to code discriminant information about this subject.

5 Discussion and concluding remarks

We present an approach using CNN models to improve the interpretability of spatial contribution in discriminating between MI tasks, preserving an adequate classification accuracy. The results obtained for BCI Dataset 2a prove that the proposed deep learning framework allows improving accuracy along with revealing the electrodes with higher spatial relevance. Nevertheless, the following aspects are to be regarded in the framework implementation:

Feature extraction of t-f vectors. For each estimated source, the t-f sets are extracted within each time window, generating an image containing temporal, spectral, and spatial information. Intending to deal with the nonstationary EEG nature, we evaluate the extraction of t-f patterns from the FFT-based periodogram and continuous wavelet transform. Then, all extracted t-f feature patterns are further interpolated to obtain the spatial distribution of activated brain areas through topographic maps. We obtain that both approaches are similar in terms of providing classifier performance and the complexity of implementing CNN models. Besides, we evaluate four combining scenarios of \(\mu \) and \(\beta \) rhythms, which differently influence the achieved accuracy. In the case of PSD estimates, only the inclusion of detailed information from three \( \beta \) sub-bands together with \( \mu \) waveform provides the best system accuracy. By contrast, the CWT-based feature set gives high accuracy scores regardless of the evaluated sub-band combination. This result may be explained by the fact that CWT is more suitable for the decomposition of nonstationary data.

Nonetheless, the CWT-based vectors are preferable for interpretation purposes because the learned weights gather around electrode neighborhoods, forming more clearly defined spatialities with relevant neural activity. Moreover, the CWT-based weights smoothly change over time following the implemented MI paradigm timing. One more aspect of highlighting is that the learned weights are less sensitive to the overtraining effect.

Spatial interpretability of activated MI responses. Another aspect to remark is the dropout algorithm that CNN models include. Their high number of parameters makes them particularly prone to over-fitting, demanding the use of regularization methods. In addition, neural network training or inference can involve randomly modifying parameters [45]. To cope with this issue, the spatial dropout algorithm can withdraw an entire feature map across a channel since adjacent pixels are highly correlated to the dropped pixels [46, 47].

Relying on the fact that the interpretation of evoked brain zones can be performed by preserving spatial information in input multi-spectral images, we evaluate two spatial dropping strategies to promote discarding of irrelevant image details: including just the sensorimotor electrodes, and thresholding of the electrode contribution. Although the number learned values out the scalp decreases considerably in the former strategy, the topographic representations of subjects having low accuracy are still blurred, hindering the interpretation of analyzed brain activity. The use of full-set EEG electrodes has been already reported as difficult to achieve in practical MI applications, suggesting that the performance of CNN models can improve with fewer electrodes, which cover the motor cortex and sensorimotor cortex [48]. The obtained results show that the thresholding strategy is desirable since the highest contribution clusters over the sensorimotor area with differentiated behavior between \(\mu \) and \(\beta \) bands. However, the high EEG data variability captured by the worst-performing subjects may still produce distorted topoplots with values out of the scalp, making difficult their understanding.

Evaluated CNN architecture for MI discrimination. The first design consideration is the number of convolutional layers, together with the type of end classifier. In MI tasks, 70% of CNN models use a rectified linear unit (ReLU) as the layer’s activation function, while the vast majority of classifier fully connected layers employ a softmax activation function [49]. The proposed network relies on the Wide&Deep architecture for handling multiple inputs to learn deep patterns under simple rules. With the purpose of increasing the neurophysiological reliability of feature interpretation, the Classifier Block includes batch normalization applied to the convolutional outputs before and after the fully connected layer FC7, improving the performance on unseen examples [50]. We also use the Elastic Net regularization technique through the parameters (\(l_1\), \(l_2\)) for preventing over-fitting by penalizing a model with large weights.

Spatial dropping of multi-class settings. The spatial dropping algorithm is evaluated in two labeled scenarios of MI tasks: bi-class and three-class, resulting in meaningful topographic representations and performing values of accuracy very competitive with the results reported by similar CNN-based architectures. However, the achieved values of multi-class accuracy and \(\kappa \) tend to fall, compared to the bi-class scenario. This behavior can be partially explained by the small database evaluated and the reduced set of scalp electrodes.

However, some restrictions are to be mentioned: the first limitation to enhance the performance of the evaluated CNN architecture is the small size of the examined dataset that holds just nine subjects with very different variability [51]. As a result, the deterioration in performance is noticeable (nearly \(5\%\)) when decreasing the number of units in each individual trained CNN model. Moreover, the small data issue restricts the application of robust approaches in deep learning like augmentation or transfer learning, causing over-fitting. Another concern is the adequate sampling of the potential scalp field for the topographic analysis that requires a large number of electrodes [52].

As future work, to enhance the impact of tested Deep Learning models, we plan to employ datasets that hold more labeled MI tasks, fusing CNNs with different characteristics and architectures is also to be considered to learn more complex relationships between spatial patterns and extracted t-f representations, making the learned CNN weights more accessible to interpret [53, 54].