1 Introduction

EEG signals provide information about the electrical activity of the brain, which plays a vital role in many applications and systems designed to improve the quality of life of people with disabilities (Wolpaw et al. 2000). Examples of such applications are communication, neuro-prosthetics, seizure detection, sleep state classification and environmental control for disabled persons using robots and manipulators. Processing and analysis of EEG signals generally consist of a signal acquisition part followed by feature extraction and classification techniques, as shown in Fig. 1.

Fig. 1 Steps of EEG signal processing and analysis

The signal acquisition part shown in Fig. 1 can be carried out on the scalp surface or in the interior of the brain. In a typical EEG signal acquisition system, a number of electrodes are used as sensors to record the voltage level. These electrodes can be invasive, implanted inside the skull, or non-invasive, mounted on the surface of the scalp. Placing these electrodes over a number of points facilitates recording EEG data from more than one location on the brain. Once the data is recorded, it is processed to remove noise and artefacts resulting from body movement of the subject, outside electrical interference, and electrode pop, contact and movement (Shao et al. 2009).

To differentiate between brain states, the signals are transformed or filtered to find the variables that effectively separate the different states; this process is known as feature extraction. The purpose of EEG feature extraction is to generate discriminative features from channel data that can increase the variance difference between classes. The last part is to efficiently classify the EEG signals and generate decision signals. A device that uses brain signals to control and operate the environment is called a Brain Computer Interface (BCI) system (Wolpaw et al. 2002). In the last two decades, the BCI has gained great significance in industry and scientific institutes because of its numerous benefits (Ortiz-Rosario and Adeli 2013). The main advantage of using a BCI is its ability to reduce the risk to human life and to increase the quality of life (improve daily living activities) of disabled persons.

A BCI system can be driven by many signal sources, which fall into the categories of invasive and non-invasive inputs. For invasive input, where electrodes are implanted beneath the scalp, electrocorticography (ECoG) (Hill et al. 2006b), single micro-electrode (ME), micro-electrode array (MEA) (Van Gerven et al. 2009) and local field potentials (LFPs) (Moran 2010) can be used to acquire signals. Electroencephalography (EEG) (Schalk et al. 2004), magnetoencephalography (MEG) (Lal et al. 2005), functional magnetic resonance imaging (fMRI) (Weiskopf et al. 2004) and near infra-red spectroscopy (NIRS) (Coyle et al. 2007) are examples of non-invasive inputs. EEG-powered BCIs are the most reliable and most frequently used because of their easy availability, non-invasiveness, low cost and high temporal resolution (Coyle et al. 2007; Hill et al. 2006a). A generic graph of temporal resolution against spatial resolution for different signal acquisition techniques is presented in Fig. 2.

Fig. 2 Comparison of spatial and temporal resolution (Hill et al. 2006b)

EEG signals for BCI can be acquired in different ways. Some methods require an event to generate the EEG signals and some are event independent. Motor Imagery (MI) is one of the methods used to generate EEG signals that are related to motor movements. MI-based EEG signals have been applied in many BCI applications, where these signals are used to open an interface with the external environment (Pineda 2005). These signals can be extracted from different positions over the brain. The most widely used system for electrode placement is the International 10–20 system, recommended by the International Federation of Societies for Electroencephalography and Clinical Neurophysiology (IFSECN). The placement of EEG electrodes is also related to the brain activities and is application dependent (Daly and Pedley 1990).

Most of the useful brain state information lies in low-frequency bands; therefore, the EEG frequency range is divided into six bands. Each frequency band corresponds to different functions of the brain. The frequency range of 0–4 Hz is called the delta band and is usually related to deep sleep and to states that require continuous attention (Sharbrough 1991). The theta band ranges from 4 to 7 Hz and normally corresponds to drowsiness or a state in which the body is asleep and the brain is awake (Kirmizi-Alsan et al. 2006). Alpha waves (8–15 Hz) are normally found in relaxed, eyes-closed states (Niedermeyer 1997). The beta band ranges from 16 to 31 Hz and contains information about active concentration and working (Pfurtscheller and Da Silva 1999). Frequencies of 30 Hz or more form the gamma band, which relates to sensory processing and short-term memory matching (Kanayama et al. 2007). The last band, mu, ranges from 8 to 12 Hz; it overlaps the alpha band but is recorded over the sensorimotor cortex and contains information about motor neurons in the rest state (Gastaut 1952). To analyse certain states or illnesses, brain waves are processed for the proper diagnosis of a particular state.
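The band boundaries quoted above can be collected into a small lookup table. The following Python sketch is illustrative only: the names `EEG_BANDS` and `bands_for`, and the choice to treat both band edges as inclusive, are assumptions made here rather than part of the cited works. Because the quoted ranges overlap (mu lies inside alpha, and beta and gamma share 30–31 Hz) and leave small gaps (7–8 Hz and 15–16 Hz), a frequency may map to several bands or to none.

```python
import math

# Band edges in Hz as quoted in the text; math.inf marks the open-ended gamma band.
EEG_BANDS = {
    "delta": (0.0, 4.0),
    "theta": (4.0, 7.0),
    "alpha": (8.0, 15.0),
    "beta": (16.0, 31.0),
    "gamma": (30.0, math.inf),
    "mu": (8.0, 12.0),
}


def bands_for(freq_hz):
    """Return the names of all bands whose (inclusive) range contains freq_hz."""
    return [name for name, (low, high) in EEG_BANDS.items() if low <= freq_hz <= high]


for f in (2.0, 10.0, 30.5, 7.5):
    print(f, bands_for(f))
```

In practice, the band edges would be adapted to the convention of the dataset or toolbox in use, since different authors quote slightly different limits.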

Development of an on-line BCI faces many challenges, including computational cost, equipment cost and classification accuracy. To address these challenges, researchers have suggested different algorithms. For example, signal pre-processing, feature extraction and the selection of an appropriate classifier help to increase the classification accuracy of a BCI. Multi-channel EEG offers a variety of possibilities to applications, but selecting specific channels is better suited to a given outcome. Researchers usually ignore the channel selection part while developing a real-time system. Without a channel selection algorithm, the data are noisier and redundant channels increase the computational and equipment cost of a BCI system. Channel selection also improves or stabilizes the classification results (Faller et al. 2014). The same problem also appears in feature extraction. Therefore, it is vital to opt for an effective solution that selects the optimal number of channels rather than using all channels for processing and classification. Some researchers have used feature selection algorithms after applying channel selection algorithms to further improve system performance (Handiru and Prasad 2016). Channel subsets are selected based on criteria that usually incorporate all aspects, including channel location, dependency and redundancy (Garrett et al. 2003).

It is important to note that a viable BCI system for medical applications relies heavily on efficient algorithms for accurate prediction of any disease or abnormal physiological condition. Hence, the channel selection part plays a vital role in designing efficient algorithms. Selecting a minimum number of channels decreases the computational complexity and cost, and results in low-power hardware design. Considering the importance of the channel selection process in BCI systems, we studied recently developed channel selection schemes based on filtering techniques for MI-based EEG signals. The survey focuses on motor imagery applications. The explanation and discussion are supported by flowcharts and tables to give the reader a clear understanding. A clear and informative comparison of different channel selection algorithms is given based on classification algorithm, channel selection strategy and dataset. In addition, the performance of the discussed methods is given in terms of classification accuracy and number of channels selected, to assist BCI programmers and researchers in choosing appropriate selection algorithms for different applications.

The rest of the paper is structured as follows: Sect. 2 deals with channel selection techniques in general. Channel selection algorithms, and specifically filtering techniques of channel selection for MI-based EEG, are described in Sect. 3. Section 4 is devoted to discussion and conclusion.

2 Channel selection techniques

EEG-based BCI has become a hot field of research during the last few decades due to the availability of EEG signal acquisition systems. As the EEG equipment is cheap compared to other systems, it allows us to record data related to brain activity with a large number of channels. A large number of channels allows researchers to develop techniques for selecting optimal channels. The objective of these algorithms is to improve computation time, classification accuracy and identify channels best suited to a certain application or task.

The algorithms used for EEG channel selection are derived from feature selection algorithms available in the literature. Selecting the optimal subset of features is known as feature selection; using these algorithms to select channels is therefore known as channel selection. In channel selection, features are extracted for classification after the optimal channel set has been selected. In feature selection, however, the optimal feature set is supplied directly to the classification algorithm. In filtering channel selection techniques, the features extracted from the optimal channel subset may or may not produce good results. On the other hand, for wrapper and hybrid channel selection techniques, feature extraction and classification are part of the selection procedure, so they produce the best results, which, however, are classifier and subject specific.

The quality of a feature subset is evaluated by a criterion. The number of features is directly proportional to the dimensionality of the signal, so higher-dimensional signals have more features. Finding an optimal subset is difficult and known to be an NP-hard problem (Blum and Rivest 1989). Typical feature selection algorithms involve four stages, as shown in Fig. 3. In the subset generation stage, a search strategy such as complete search, random search or sequential search generates candidate feature subsets for evaluation. Each new candidate subset is evaluated and its result is compared with the previous best one according to the evaluation criterion.

Fig. 3 Feature selection process

If the new subset has better results compared to the previous one, the new candidate will take the position of the previous subset. The subset generation and evaluation process will be repeated until a stopping criterion is fulfilled. Then the selected optimal subset is evaluated using some classifier or prior knowledge of the problem.

Feature selection processes can be found in many fields of machine learning and data mining. In statistics, feature selection is also called variable selection, and these techniques are also used for sample and channel selection (Liu and Yu 2005). There are four main types of techniques available to evaluate feature subsets, namely filtering, wrapper, embedded and hybrid techniques, which are discussed in further detail in the following subsections.

figure a

2.1 Filtering techniques

Filtering techniques use an independent assessment criterion, such as a distance measure, dependency measure or information measure, to evaluate the subset generated by a search algorithm (Guyon and Elisseeff 2003; Dash and Liu 1997). Most filtering techniques are fast, stable and independent of the classifier, but of low accuracy (Chandrashekar and Sahin 2014). Algorithm 1 shows a generalized algorithm for filter-based channel selection. Let D be the given data set and \(S_{0}\) the initial candidate subset generated by a search algorithm; \(S_{0}\) can be an empty, full or randomly selected subset, and the search proceeds from it through the whole data set according to a particular search strategy. An independent measure M is used to assess each generated candidate subset S. After evaluation, the subset is compared with the previous best subset and, if it is better, it becomes the current best subset. This process continues iteratively until a stopping criterion \(\delta \) is met. The final output of the algorithm is the last best subset, denoted by \(S_{best}\). The commonly used measures of relevancy and dependency are discussed below.
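As a rough illustration of the generalized filter procedure described above, the following Python sketch uses a greedy forward search starting from an empty subset. It is one possible instance under stated assumptions, not a transcription of Algorithm 1: the function `filter_select`, the toy measure `toy_measure` and the relevance values in `RELEVANCE` are hypothetical, and the stopping criterion is taken to be "no improvement in M, or a maximum subset size reached".

```python
def filter_select(candidates, measure, max_size):
    """Greedy forward search driven by an independent measure M.

    candidates: list of channel (or feature) names.
    measure:    callable M(subset) -> float, higher is better.
    max_size:   part of the assumed stopping criterion.
    """
    best_subset = []                      # S_0: start from the empty subset
    best_score = measure(best_subset)
    while len(best_subset) < max_size:
        # Generate candidate subsets by adding one unused element.
        trials = [best_subset + [c] for c in candidates if c not in best_subset]
        if not trials:
            break
        subset = max(trials, key=measure)
        score = measure(subset)
        if score <= best_score:           # no improvement: stop
            break
        best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical relevance scores; a real measure M would be computed from data.
RELEVANCE = {"C3": 0.9, "C4": 0.8, "Cz": 0.3, "Fp1": 0.05}


def toy_measure(subset, penalty=0.1):
    """Total relevance minus a fixed cost per selected channel."""
    return sum(RELEVANCE[c] for c in subset) - penalty * len(subset)


selected, score = filter_select(list(RELEVANCE), toy_measure, max_size=4)
print(selected, round(score, 2))
```

Replacing `toy_measure` with a correlation, mutual information or chi-squared score, and replacing the forward search with a backward or random search, gives other members of the same family.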

2.1.1 Correlation criteria

The simplest and most commonly used criterion for measuring the correlation between a variable and the target is the Pearson correlation coefficient (Guyon and Elisseeff 2003):

$$\begin{aligned} R(i)=\frac{\mathrm{cov}(x_i,Y)}{\sqrt{\mathrm{var}(x_i)\,\mathrm{var}(Y)}} \end{aligned}$$
(1)

where \(x_i\) is the ith variable and Y is the output target class. Correlation criteria can detect only linear dependencies between a variable and the output.
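A minimal sketch of ranking variables by Eq. (1) is given below in plain Python. The helper names `pearson` and `rank_by_correlation`, the use of the absolute value of R(i) for ranking, and the toy data are assumptions for illustration; the normalization factors of the covariance and the variances cancel, so they are omitted.

```python
import math


def pearson(x, y):
    """Pearson coefficient R = cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0                        # constant input: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Sort feature names by |R(i)|, most correlated first."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): four trials, binary class label.
target = [0, 0, 1, 1]
features = {
    "ch1": [1.0, 2.0, 3.0, 4.0],          # increases with the class
    "ch2": [1.0, -1.0, 1.0, -1.0],        # unrelated to the class
    "ch3": [0.0, 0.0, 1.0, 1.0],          # identical to the class
}
print(rank_by_correlation(features, target))
```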

2.1.2 Mutual information (MI)

Mutual information is a measure to find the dependency between two variables (Lazar et al. 2012). The Mutual Information is calculated as:

$$\begin{aligned} I(X,Y)=H(Y)-H(Y|X) \end{aligned}$$
(2)

where I is the mutual information, H(Y) is the entropy of Y and H(Y|X) is the conditional entropy of Y given an observation of X. The mutual information is zero if X and Y are independent and greater than zero if they are dependent. Some researchers also use the Kullback–Leibler divergence (Torkkola 2003) to measure the mutual information between two densities. After the mutual information has been measured, the features are ranked and selected using a threshold. A drawback of this approach is that it ignores the mutual information between features. Another common variation used in the literature is conditional mutual information (Fleuret 2004).
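For discrete variables, Eq. (2) can be evaluated directly from empirical frequencies. The sketch below is a plain-Python illustration under that assumption (continuous EEG features would first need to be discretized or handled with a density estimator); the function names are chosen here for illustration, and entropies are measured in bits.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy H(V) in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """Empirical H(Y|X): entropy of y within each group of equal x, weighted by group size."""
    n = len(x)
    total = 0.0
    for x_value, count in Counter(x).items():
        y_group = [yi for xi, yi in zip(x, y) if xi == x_value]
        total += (count / n) * entropy(y_group)
    return total


def mutual_information(x, y):
    """I(X, Y) = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(y, x)


labels = [0, 0, 1, 1]
print(mutual_information([0, 0, 1, 1], labels))   # x determines the label
print(mutual_information([0, 1, 0, 1], labels))   # x carries no label information
```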

2.1.3 Chi-squared

The chi-squared statistic is a univariate filter that evaluates each feature independently. The higher the chi-squared value, the more relevant the feature is with respect to the class (Greenwood and Nikulin 1996). The chi-squared statistic for a feature is measured as:

$$\begin{aligned} \chi ^2=\sum ^V_{i=1} \sum _{j=1}^B \frac{\left[ A_{ij}-R_i B_j/N\right] ^2 }{R_i B_j/N} \end{aligned}$$
(3)

where V is the number of intervals, N is the total number of instances, B is the total number of classes, \(R_i\) is the number of instances in interval i, \(B_j\) is the number of instances in class j, and \(A_{ij}\) is the number of instances in interval i and class j. Various other techniques are used in the literature to validate feature subsets and classifiers for data with unknown distribution, including the consistency-based filter (Dash and Liu 2003), the INTERACT algorithm (Zhao and Liu 2007), the Relief and ReliefF algorithms (Kononenko 1994) and minimum redundancy maximum relevance (mRMR) (Peng et al. 2005).
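Returning to the chi-squared statistic of Eq. (3), the following plain-Python sketch computes it from a table of counts in the usual observed-minus-expected form, where the expected count for interval i and class j is the interval total times the class total divided by N. The function name `chi_squared` and the example tables are illustrative assumptions; cells with zero expected count are skipped to avoid division by zero.

```python
def chi_squared(table):
    """Chi-squared statistic of a contingency table.

    table[i][j] is the number of instances in interval i and class j.
    """
    interval_totals = [sum(row) for row in table]
    class_totals = [sum(col) for col in zip(*table)]
    n = sum(interval_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = interval_totals[i] * class_totals[j] / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# A feature whose intervals are unrelated to the class, and one that separates the classes.
print(chi_squared([[10, 10], [10, 10]]))
print(chi_squared([[20, 0], [0, 20]]))
```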

Different filtering techniques can be developed by changing the search strategy for subset generation and the evaluation function for assessing the independent measure of each subset (Liu and Yu 2005). This paper focuses on filtering techniques for channel selection because wrapper and hybrid selection techniques depend heavily on the choice of classifier and on the subject. Filter techniques, in contrast, are subject and classifier independent, as they use other parameters related to signal information to select the best channel subset. Most filtering techniques are based on some statistical criterion for channel selection. Measures based on location (Faul and Marnane 2012), variance (Duun-Henriksen et al. 2012), mutual information (Atoufi et al. 2009), and redundancy and relevancy (Yang et al. 2013; Shenoy and Vinod 2014) are the most common in filtering techniques. In MI applications, Common Spatial Pattern (CSP) based measures are mostly used to rank the channels for selection (Alotaiby et al. 2015).

figure b

2.2 Wrapper techniques

The wrapper method uses a predictor and its output as an objective function to evaluate the subset. The advantage of wrapper techniques is accuracy: wrappers generally achieve better recognition rates than filters because they are tuned to the specific interactions between the classifier and the dataset. The wrapper also has a mechanism to avoid over-fitting, owing to the use of cross-validation measures of predictive accuracy. The disadvantages are slow execution and lack of generality, since the solution is tied to the bias of the classifier used in the evaluation function. The “optimal” feature subset will be specific to the classifier under consideration (Chrysostomou 2009). Pseudo code for a generalized wrapper algorithm is given in Algorithm 2, which is quite similar to the filter algorithm except that the wrapper uses a predefined mining or classification model A rather than an independent measure M for subset evaluation. For each candidate subset S, the wrapper evaluates the goodness of the subset by applying the model A to the feature subset S. Therefore, changing the search function that generates the subset, or the prediction model A, results in different wrapper algorithms (Liu and Yu 2005).

Most wrapper methods use search algorithms to find a set of optimal electrodes. The optimization function mostly revolves around maximizing performance and minimizing the number of channels for a certain range of accuracy. Sequential forward and backward search (Shih et al. 2009; Kamrunnahar et al. 2009) as well as heuristic/random search (Millán et al. 2002; Wei and Wang 2011) are the most widely used search algorithms in the literature. The following subsections present key search algorithms in more detail.

2.2.1 Sequential selection algorithms

These algorithms iteratively search the feature space to select the best features. The most common algorithm is sequential feature selection (SFS) (Pudil et al. 1994), which starts with an empty set and adds the feature that yields the maximum value of the objective function. In each subsequent step, the remaining features are added individually to the current subset and the new subset is evaluated. The reverse of SFS is known as sequential backward selection (SBS), which starts with all features and removes the feature whose removal least affects the objective function (Reunanen 2003). A more flexible algorithm is sequential floating forward selection (SFFS), which introduces a backtracking step in addition to SFS (Reunanen 2003). The backtracking step removes one feature at a time from the subset and evaluates the new subset. If removing a feature improves the objective function, the algorithm goes back to the first step with the reduced feature set. Otherwise, it repeats the steps until the required number of features or performance is achieved. The main problem with the SFS and SFFS algorithms is the nesting effect, which means that two features with a relatively high correlation might both be included in the subset because they give the maximum accuracy. To avoid this, adaptive SFFS (Somol et al. 1999) and the Plus-L-Minus-r search method (Nakariyakul and Casasent 2009) were also developed.
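The forward variant described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of any cited work: the objective is assumed to be the leave-one-out accuracy of a 1-nearest-neighbour classifier (`loo_accuracy`), chosen only because it needs no external library, and the data in `X` and `y` are hypothetical.

```python
def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on columns `cols`."""
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        nearest, nearest_dist = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            dist = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        correct += y[nearest] == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, n_features, objective=loo_accuracy):
    """Start from the empty set and repeatedly add the feature that maximizes the objective."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_features:
        best = max(remaining, key=lambda c: objective(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): column 0 separates the classes, columns 1 and 2 do not.
X = [[0.0, 5.0, 1.0], [0.1, 1.0, 0.0], [0.2, 4.0, 1.0],
     [1.0, 5.1, 0.0], [1.1, 0.9, 1.0], [1.2, 4.1, 0.0]]
y = [0, 0, 0, 1, 1, 1]
print(sequential_forward_selection(X, y, n_features=1))
```

A backward variant would start from all columns and remove, at each step, the column whose removal reduces the objective the least; a floating variant would add a conditional removal step after each addition.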

2.2.2 Heuristic search algorithms

Heuristic algorithms are designed to solve a problem faster and more efficiently than traditional methods. A heuristic is a rule of thumb that sacrifices optimality, accuracy, precision or completeness to find a solution quickly. Usually, these algorithms are used to solve NP-complete problems. The most commonly used heuristic algorithms are genetic algorithms (Davis 1991), particle swarm optimization (Kennedy 2011), simulated annealing (Yang 2014), ant colony optimization (Dorigo et al. 2008) and differential evolution (Baig et al. 2017).

A recent method selects channels by using the inconsistencies in classification accuracy that arise after a noisy channel is added (Yang et al. 2014). A predefined classification criterion was set, and a channel was selected if the criterion was met after including it. The technique used SVM, naive Bayes, LDA and decision tree classifiers to test a channel. It recorded an average improvement of up to 4% in comparison with SVM, mutual information, CSP and Fisher criterion-based channel selection methods. It also found that the selected channels were mainly located over the motor and somatosensory association cortex areas. In another work, Ghaemi et al. (2017) used an improved Binary Gravitation Search Algorithm (IBGSA) to automatically select the optimal channels for MI-based BCI. The algorithm used SVM as a classifier, with both time-domain and wavelet-domain features. A classification accuracy of \(55.82 \pm 8.30\)% was achieved with all channels. The accuracy increased further to 60% after applying PCA for feature reduction. With the proposed channel selection algorithm, an accuracy of \(76.24 \pm 2.78\)% was achieved with an average of 7 channels.

Recently, researchers have been investigating evolutionary algorithms for channel selection. Most of the literature uses the genetic algorithm and its variations. Antonio et al. (2012) proposed a channel selection algorithm for an imagined speech application. In the first stage, the method searched for non-dominated channel combinations (the Pareto front) using a multi-objective wrapper technique based on NSGA-II (Elitist Non-Dominated Sorting Genetic Algorithm). In the next stage, the algorithm assessed each channel combination and automatically selected one combination using a Mamdani fuzzy inference system (FIS). The error rate and the number of channels selected were used as the multi-objective optimization function. In comparison to the classification accuracy of 70.33% with all channels, the algorithm achieved a classification accuracy of 68.18% with only 7 channels. In another study, a multi-objective genetic algorithm (GA) was used to optimize the number of channels selected and the classification accuracy; the effectiveness of GA in channel selection applications was shown by considering three different variations of GA, i.e. simple GA, steady-state GA and NSGA-II (Kee et al. 2015). The algorithms showed an increase of almost 5% in classification accuracy with an average of 22 selected channels.

figure c

2.3 Hybrid techniques

Hybrid techniques combine the above two techniques and eliminate the need to pre-specify a stopping criterion. The hybrid model was developed to deal with large datasets. Algorithm 3 shows a typical hybrid algorithm that uses both an independent measure M and a mining algorithm A to evaluate the fitness of a subset. The algorithm starts its search from a given subset \(S_{0}\) and tries to find the best subset in each round while increasing the cardinality. The parameters \(\gamma _{best} \) and \(\theta _{best} \) correspond to the cases with and without a classifier, respectively, and are calculated in each round. The quality of the results from the mining algorithm offers a natural stopping criterion (Liu and Yu 2005).

Li et al. (2011) used the L1 norm of CSP to first sort out the best channels and then used the classification accuracy as an optimization function. With this method, an average accuracy of 90% was achieved with only 7 channels. The algorithm was tested on Dataset IVa of BCI competition III. Handiru and Prasad (2016) proposed an iterative multi-objective optimization for channel selection (IMOCS) method that initialized the C3 and C4 channels as the candidates and updated the channel weight vector in each iteration. The optimization parameters were the ROI (motor cortex) and the classification accuracy. To terminate the iterative channel vector update, a convergence metric based on the inter-trial deviation was used. The algorithm was evaluated on the Wadsworth dataset (Schalk et al. 2004). The proposed approach achieved a classification accuracy of 80% and a cross-subject classification accuracy of 61% for untrained subjects.
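The general hybrid pattern of "rank with a filter score, then let a classifier decide how many channels to keep" can be sketched as follows. This is a generic illustration and not the procedure of Li et al. (2011) or of IMOCS: the function `best_prefix`, the channel ranking and the accuracy values in `TOY_ACCURACY` are hypothetical stand-ins for a real filter ranking and a real cross-validated classifier.

```python
def best_prefix(ranked_channels, evaluate):
    """Evaluate the top-1, top-2, ... channels of a filter ranking and keep the best prefix.

    ranked_channels: channel names sorted by a filter score, best first.
    evaluate:        callable(subset) -> score, e.g. cross-validated accuracy.
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_channels) + 1):
        subset = ranked_channels[:k]
        score = evaluate(subset)
        if score > best_score:            # ties keep the smaller subset
            best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical accuracies indexed by subset size, standing in for a trained classifier.
TOY_ACCURACY = {1: 0.70, 2: 0.85, 3: 0.90, 4: 0.88}
ranking = ["C3", "C4", "Cz", "Fz"]
subset, accuracy = best_prefix(ranking, lambda s: TOY_ACCURACY[len(s)])
print(subset, accuracy)
```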

2.4 Embedded techniques

Another type of technique is known as the embedded technique. In embedded techniques, channel selection depends on criteria created during the learning process of a specific classifier, because the selection model is incorporated into the classifier construction (Dash and Liu 1997). Embedded techniques reduce the computation time spent on reclassifying different subsets, which is required in wrapper methods. They are less prone to over-fitting and have low computational complexity. Two commonly used embedded feature selection techniques are given below.

2.4.1 Recursive feature elimination for SVM (SVM-RFE)

This method performs iterative feature selection by training an SVM. The features that have the least impact on performance, as indicated by the SVM weights, are removed (Chapelle and Keerthi 2008). Some other methods use statistical measures to rank the features instead of using a classifier. Battiti (1994) used mutual information and a greedy search algorithm to find the subset. Kwak and Choi (2002) used a Parzen window to estimate mutual information. Peng et al. (2005) used mRMR instead of mutual information to select features.

2.4.2 Feature selection-perceptron (FS-P)

Like SVM-RFE, a multilayer perceptron network can be used to rank the features. In a simple neural network, a feedforward approach is used to update the weights of the perceptron, which can then be used to rank features (Romero and Sopena 2008). A cost function can be used to eliminate features randomly (Stracuzzi and Utgoff 2004).

There are some other methods that use unsupervised learning techniques to find the optimal subset of features. With unsupervised learning-based feature selection, a better description of the data can be achieved. Ensemble feature selection is used in the literature to find a stable feature set (Haury et al. 2011; Abeel et al. 2009). The idea behind this technique is that different subsets are generated by a bootstrapping method and a single feature selection method is applied to these subsets. Aggregation methods such as ensemble mean, linear aggregation and weighted aggregation are used to obtain the final result.

In this survey, we focus on filtering techniques used in channel selection for MI-based EEG signals. A review of channel selection algorithms in different fields shows that they are applied to improve performance and reduce computational complexity. Channel selection algorithms have been applied in seizure detection and prediction, motor imagery classification, emotion estimation, mental task classification, sleep state and drug state classification (Alotaiby et al. 2015).

3 Channel selection for motor imagery classification

The analysis of motor imagery is of great importance to patients suffering from motor injury. Such analysis can be performed through EEG signals and may involve channel selection to identify the channels that are most related to a specific cognitive activity, as well as to reduce the overall computational complexity of the system.

Filtering techniques use independent criteria such as mutual information or entropy to evaluate a channel. Filtering is commonly used in channel selection for MI classification because these techniques are based on signal statistics. They can be divided into CSP-based and non-CSP-based techniques.

3.1 CSP based filtering techniques

Common Spatial Pattern (CSP) filters are often used in the literature to study MI-EEG. The reason is that CSP has the ability to maximize the difference in variance between the two classes (Koles et al. 1990).
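To make the variance-maximizing property concrete, the following NumPy sketch computes CSP filters for two classes by whitening the composite covariance and then diagonalizing the whitened covariance of one class. It is a textbook-style illustration under stated assumptions, not the code of any cited work: trials are assumed to be band-pass filtered arrays of shape (trials, channels, samples), the composite covariance is assumed to be full rank, and the function names and synthetic data are chosen here for illustration.

```python
import numpy as np


def normalized_mean_cov(trials):
    """Average trace-normalized spatial covariance over trials.

    trials: array of shape (n_trials, n_channels, n_samples).
    """
    covs = []
    for trial in trials:
        c = trial @ trial.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def csp_filters(trials_a, trials_b):
    """Return (filters, eigenvalues); each row of `filters` is one spatial filter."""
    cov_a = normalized_mean_cov(trials_a)
    cov_b = normalized_mean_cov(trials_b)
    # Whiten the composite covariance (assumed to be full rank).
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitening = np.diag(evals ** -0.5) @ evecs.T
    # Eigenvectors of the whitened class-A covariance; eigenvalues lie in [0, 1].
    d, b = np.linalg.eigh(whitening @ cov_a @ whitening.T)
    order = np.argsort(d)[::-1]           # largest class-A variance share first
    return b[:, order].T @ whitening, d[order]


# Synthetic demo: channel 0 is strong in class A, channel 1 is strong in class B.
rng = np.random.default_rng(0)
scale_a = np.array([3.0, 1.0, 1.0])[None, :, None]
scale_b = np.array([1.0, 3.0, 1.0])[None, :, None]
trials_a = rng.standard_normal((20, 3, 200)) * scale_a
trials_b = rng.standard_normal((20, 3, 200)) * scale_b
filters, eigenvalues = csp_filters(trials_a, trials_b)
print(np.round(eigenvalues, 2))
```

Each eigenvalue is the share of the projected variance that belongs to class A, so the first rows of `filters` favour class A and the last rows favour class B; these extreme filters are the ones normally kept for feature extraction and channel ranking.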

3.1.1 CSP variance maximization methods

In the literature, researchers have used CSP filters to maximize the variance between classes. Wang et al. (2006) used maximum spatial pattern vectors from the common spatial pattern (CSP) algorithm for channel selection. The channels were selected using the first and last columns of the spatial pattern matrix, as they corresponded to the largest variance of one task and the smallest variance of the other task. The features selected for classification were the averaged time courses of event-related de-synchronization (ERD) and readiness potential (RP), because they provide considerable discrimination between the two tasks. Fisher Discriminant (FD) was applied as a classifier and a \(10 \times 10\) cross-validation was used to evaluate accuracy. The dataset used was dataset IVa of BCI competition III, provided by Fraunhofer FIRST (Intelligent Data Analysis Group) and University Medicine Berlin (Neurophysics group) (Dornhege et al. 2004). The results were divided into two parts based on the number of selected channels. In the first part, 4 channels were selected and the classification accuracy was 93% and 91% for the two subjects. With 8 channels, the classification accuracy increased to 96% and 93% for subjects 1 and 2 respectively, at the cost of reducing the suitability of the system.

Another channel selection method based on the spatial pattern was developed by Yong et al. (2008). The authors redefined the optimization problem of CSP by incorporating an L1 norm regularization term, which induced sparsity in the pattern weights. The problem was to find filters or weight vectors that transform the signal into a one-dimensional space in which one class has maximum variance and the other has minimum variance. The high variance corresponded to strong rhythms in the EEG signal because of the RP, and the low variance was related to attenuated rhythms, which are generated during a right-hand imagined movement or when an ERD occurs. This was solved with an optimization process in which the cost function was defined as the variance of the projected signal of one class. The cost function was minimized while the combined sum of the variances of both classes remained fixed. On average, 13 out of 118 channels were selected for classification, which gave a classification accuracy of 73.5%. The classification accuracy with all channels was 77.3%, so a small drop of 3.8% was recorded while reducing the channels by an average of 80%. In this method, the regularization parameter was selected manually; to produce optimal results, it should be selected automatically.

Meng et al. (2009) presented a heuristic approach to select a channel subset based on L1 norm scores derived from CSP. Because CSP can be affected by over-fitting when the number of channels is large, CSP was applied twice, which turned channel selection into a complex optimization problem that was solved heuristically in two stages. In the first stage, CSP was applied to all channels, the score of each channel was evaluated based on the L1 norm, and only the channels with the highest scores were retained for further processing. In the second stage, CSP was applied to the selected channels to generate features, which were passed to the classifier.

The method was compared with the commonly used \({\gamma }^{2}\) value, in which each channel is scored by a function of the samples, mean and standard deviation of the classes (Lal et al. 2004). A support vector machine (SVM) with a Gaussian radial basis function (RBF) kernel was used as a classifier, and \(10 \times 10\) cross-validation was used to evaluate classification accuracy. The method of Meng et al. (2009) outperformed manual selection of electrodes and the \({\gamma }^{2}\) value for all 5 subjects, with an average classification accuracy of 89.68% and a deviation of 4.88%.
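The idea of scoring channels by the L1 norm of their CSP coefficients can be sketched as below. The exact score used by Meng et al. (2009) may include normalization or other details not reproduced here; in this illustration the score of a channel is simply assumed to be the sum of the absolute values of its coefficients across the retained spatial filters, and the filter values and channel names are hypothetical.

```python
def l1_channel_scores(filters):
    """Score each channel by the L1 norm of its coefficients across the given filters.

    filters: list of spatial filters, each a list with one coefficient per channel.
    """
    n_channels = len(filters[0])
    return [sum(abs(f[c]) for f in filters) for c in range(n_channels)]


def top_channels(filters, names, k):
    """Return the names of the k channels with the largest L1 scores."""
    scores = l1_channel_scores(filters)
    ranked = sorted(range(len(names)), key=lambda c: scores[c], reverse=True)
    return [names[c] for c in ranked[:k]]


# Two hypothetical CSP filters over four channels.
names = ["C3", "C4", "Cz", "Fz"]
filters = [[0.9, -0.1, 0.05, 0.0],
           [-0.2, 0.8, 0.1, 0.05]]
print(top_channels(filters, names, k=2))
```

In a two-stage scheme of this kind, CSP would then be recomputed on the retained channels only, and the resulting features passed to the classifier.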

3.1.2 Methods based on CSP variants

Some researchers modified the CSP algorithm to extract the optimal number of channels. He et al. (2009) presented a channel selection method based on minimizing the Bhattacharyya bound of CSP. The Bayes risk for CSP was used as the optimization criterion. Because calculating the Bayes risk directly is difficult, an upper bound of the Bayes risk, known as the Bhattacharyya bound, was used as a substitute for measuring discriminability. After finding the optimal index through the Bhattacharyya bound of CSP, a sequential forward search was implemented to extract a subset of channels. Features were extracted through CSP with a dimension vector of 6. Dataset 1 of BCI competition IV was used in the experiment (Naeem et al. 2006). The authors selected data from subjects a, b, d and e, each with 200 trials. A naïve Bayes classifier was applied for classification and achieved an average classification accuracy of 90% with an average of \(\sim \) 33 electrodes out of 59, using 10 times 3-fold cross-validation.

Tam et al. (2011) selected channels by sorting the CSP filter coefficients, known as the CSP rank. The CSP generates two filters for the two classes. The spatial filter coefficients, which generate the new filtered signals, were treated as weights assigned to the different channels by the CSP. If the weight of a particular electrode was large, the electrode was considered to contribute more towards the filtered signal and hence to be important. The first electrode was the one with the largest value among the sorted coefficients of the class 1 filter. The second channel was taken from the sorted coefficients of the class 2 filter. If a channel was already selected, the algorithm moved to the next largest coefficient in the same class until a new channel was selected. The data was recorded from five chronic stroke patients using a 64-channel EEG headset at a sampling rate of 250 Hz over 20 sessions of MI tasks, each session performed on a different day (Meng et al. 2008). The proposed algorithm showed an average classification accuracy of 90% with 8 to 36 electrodes out of 64 using a Fisher linear discriminant (FLD) classifier. The best classification accuracy was 91.7% with 22 electrodes.
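The alternating selection rule of the CSP rank can be sketched as follows. This is a minimal illustration, not the authors' implementation: ranking by the absolute value of the coefficients is an assumption, and the function name `csp_rank` and the example filters are hypothetical.

```python
def csp_rank(filter_class1, filter_class2, n_select):
    """Alternate between the two CSP filters, each time taking the unused
    channel with the largest |coefficient| in the current filter."""
    n_select = min(n_select, len(filter_class1))
    ranked = [
        sorted(range(len(f)), key=lambda ch: abs(f[ch]), reverse=True)
        for f in (filter_class1, filter_class2)
    ]
    positions = [0, 0]
    selected = []
    turn = 0
    while len(selected) < n_select:
        order = ranked[turn]
        # Skip channels that were already picked via the other filter.
        while positions[turn] < len(order) and order[positions[turn]] in selected:
            positions[turn] += 1
        if positions[turn] < len(order):
            selected.append(order[positions[turn]])
            positions[turn] += 1
        turn = 1 - turn
    return selected

# Hypothetical filter coefficients for a 4-channel montage.
w1 = [0.1, -0.9, 0.3, 0.05]
w2 = [0.2, 0.8, -0.7, 0.1]
chosen = csp_rank(w1, w2, 3)   # -> [1, 2, 0]
```

Channel 1 dominates both filters, so the class 2 turn falls through to channel 2, which is the "move to the next largest coefficient" step described above.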

The results showed that SCSP outperformed the existing channel selection algorithms including Fisher Discriminant (FD), Mutual Information, SVM, CSP and RCSP. The algorithm attained a 10% improvement in classification accuracy by reducing the number of channels compared to the three channel case (C3, C4 and Cz). The average classification accuracy of SCSP1 was 81.63% and 82.28% with an average of 13.22 and 22.6 channels for datasets IIa and IVa respectively. The average classification accuracy of SCSP2 was 79.09% and 79.28% with an average of 8.55 and 7.6 channels for datasets IIa and IVa respectively. The algorithm took 50.1 s to converge with 1001 iterations on dataset IVa and 2.63 s with 454 iterations on dataset IIa. A \(10 \times 10\) cross-validation with an SVM classifier was used to evaluate classification performance.

One of the hurdles encountered in analysing an EEG signal is its non-stationary nature. EEG signals differ from session to session due to the preconditions of the subject, so adaptive or robust algorithms are needed to tackle these variations. Arvaneh et al. (2012) presented a Robust Sparse CSP (RSCSP) algorithm to deal with session-to-session channel selection problems in BCI. The pre-specified channel subset selection was based on experience. They replaced the covariance matrix in SCSP with a robust minimum covariance determinant (MCD) estimate that involved a parameter to resist outliers.

Table 1 Summary of CSP based channel selection method for motor imagery EEG

To assess the credibility of the proposed algorithm, a comparison was carried out with five existing channel selection algorithms. The data was recorded from 11 stroke patients across 12 different sessions of motor imagery, over 27 electrodes at a sampling rate of 250 Hz (Ang et al. 2011). Eight electrodes were shortlisted with the RSCSP algorithm and the same channels were used across the 11 other sessions. The results showed that the proposed algorithm outperformed SCSP by 0.88%, CSP by 2.85%, Mutual Information (MI) by 2.69%, Fisher Criterion (FC) by 4.85% and SVM by 4.58% over all 11 subsequent sessions. The overall average classification accuracy of RSCSP was 70.47%, which is modest. A possible reason is that no search strategy was used to obtain an optimal subset of channels; instead, the subset definitions were generated based on experience. Table 1 shows the summary of CSP based channel selection algorithms for motor imagery EEG applications.

3.2 Non-CSP based filtering techniques

Some researchers have proposed non-CSP based algorithms for MI channel selection. Some of the recent work is summarized in this section.

3.2.1 Information measure based methods

Information measures have been used frequently in selecting channels for EEG applications. He et al. (2013) proposed an improved Genetic Algorithm (GA) that involved Rayleigh coefficients (RC) for channel selection in motor imagery EEG signals. Maximization of the Rayleigh coefficient was performed not only to maximize the difference between the covariance matrices but also to minimize their sum. Like common spatial patterns, the Rayleigh coefficients are affected by redundant channels, so the authors proposed a GA-based channel selection algorithm that utilized Rayleigh coefficient maximization. In the first stage, some channels were deleted using the Fisher ratio to reduce computational complexity, and the first channel subset was constructed from the electrodes with the maximum Fisher ratio. In the second stage, an improved genetic algorithm that utilized Rayleigh coefficient maximization was proposed for selecting the optimal channel subset.

The proposed algorithm was assessed on two datasets. The first was BCI competition III dataset IVa (Dornhege et al. 2004). The second was recorded from a 32-channel EEG system with a sampling rate of 250 Hz. The performance of the proposed algorithm was compared with other algorithms such as Sequential Forward and Backward Search (SFS and SBS) and SVM-GA. The results showed that RC-GA achieved high classification accuracy with lower computational cost. The proposed algorithm achieved an accuracy of 88.2% for dataset 1 and 89.38% for dataset 2, with an average of 50 and 25 selected channels respectively. It was shown that RC-GA attained a more compact and optimal channel subset in comparison to the other mentioned algorithms.
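The first-stage pre-filtering by Fisher ratio can be sketched as below. The formula used, squared mean difference over the sum of class variances, is the common definition and is an assumption here, since the exact expression of He et al. (2013) is not reproduced in this survey; the channel names and feature values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_ratio(class1_values, class2_values):
    """(m1 - m2)^2 / (v1 + v2) for one channel's per-trial feature values."""
    m1, m2 = mean(class1_values), mean(class2_values)
    v1, v2 = pvariance(class1_values), pvariance(class2_values)
    return (m1 - m2) ** 2 / (v1 + v2)

def prefilter_channels(features1, features2, n_keep):
    """features1/features2 map a channel name to its per-trial feature
    values for class 1 and class 2; keep the n_keep best-separated channels."""
    ratios = {ch: fisher_ratio(features1[ch], features2[ch]) for ch in features1}
    return sorted(ratios, key=ratios.get, reverse=True)[:n_keep]

# Hypothetical band-power features for three channels, three trials per class.
features1 = {"C3": [1.0, 1.2, 0.8], "Cz": [1.0, 1.1, 0.9], "C4": [1.5, 1.7, 1.3]}
features2 = {"C3": [2.0, 2.2, 1.8], "Cz": [1.0, 0.9, 1.1], "C4": [1.0, 1.2, 0.8]}
candidates = prefilter_channels(features1, features2, n_keep=2)   # -> ['C3', 'C4']
```

The reduced candidate set would then be handed to the genetic search, which is where most of the computation is spent.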

Yang et al. (2013) presented a novel method for channel selection by considering mutual information and redundancy between the channels. The technique used Laplacian derivatives (LAD) of the power averaged across frequency bands from 4 to 44 Hz, with a bandwidth of 4 Hz and an overlap of 2 Hz. After calculating the LAD power features, the maximum dependency with minimum redundancy (MD-MR) algorithm was applied.
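As a small worked example of the band layout, the following sketch enumerates overlapping bands for a given range, bandwidth and overlap. The helper name `overlapping_bands` is hypothetical, and reading "overlap of 2 Hz" as a 2 Hz step between consecutive 4 Hz bands is an assumption about the intended layout.

```python
def overlapping_bands(f_start, f_stop, bandwidth, overlap):
    """List (low, high) band edges in Hz; consecutive bands share `overlap` Hz."""
    step = bandwidth - overlap
    bands = []
    low = f_start
    while low + bandwidth <= f_stop:
        bands.append((low, low + bandwidth))
        low += step
    return bands

bands = overlapping_bands(4, 44, bandwidth=4, overlap=2)
# 19 bands under this reading: (4, 8), (6, 10), ..., (40, 44)
```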

The proposed technique was applied to a dataset collected from 11 healthy subjects performing motor imagery of walking. LAD was extracted from 22 channels selected symmetrically. Among these 22 channels, 10 channels were selected using the proposed MD-MR technique. The results were compared with other algorithms such as filter bank common spatial pattern (FBCSP), filter bank with power features (FBPF), CSP and sliding window discriminant CSP (SWD-CSP). A \(10 \times 10\) cross-validation was used to evaluate the results. The results showed an increase in classification accuracy of 1.78% and 3.59% compared to FBCSP and SWD-CSP respectively. With 4, 10 and 16 selected LAD channels, the average accuracy was 67.19% ± 2.78, 71.45% ± 2.50 and 71.64% ± 2.67 respectively, which was slightly less than the accuracy with all 22 LAD channels. Overall, the performance was not degraded much when fewer electrodes were used.

Shenoy and Vinod (2014) presented a channel selection algorithm that uses prior information about motor imagery tasks and optimizes the relevant channels iteratively. The EEG signals were band-pass filtered using a Butterworth filter of order 4 into 9 overlapping bands with a bandwidth of 5 Hz and an overlap of 2 Hz. The overlapping bands were then transformed to a new subspace using common spatial pattern filters, followed by feature selection using minimum redundancy and maximum relevance (mRMR). The channel selection algorithm was divided into two stages.

In the first stage, prior information about MI tasks was used to choose a reference channel from the regions of neuropsychological significance for MI. A kernel radius was initialized in this step, and channels lying within the kernel were assigned a weight inversely proportional to their Euclidean distance from the reference channel. In the second stage, finding the optimal channel subset was posed as an optimization problem that was solved iteratively: the weights were updated by incorporating the Euclidean distance and the difference of ERD and ERS band power values into the update equation. A modified periodogram with a Hamming window was used instead of the Welch method to reduce computation cost. The results showed that the proposed algorithm produced an average classification accuracy of 90.77% and surpassed FBCSP, SCSP1 and SCSP2 with only 10 channels for dataset 1. For the second dataset, the accuracy was around 80%, which was less than that of SCSP1.
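The first-stage weight initialization can be illustrated with a short sketch. The electrode coordinates are hypothetical 2D positions, and the weight `1 / (1 + d)` is an assumed form of "inversely proportional to the Euclidean distance" chosen so that the reference channel itself (distance zero) gets a finite weight; the second-stage update equation of Shenoy and Vinod (2014) is not reproduced here.

```python
import math

def kernel_weights(positions, reference, radius):
    """Assign initial weights to the channels inside a kernel around the
    reference channel; channels outside the radius are left out."""
    ref = positions[reference]
    weights = {}
    for name, pos in positions.items():
        d = math.dist(ref, pos)
        if d <= radius:
            weights[name] = 1.0 / (1.0 + d)   # assumed inverse-distance form
    return weights

# Hypothetical normalised scalp coordinates.
positions = {"C3": (-0.5, 0.0), "C1": (-0.25, 0.0), "Cz": (0.0, 0.0), "C4": (0.5, 0.0)}
weights = kernel_weights(positions, reference="C3", radius=0.6)
```

With these coordinates C3, C1 and Cz fall inside the kernel with decreasing weights, while C4 lies outside the radius and receives no weight.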

Shan et al. (2015) developed an algorithm for optimal channel selection that involves real-time feedback and proposed an enhanced method, IterRelCen, built on the Relief algorithm. The enhancements were made in two aspects: the change of target sample selection strategy and the implementation of iteration. A surface Laplacian filter was applied to the raw EEG signals and features were extracted from a frequency range of 5–35 Hz. The frequency range was decomposed into 13 partially overlapped bands with proportional bandwidth. The Hilbert transform was applied to extract the envelope from each band. IterRelCen is an extension of the ReliefF algorithm, whose pseudocode is given in Algorithm 4 (Robnik-Šikonja and Kononenko 2003). IterRelCen differs from the Relief algorithm in two aspects.

  1. The target sample selection rule was adjusted: rather than selecting target samples randomly, samples close to the center of the database from the same class had the priority of being selected first.

  2. The idea of iterative computation was borrowed to eliminate the noisy features: N features with the smallest weights were removed from the current feature set after each iteration.

Algorithm 4 Pseudocode of the ReliefF algorithm (Robnik-Šikonja and Kononenko 2003)
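The two modifications listed above can be sketched on top of a basic Relief update. This is a simplified illustration under stated assumptions, not Algorithm 4 or the authors' code: it uses one nearest hit and one nearest miss instead of the k neighbours of ReliefF, omits feature range normalisation, and the names `relief_weights`, `iter_rel_cen` and the toy data are hypothetical.

```python
import math

def relief_weights(samples, labels, features, n_targets):
    """Basic Relief on the active features. Target samples are taken in
    order of closeness to their own class centre (modification 1)."""
    def dist(a, b):
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    centres = {}
    for c in set(labels):
        members = [s for s, l in zip(samples, labels) if l == c]
        centres[c] = {f: sum(m[f] for m in members) / len(members) for f in features}
    order = sorted(range(len(samples)),
                   key=lambda i: dist(samples[i], centres[labels[i]]))

    weights = {f: 0.0 for f in features}
    for i in order[:n_targets]:
        others = [j for j in range(len(samples)) if j != i]
        hit = min((j for j in others if labels[j] == labels[i]),
                  key=lambda j: dist(samples[i], samples[j]))
        miss = min((j for j in others if labels[j] != labels[i]),
                   key=lambda j: dist(samples[i], samples[j]))
        for f in features:
            # Reward features that differ on the miss and agree on the hit.
            weights[f] += (abs(samples[i][f] - samples[miss][f])
                           - abs(samples[i][f] - samples[hit][f])) / n_targets
    return weights

def iter_rel_cen(samples, labels, n_keep, n_targets, n_remove=1):
    """Drop the n_remove lowest-weight features per iteration (modification 2)."""
    features = list(range(len(samples[0])))
    while len(features) > n_keep:
        w = relief_weights(samples, labels, features, n_targets)
        drop = sorted(features, key=w.get)[:min(n_remove, len(features) - n_keep)]
        features = [f for f in features if f not in drop]
    return features

# Toy data: feature 0 separates the classes, features 1 and 2 are mostly noise.
samples = [[0.0, 0.5, 0.4], [0.1, 0.4, 0.5], [0.2, 0.6, 0.3],
           [1.0, 0.4, 0.6], [0.9, 0.6, 0.5], [1.1, 0.5, 0.7]]
labels = [0, 0, 0, 1, 1, 1]
kept_features = iter_rel_cen(samples, labels, n_keep=1, n_targets=4)   # -> [0]
```

Recomputing the weights after each removal is what distinguishes the iterative variant from a single Relief pass, since the nearest hits and misses change once noisy features are dropped.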

Three different datasets were used in this research study. Dataset 1 was from the data analysis competition (Sajda et al. 2003). EEG was recorded from 59 electrodes with a sampling rate of 100 Hz. For the second dataset, the experiment was conducted in the Biomedical Functional Imaging and Neuroengineering Laboratory at the University of Minnesota (Yuan et al. 2008). A 62-channel EEG system was used to record data from eight subjects at a sampling rate of 200 Hz. Dataset 3 was also from the same lab (Yuan et al. 2008). The only difference was that a four-class control signal controlled the movement of a cursor in four directions, i.e. left hand, right hand, both hands and nothing moved the cursor in the left, right, up and down directions respectively.

A multiclass SVM was applied to classify the signals. Tenfold cross-validation was used to evaluate the results. One-way ANOVA was employed to test for significant performance improvement. The classification results were 85.2%, 91.4% and 83.2% for datasets 1, 2 and 3 respectively, with an average number of selected channels of 14 ± 5 for dataset 1, 22 ± 6 for dataset 2 and 29 ± 8 for dataset 3.

3.2.2 Other methods

Shan et al. (2012) explained that increasing the number of channels gradually improves classification accuracy. The concept was to gradually increase the number of channels and opt for a time-frequency-spatial synthesis model. After preprocessing, 13 overlapping sub-bands were extracted with a constant proportional bandwidth using a 3rd order Butterworth filter. The envelope of each band was extracted using the Hilbert transform and treated as a feature because it contained the power modulation information available in that frequency band. The trial-to-trial mean was calculated and applied as the input to the classification algorithm, which was performed on the weighted frequency pattern.

They tested the technique on two different datasets for comparison. Gradually increasing the number of channels raised the classification accuracy for the first dataset but not for the second. For dataset 1, as the number of channels was increased from 2 to 62, the classification accuracy rose from 68.7 to 90.4% on the training data and from 63.7 to 87.7% on the testing data. For dataset 2, the training accuracy increased from 77.5 to 91.6% as the number of channels was increased from 2 to 59, but the testing accuracy did not follow: it increased from 2 up to 16 channels and then dropped from 81.3 to 68.9% as the number of channels was increased from 16 to 59. In general, it was concluded that increasing the number of channels will increase the classification accuracy for on-line BCI rather than for off-line BCI. The problem with this technique was that the envelopes all corresponded to different frequency bands and did not contribute towards accurate classification.

Das and Suresh (2015) proposed a Cohen’s d effect size CSP (E-CSP) based channel selection algorithm. The method eliminates channels that do not carry any useful information. The algorithm removes noisy trials from each channel and then calculates the Cohen’s d effect size for channel selection. The noisy trials were eliminated by calculating a z-score, which measured the distance of a trial from the trial mean. A larger z-score meant that the trial was farther from the mean trial and could be declared noisy. The threshold for the z-score was calculated using cross-validation. Cohen’s d effect size was then used to select channels. The Cohen’s d distance was calculated using the equation:

$$\begin{aligned} d_i=\frac{|\overline{C_{1i}}-\overline{C_{2i}}|}{\sigma } \end{aligned}$$
(4)

where

$$\begin{aligned} \sigma =\frac{\sigma ^j_{1i}+\sigma ^j_{2i}}{2} \end{aligned}$$
(5)

\({\sigma }^j_{li}\) represents the standard deviation of class \(l\) across the selected trials j of channel i, and \(\overline{C_{li}}\) represents the mean of class \(l\) for the ith channel. A channel was selected if its d value was greater than \(\delta \).

\(L= \{i\ :d_i>\delta \};\forall \ i\) represents the set of selected channels. \(\delta \) was calculated using cross-validation and the recommended range for \(\delta \) is \([0.01, 0.1]\). CSP was implemented to extract features. Two datasets were utilized to evaluate the algorithm. One was from BCI competition IV, dataset IIa (Naeem et al. 2006) and the other was from BCI competition III, dataset IVa (Dornhege et al. 2004). SVM was used as the classifier and the results were compared with the CSP, SCSP1 and SCSP2 algorithms. The results showed that for dataset 1, the proposed algorithm gained an average increment of 3% over the other algorithms with an average classification accuracy of 83.61% using approximately 8 channels. For dataset 2, the average classification accuracy was 85.85% with an average of 9.20 channels. It was concluded that the algorithm increased the classification accuracy and decreased the number of selected channels compared to the other mentioned algorithms.
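Equations (4) and (5) and the selection rule \(d_i>\delta \) translate directly into a few lines of Python. The sketch below is illustrative only: the use of the population standard deviation, the channel names and the per-trial feature values are assumptions, and the z-score trial rejection step is omitted.

```python
from statistics import mean, pstdev

def cohens_d(class1, class2):
    """Effect size of one channel from its per-trial values in each class."""
    sigma = (pstdev(class1) + pstdev(class2)) / 2        # Eq. (5)
    return abs(mean(class1) - mean(class2)) / sigma      # Eq. (4)

def select_channels(features1, features2, delta):
    """Return the set L = {i : d_i > delta} as a list of channel names."""
    return [ch for ch in features1
            if cohens_d(features1[ch], features2[ch]) > delta]

# Hypothetical per-trial features for two channels.
features1 = {"C3": [1.0, 1.2, 0.8], "Cz": [1.0, 1.1, 0.9]}
features2 = {"C3": [2.0, 2.2, 1.8], "Cz": [1.0, 0.9, 1.1]}
selected = select_channels(features1, features2, delta=0.05)   # -> ['C3']
```

Here Cz has identical class means, so its effect size is zero and it is dropped for any \(\delta \) in the recommended range.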

Qiu et al. (2016) used a modified sequential floating forward selection (SFFS) to select channels for feature extraction. The technique used neighbouring channels as a feature for selection. Because SFFS searches in a continuous loop for channel selection and requires more time for a large feature set, the authors presented a solution that considers the location of the channels on the cerebral cortex. Adjacent channels were considered as features, so the complete feature set had fewer features, which made adding or deleting a channel easier for SFFS. The results showed that the algorithm could select channels in a very small amount of time without affecting the classification accuracy. The time for channel selection was reduced by 57% for dataset 1 and 65% for dataset 2.

4 Discussion and guidelines

This paper discussed MI-EEG channel selection algorithms based on filtering techniques, taking into account the different factors mentioned in the literature for channel evaluation and search strategy. It introduced the basics of selection algorithms along with their procedures and presented a detailed description of the filtering techniques used for channel selection in MI-based EEG applications. The exploration of channel selection algorithms has shown that, with a little pre-computation, it is possible to achieve decent performance metrics while utilizing a small subset of EEG channels. The survey showed that by implementing a channel selection algorithm, the number of channels can be reduced by up to 80% without a significant effect on classification tasks. Reducing the number of channels results in lower computational complexity and a shorter setup time. It also increases the maintainability of the device with respect to the subject.

Table 2 Summary of non-CSP based channel selection method for motor imagery EEG

Tables 1 and 2 present the summary of the filtering techniques applied in selecting channels for MI-based EEG applications explored in Sect. 3. These techniques have been carried out on a number of different databases. To determine the effectiveness of an algorithm, extensive analysis will be needed to find a technique that gives the best results for all MI-based EEG applications. It is observed that filtering techniques for channel selection have been used in a variety of applications that use MI-based EEG signals. It can also be seen from the summary in Tables 1 and 2 that for BCI Competition III, Dataset IVa (Dornhege et al. 2004), the best performing filtering technique utilized the mRMR strategy for channel selection (Shenoy and Vinod 2014). It uses SVM as a classifier and achieves an average classification accuracy of 90.77% with only 10 channels out of 118. For BCI Competition IV, Dataset IIa (Naeem et al. 2006), Cohen’s d effect size (Das and Suresh 2015) has shown a promising result of 83.61% with an average of 8.11 channels out of 22. Which algorithm is the most effective depends strongly on the application.

Many comparative studies about channel selection techniques can be found in the literature. However, sometimes these studies lead to contradictory results if applied to some other datasets. The next subsection contains the guidelines extracted from reviewing existing literature, which can help researchers in selecting a suitable channel selection algorithm.

4.1 Adequate evaluation criteria

To choose an optimal channel subset for an MI application, criteria to evaluate a channel subset are required. There are two main types of measures that can be used as evaluation criteria: information-based and classifier-based measures.

Information-based measures Information-based measures rely on generic properties of the data, such as channel ranking by a metric, interclass distance and probabilistic dependence, to evaluate the channel subset. The measures for ranking channels include the correlation coefficient (Guyon and Elisseeff 2003), mutual information (Lazar et al. 2012), symmetrical uncertainty (Kannan and Ramaraj 2010) etc. Dissimilarity measures for binary variables, such as the matching coefficient (Cheetham and Hazel 1969), and measures such as Euclidean distance and angular separation (Haralick et al. 1973) for numerical variables have also been used to measure interclass distances. For probabilistic dependence, measures such as Chernoff (Lan et al. 2007) and Bhattacharyya (He et al. 2009) have been used to measure probabilistic dissimilarities.

The advantage of information-based measures is that they are not computationally complex, require less time to compute, and usually require a one-time calculation only. For applications where time and computational complexity are limiting factors, these measures are the feasible option. The drawback is that they do not guarantee the best performance.
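As an example of an information-based ranking measure, the sketch below estimates the mutual information between a discretised channel feature and the class labels from empirical frequencies. It is a minimal illustration: the binarised features and channel names are hypothetical, and real EEG features would first need a discretisation step that is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical binarised features (e.g. band power above/below a threshold).
labels = [0, 0, 1, 1]
channel_features = {"A": [0, 0, 1, 1], "B": [0, 1, 0, 1]}
ranking = sorted(channel_features,
                 key=lambda ch: mutual_information(channel_features[ch], labels),
                 reverse=True)   # -> ['A', 'B']
```

Channel A determines the label completely (1 bit), whereas channel B is independent of it (0 bits); the score is computed once per channel and involves no classifier, which is the source of the low cost noted above.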

Classifier-based measures These measures use classification accuracy to evaluate a channel subset. A classifier is responsible for finding the separability measure, and a channel is selected when the classifier performs well. The error rate of the classifier is a widely used measure to evaluate the classification algorithm. Other measures used in the literature include Chi-squared (Greenwood and Nikulin 1996), information gain, odds ratio, and probability ratio (Şen et al. 2014). The problem with classifier-based evaluation is that the results are specific to the classifier used, and changing the classifier affects the performance. The other problems are the computation and time complexity. However, it guarantees the best results.

4.2 Channel selection approaches

Every channel selection method has its own characteristics. Filter methods use statistical properties of channels to filter out weak channels. These methods are classifier independent and rely on information measures. Using information measures to evaluate channels makes filter methods more generic and computationally more efficient. However, these methods usually perform worse than other methods. The wrapper method, on the other hand, is computationally demanding because the channel subsets are evaluated by a classifier, so the approach is classifier dependent, but it tends to give much better classification accuracy.

Hybrid channel selection techniques have the properties of both filter and wrapper methods and thus can be used to create more generic channel selection methods. A newer approach applies several channel selection methods instead of selecting one and accepting its result as the final channel subset, and then combines the results with an ensemble approach to get the best channel subset, because there can be more than one optimal channel subset (Haury et al. 2011; Abeel et al. 2009).
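One simple way to combine the outputs of several selection methods is a vote count, sketched below. Majority voting is an assumption chosen for illustration; the cited ensemble approaches may aggregate rankings differently, and the subsets shown are hypothetical.

```python
from collections import Counter

def ensemble_select(subsets, min_votes):
    """Keep the channels chosen by at least min_votes of the selection methods."""
    votes = Counter(ch for subset in subsets for ch in set(subset))
    return sorted(ch for ch, v in votes.items() if v >= min_votes)

# Hypothetical outputs of three different channel selection methods.
subsets = [["C3", "C4", "Cz"], ["C3", "C4", "FC3"], ["C3", "CP4", "C4"]]
consensus = ensemble_select(subsets, min_votes=2)   # -> ['C3', 'C4']
```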

4.3 Search algorithms

There are three main types of search algorithms: complete search, sequential search, and heuristic/random search. Each has its own pros and cons. Complete search methods are guaranteed to find the best channel subset according to the evaluation criteria. The most common complete search algorithm is branch and bound (Burrell et al. 2007). In sequential search, channels are added and removed sequentially; these methods are usually not optimal but are simple to implement and fast. The common sequential search methods are SFS (Pudil et al. 1994), SBS (Reunanen 2003), plus l minus r (Nakariyakul and Casasent 2009), and SFFS (Reunanen 2003). The random search method introduces randomness into the above-mentioned search algorithms and generates the next subset randomly. Simulated annealing (Yang 2014) and genetic algorithms (Davis 1991) are the best examples of random search. These approaches are useful when the data is too big and computational time is short.
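A sequential forward search can be written generically over any subset evaluation criterion, as in the sketch below. The scoring function `toy_score`, with its relevance values and redundancy penalty, is entirely hypothetical and stands in for either an information-based or a classifier-based measure.

```python
def sequential_forward_selection(channels, score, n_select):
    """Greedy SFS: repeatedly add the channel that maximises score(subset)."""
    selected = []
    remaining = list(channels)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda ch: score(selected + [ch]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical criterion: individual relevance minus a penalty for redundant pairs.
relevance = {"C3": 0.9, "C1": 0.85, "C4": 0.8, "Cz": 0.3}
redundancy = {frozenset({"C3", "C1"}): 0.5}

def toy_score(subset):
    total = sum(relevance[ch] for ch in subset)
    for pair, penalty in redundancy.items():
        if pair <= set(subset):
            total -= penalty
    return total

best_two = sequential_forward_selection(relevance, toy_score, n_select=2)
# -> ['C3', 'C4']: C1 is individually stronger than C4 but redundant with C3
```

Because each step only looks one channel ahead, the result is not guaranteed to be the globally best subset, which is the trade-off against complete search noted above.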

4.4 Feature extraction

In channel selection for EEG applications, feature extraction also plays an important role in maximizing the performance of the system. Feature extraction techniques that maximize the interclass variability should be used. The most commonly used features for MI applications can be divided into four categories: time domain, parametric model based, transformed domain and frequency-based features. In the time domain, features such as the mean, variance and Hjorth parameters (Vidaurre et al. 2009) have been used, but these are considered weak features. The auto-regression model (Pfurtscheller et al. 1998) and common spatial patterns (CSP) (Wang et al. 2006) are examples of model-based features. Fourier and wavelet transforms (Xu and Song 2008; Baig et al. 2016) have also been used in the literature to extract features, which fall under the transform features category. For frequency domain features, power spectral density and spectral edge frequency-based features have mostly been used in the literature (Baig et al. 2014; Pfurtscheller and Da Silva 1999). After reviewing the literature on channel selection algorithms, the most effective features are the ones extracted by CSP and its variations.
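As an illustration of the time-domain category, the Hjorth parameters (activity, mobility and complexity) of a single-channel signal can be computed from the variances of the signal and its successive differences. The sketch below uses first differences as a discrete derivative, which is a common but assumed choice, and a synthetic 5 Hz sine sampled at 250 Hz as hypothetical input.

```python
import math
from statistics import pvariance

def hjorth_parameters(signal):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    d1 = [b - a for a, b in zip(signal, signal[1:])]   # first difference
    d2 = [b - a for a, b in zip(d1, d1[1:])]           # second difference
    activity = pvariance(signal)
    mobility = math.sqrt(pvariance(d1) / activity)
    complexity = math.sqrt(pvariance(d2) / pvariance(d1)) / mobility
    return activity, mobility, complexity

# Synthetic input: 2 s of a 5 Hz sine sampled at 250 Hz.
signal = [math.sin(2 * math.pi * 5 * t / 250) for t in range(500)]
activity, mobility, complexity = hjorth_parameters(signal)
```

For a pure sinusoid the complexity is close to 1 and the activity equals the signal power (about 0.5 here), so deviations from these values indicate a richer spectral content.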

4.5 Feature selection

Feature selection is also important in improving the system performance and in selecting suitable channels. The features extracted from the EEG might be of high dimensionality, redundant or contain outliers that degrade the performance of the system. A feature selection algorithm is the option to deal with these factors. It is also essential to apply a feature selection algorithm to reduce the computational cost if the EEG data is too big (Al-Ani and Al-Sukker 2006). Extracting and selecting the best features that maximize the classification accuracy or class variance will eventually lead towards selecting optimal and relevant channels for the EEG application.

5 Conclusion

The evaluation of channel selection algorithms is a challenging task. Various parameters, such as time, complexity and accuracy, can be used to evaluate different channel selection algorithms. For real-time applications, time and accuracy are the most important parameters to consider. The performance of wrapper and hybrid selection techniques relies heavily on the choice of classifier and on the subject. Filter techniques, however, are subject and classifier independent, as they use other signal information related parameters to select the best channel subset. The other main question in channel selection is how to determine the optimal number of channels to be selected. This question is difficult to answer because of the complexity of the human brain. EEG decoding methods are very difficult to generalize: a slight change in the experiment will affect the signal processing, feature extraction and classification methods. The same holds for channel selection, in which the optimal channel set is highly dependent on the application, features, evaluation criteria and classifiers. Traditional approaches selected the optimal number of channels using criteria such as the convergence of the classification accuracy under cross-validation or an analytical solution to an optimization problem, but the optimal channels are ultimately the ones that preserve more information about a task than the others. In the end, it is a trade-off between the number of channels, system performance, time and computation cost. After reviewing the literature, we conclude that the brain cortex regions that are relevant to the application often appear in the optimal channel subset. To design the optimal channel selection algorithm, an extensive and deep analysis of each technique has to be performed. We also need to study the performance sensitivity with respect to different types of tasks and classifiers. The application of evolutionary algorithms in selecting channels for EEG is under research and is open for further investigation.