Adaptive learning with covariate shift-detection for motor imagery-based brain–computer interface

A common assumption in traditional supervised learning is the similar probability distribution of data between the training phase and the testing/operating phase. When transitioning from the training to testing phase, a shift in the probability distribution of input data is known as a covariate shift. Covariate shifts commonly arise in a wide range of real-world systems such as electroencephalogram-based brain–computer interfaces (BCIs). In such systems, there is a necessity for continuous monitoring of the process behavior, and tracking the state of the covariate shifts to decide about initiating adaptation in a timely manner. This paper presents a covariate shift-detection and -adaptation methodology, and its application to motor imagery-based BCIs. A covariate shift-detection test based on an exponential weighted moving average model is used to detect the covariate shift in the features extracted from motor imagery-based brain responses. Following the covariate shift-detection test, the methodology initiates an adaptation by updating the classifier during the testing/operating phase. The usefulness of the proposed method is evaluated using real-world BCI datasets (i.e. BCI competition IV dataset 2A and 2B). The results show a statistically significant improvement in the classification accuracy of the BCI system over traditional learning and semi-supervised learning methods.


Introduction
In traditional machine learning techniques, data are assumed to be drawn from stationary distributions. While training a traditional supervised classifier, it is commonly assumed that the input data in the training set and the testing set follow the same probability distribution (Grossberg 1988; Mitchell 1997; Kelly et al. 1999; Vapnik 1999; Duda et al. 2001; Bishop 2006). However, in real-world applications, processes are often non-stationary, and the data distribution may shift over time.
With applications working in non-stationary environments (NSEs), the data distribution shifts over time; in general this may be due to thermal drift, ageing effects, and noise. The scenario where the training set and testing set follow different input distributions but the conditional distribution remains unchanged is known as covariate shift (Sugiyama et al. 2007; Li et al. 2010). Non-stationarity is quite common in real-world applications, especially in systems interacting with dynamic and evolving environments, e.g., data coming from electroencephalogram (EEG)-based brain-computer interfaces (BCIs), share price prediction in the stock market, and wireless sensor networks. Achieving high classification accuracy in a BCI is particularly challenging because the signals may be highly variable over time.
A BCI is an alternative means of communication, which allows a user to express his or her will without muscle exertion, provided that the brain signals are properly translated into computer commands (Wolpaw et al. 2002). An EEG-based BCI that operates online in real-time non-stationary/changing environments requires either input features that are invariant to shifts of the data within long sessions and across sessions, or learning approaches that are able to detect changes that may repeat over time, so that the classifier can be updated in a timely fashion. The non-stationarities in the EEG may have various causes, such as changes in user attention level, electrode placement, and user fatigue (Li et al. 2010; Blankertz et al. 2008; Raza et al. 2015b). Due to these non-stationarities, notable variations or shifts in the EEG signals are expected during trial-to-trial and session-to-session transfers (Blankertz et al. 2002; Li et al. 2010; Arvaneh et al. 2013a; Raza et al. 2013a, 2015b). These variations often appear as covariate shifts in the EEG signals, wherein the input data distributions differ significantly between the training/calibration and testing/operating phases, while the conditional distribution remains the same (Raza et al. 2013b; Satti et al. 2010; Sugiyama et al. 2007; Shimodaira 2000; Raza et al. 2014). To date, low classification accuracy has been one of the main concerns with BCI systems based on motor imagery (MI) detection, as it directly affects the reliability of the BCI (Li et al. 2010; Blankertz et al. 2008; Rezaei et al. 2006). To enhance the performance of BCI systems, several feature extraction, feature selection, and feature classification techniques have been proposed in the literature (Shahid and Prasad 2011; Suk and Lee 2013; Kuncheva and Faithfull 2014; Buttfield et al. 2006; Vidaurre et al. 2006; Coyle et al. 2009; Ramoser et al. 2000; Arvaneh et al. 2013a, b). A large variety of features have been used in MI-based BCIs, such as band powers, power spectral density, time-frequency features, and common spatial pattern (CSP)-based features (Raza et al. 2015a). However, due to the brain's non-stationary characteristics, the spatial distribution of the brain-evoked responses may change over time, resulting in shifts in feature distributions (Herman et al. 2008).
The main drawback of the solutions proposed in the related literature is the requirement of labeled data before starting the adaptation in the evaluation/operating phase (Li et al. 2010; Sugiyama 2012). Additionally, most of the shift-detection methods in the literature rely on batch processing to detect a dataset shift (Gama and Kosina 2014; Alippi et al. 2013; Elwell and Polikar 2011; Gama et al. 2014), which introduces a time delay in shift-detection. Batch processing methods are therefore poorly suited to real-time systems, where initiating adaptation in a timely manner is of primary interest. In this paper, we present a novel design methodology for adaptive classification, which monitors the covariate shift in the input streaming data (i.e., EEG features) through an exponential weighted moving average (EWMA) model-based covariate shift-detection (CSD) test (Raza et al. 2013a, b). The CSD test operates in two stages: the first stage deals with covariate shift-detection, and the second stage corresponds to covariate shift-validation. This two-stage structure helps to reduce the false detection rate, and thereby unnecessary retraining of the classifier. The classifier adaptation is only initiated once the covariate shift is confirmed by the validation stage; the classifier is then retrained on the updated knowledge base (KB) discussed later in Sect. 4.
The proposed method uses two different adaptation mechanisms to update the knowledge base (KB, i.e., training data) of the classifier with new knowledge. In the first method, a transductive learning approach is used to add the relevant information to the KB after each CSD. The transductive learning is only used to increase the size of the KB; the overall classification is performed using an inductive classifier. In the second method, the KB is updated incrementally using the correctly predicted labels after each CSD. Experiments on real-world datasets show that the proposed method can adapt to the covariate shift. Using the data from the BCI competition IV datasets 2A and 2B, we demonstrate that the proposed method can outperform a traditional learning approach and other competing methods. A preliminary version of the proposed methodology was presented in our conference paper (Raza et al. 2014); here, we extend the study of adaptive learning with covariate shift-detection by conducting an extensive experimental evaluation on motor imagery-based BCI datasets. In particular, our main focus is to account for the covariate shift that may arise during session-to-session transfer in BCI experiments. In addition, we perform a thorough analysis of the feature extraction techniques, to extract better discriminative features for the classifier. The novel contributions of the paper can thus be summarized as follows:
• A covariate shift-adaptation model is introduced to address the effects of non-stationarity in the EEG signals.
The remainder of the paper proceeds as follows: first, Sect. 2 describes the proposed methodology for the covariate shift-detection, -validation, and -adaptation; Sect. 3 presents an application of the method to BCI. Then, the results are detailed in Sect. 4. Finally, the implications of the results are discussed in Sect. 5.

Adaptive learning problem formulation
Let us consider a learning framework in which the training dataset is denoted by $X_{Tr} = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of observations, and a target label $y_i$ is associated with each input $x_i$. Depending upon the number of inputs and outputs, $x_i$ and $y_i$ may be scalar or vector variables. In the following, the training dataset serves as the initial KB. Let us consider a two-class classification problem, i.e., $y \in \{C_1, C_2\}$, where $y_i = C_1$ if $x_i$ belongs to class $\omega_1$, and $y_i = C_2$ if $x_i$ belongs to class $\omega_2$. For example, in a support vector machine (SVM), we have $C_1 = -1$ and $C_2 = +1$. The probability distribution of the inputs at time $i$ can thus be defined as $P(x_i) = P(\omega_1)P(x_i|\omega_1) + P(\omega_2)P(x_i|\omega_2)$, where $P(\omega_1)$ and $P(\omega_2)$ are the prior probabilities of getting a sample of the classes $\omega_1$ and $\omega_2$, respectively, while $P(x_i|\omega_1)$ and $P(x_i|\omega_2)$ are the class-conditional probability distributions at time $i$.
The goal is to predict the labels of the upcoming samples $X_{Ts}$, where $M$ is the number of observations in the testing phase.

Algorithm overview
The proposed algorithm with covariate shift-detection (CSD) belongs to the category of incremental learning (Elwell and Polikar 2011), where the learning model is updated at each CSD. The covariate shift monitoring is performed using the CSD-EWMA test (Raza et al. 2013a, b, 2015b). An advantage of using the CSD test is its accuracy in terms of low false-positive and low false-negative rates. The proposed algorithm is a single classifier-based non-stationary learning (NSL) algorithm that uses the CSD-EWMA test for initiating adaptive corrective action. The algorithm is provided with a time-series training dataset KB, where KB = $X_{Tr}$, and a classifier F is trained. In the evaluation phase, the CSD-EWMA test is used to monitor and detect the covariate shift, and the classifier F is used to classify the upcoming input data $X_{Ts}$.
The key elements of the proposed solution are:
• CSD: the CSD test monitors the stationarity of $x_i$, disregarding their supervised labels.
• F: the pattern classifier F is used to classify the input samples.
• KB: the current knowledge base (KB), updated on each CSD.
The proposed solution is described in Algorithm 1. After a preliminary configuration phase of the initial classifier F and the CSD test on the KB, the CSD test is used to assess the process stationarity. As soon as the CSD-EWMA test detects a covariate shift in the upcoming unlabeled data, the learned classifier model becomes obsolete and has to be replaced with a newly configured/retrained model. At each CSD, new information (i.e., $KB_{New}$) becomes available, containing the information about the new data distribution. Next, $KB_{New}$ is merged with the existing KB, and a new KB is prepared. To prepare the updated KB, two methods are identified: the first is transductive learning with CSD (TLCSD), and the second is adaptive learning with CSD (ALCSD). The interactions between the covariate shift-detection, -validation, and -adaptation stages are illustrated in Figs. 1 and 2, which are explained in the following subsections.

Covariate shift-detection (CSD)
The first step required in a CSD test is to detect the covariate shift in the process, possibly without relying on prior information about the process data distribution before and after the shift. This is a crucial step for reconfiguring the classifier, and it acts as an alarm. The first stage of the test provides an initial estimate of the shift (i.e., where the actual shift has occurred). The first stage test is performed by an SD-EWMA test (Raza et al. 2013a). If the test outcome at the first stage is positive, then the second stage test is activated, and a validation is performed to reduce the number of false alarms (Raza et al. 2013b). The second stage test/validation procedure is discussed in the next subsection. The choice of the smoothing constant λ and the control limit multiplier L is an important issue in the EWMA-CSD test. The choice of λ and L is discussed in Sect. 4.
In an EEG-based BCI, the EEG signals are obtained from multiple electrodes, and the application of a feature extraction procedure results in a set of features; hence, BCI input data are multivariate. Monitoring such input processes independently may be misleading, e.g., if the probability that a variable exceeds three-sigma control limits is 0.0027, then a false detection rate of 0.27 % is expected. However, the joint probability that d variables exceed their control limits simultaneously is $(0.0027)^d$. So, the use of d independent control charts may provide highly distorted outcomes. A principal component analysis (PCA) is, therefore, used to reduce the dimensionality of the data (Rosenstiel et al. 2012; Kuncheva and Faithfull 2014). It provides fewer components, containing most of the variability in the data. We have used a single component to monitor the shift in the process using the SD-EWMA test (Raza et al. 2013a) at the first stage.
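The stage-I monitoring described above can be sketched as follows. This is a minimal illustration on a single (e.g., first principal) component, assuming the standard EWMA recursion z_i = λ x_i + (1 − λ) z_{i−1} and control limits of z_{i−1} ± L σ_err, where σ_err is the standard deviation of the 1-step-ahead errors on the training data. The function names and the fixed error variance are assumptions made for illustration; the exact formulation of the SD-EWMA test is given in Raza et al. (2013a).

```python
from statistics import mean, pstdev

def fit_ewma(train, lam):
    """Run the EWMA over training data; return the last EWMA value and error std."""
    z = mean(train)              # initialise the EWMA at the training mean
    errors = []
    for x in train:
        errors.append(x - z)     # 1-step-ahead prediction error
        z = lam * x + (1.0 - lam) * z
    return z, pstdev(errors)

def detect_shifts(stream, z0, sigma_err, lam, L=2.0):
    """Return indices where an observation leaves z_{i-1} +/- L * sigma_err."""
    z, alarms = z0, []
    for i, x in enumerate(stream):
        if abs(x - z) > L * sigma_err:   # outside the control limits
            alarms.append(i)
        z = lam * x + (1.0 - lam) * z    # update the forecast for the next sample
    return alarms
```

In this sketch every alarm would then be passed to the validation stage rather than triggering retraining directly.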

Covariate shift-validation
According to Algorithm 1, the KB of the classifier has to be updated at each CSD. However, false positives (i.e., detections that do not correspond to a true shift in the input distribution) result in unnecessary retraining. To counter this, we have introduced a covariate shift-validation procedure as part of a two-stage test structure (Raza et al. 2013b). This strategy aims at guaranteeing that the classifier relies on an up-to-date KB, and that the classifier is only retrained on the occurrence of a valid shift. The covariate shift-validation procedure exploits two sets of observations generated before and at the CSD time point. The observations from the KB are assumed to be in a stationary state and are compared with data from the current trial, at the CSD time point. To validate the CSD from stage I, a multivariate Hotelling's T-squared statistical hypothesis test is used (Hotelling 1947). If the p value of the test is below 0.05, then the CSD is confirmed; otherwise it is considered a false alarm. On each CSD, $KB_{New}$ is obtained based on the current shift in the data.
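As an illustration of the validation step, the sketch below computes a two-sample Hotelling's T-squared statistic and its equivalent F statistic with NumPy. It assumes equal covariance in the two samples (pooled covariance) and more observations than features; how the two windows are formed from the KB and the current trial is not specified here, and the function name is hypothetical.

```python
import numpy as np

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 and the equivalent F statistic.

    a, b: arrays of shape (n_a, d) and (n_b, d).
    Returns (T^2, F, (dfn, dfd)) with F ~ F(dfn, dfd) under the null hypothesis.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    n_a, d = a.shape
    n_b = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled covariance under the equal-covariance assumption
    pooled = ((n_a - 1) * np.cov(a, rowvar=False)
              + (n_b - 1) * np.cov(b, rowvar=False)) / (n_a + n_b - 2)
    pooled = np.atleast_2d(pooled)
    t2 = (n_a * n_b) / (n_a + n_b) * diff @ np.linalg.solve(pooled, diff)
    f_stat = t2 * (n_a + n_b - d - 1) / (d * (n_a + n_b - 2))
    return float(t2), float(f_stat), (d, n_a + n_b - d - 1)
```

The p value follows from the upper tail of the F distribution with the returned degrees of freedom (for instance via a statistics library); the shift would be confirmed when it falls below 0.05.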

Covariate shift-adaptation
Once the CSD is validated, the adaptation phase starts (see Fig. 2). To adapt to the shift, the classifier must be retrained. In order to retrain the classifier, an additional set of input-target pairs is necessary to prepare the KB. To obtain this set of input-target pairs, we have investigated two ways of managing the KB. In the first scenario (i.e., TLCSD), we have applied a transductive-inductive learning model to adapt to a potential covariate shift. However, the transduction part is only used to add new trials into $KB_{New}$, and an inductive classifier is used to classify the upcoming samples from the evaluation phase. The transduction part only starts once the covariate shift is detected and validated. In the second scenario (i.e., ALCSD), it is assumed that during the evaluation phase, a true label is available after each trial. Once the covariate shift is detected, only correctly predicted labels are added into $KB_{New}$, the classifier is retrained, and the updated classifier is used for further classification. This approach is similar to co-training (Zhu 2008) used in semi-supervised learning (SSL), where the predicted labels are used to train another classifier.
Both the methods mentioned above that are used to adapt the classifier in relation to the covariate shift are presented hereafter.

Transductive learning with CSD (TLCSD)
A TLCSD model is based on a probabilistic K-nearest neighbor (KNN) method. Initially, according to Algorithm 1, at step 1, an inductive classifier F is trained on the initial KB, and at step 2, the parameters λ and L are set for the CSD test. Once the classifier F is trained, the evaluation phase starts. At step 3, the parameters λ, L, $CR_{Thres}$, and K are set, wherein $CR_{Thres}$ is a confidence ratio threshold that is used to decide the usefulness of a trial, and K is the number of neighbors for the transductive learning. In the evaluation phase, the classifier takes as input the features obtained from the testing data. The classifier initiates adaptation through transduction after every CSD. Each time the classifier initiates adaptation at step 7, it is considered as one epoch, and it takes m data points to predict the labels through a transductive function T, where m is the number of points between two shift-detection points, or from the start of the evaluation phase to the first detection point. Once the adaptation is initiated at each epoch, the Euclidean distance $d_{p,q}$ from the unlabeled data point $x_p$ to the labeled data point $x_q$ is computed as given below:
$$d_{p,q} = \lVert x_p - x_q \rVert_2 = \sqrt{\textstyle\sum_{j} (x_{p,j} - x_{q,j})^2} \quad (1)$$
This provides a vector $D = [d_{(p,q_1)}, \ldots, d_{(p,q_N)}]$ of Euclidean distances from the unlabeled data point to the N labeled data points. Then, the K nearest neighbors are selected. For each of the K nearest points, an RBF kernel is used to compute the weight, as given in Eq. (2).
From Eq. (2), we have $0 \le K(p, q) \le 1$. A weight with a high value implies that the data point is close to the current unlabeled feature vector. The weight R(i) for each neighbor is given by Eq. (3). Using R(i) and the existing KB, a confidence ratio $CR_{\omega_i}$ is obtained for each of the classes by Eqs. (4a) and (4b). The confidence ratio $CR_{\omega_i}$ attained from Eqs. (4a) and (4b) may be viewed as a posterior probability of the class membership of the current unlabeled data point, as $CR_{\omega_1} + CR_{\omega_2} = 1$. This $CR_{\omega_i}$ acts as a belief or confidence that a data sample belongs to a particular class. In this step, $CR_{\omega_i}$ is obtained for each observation from the m data points and is used to decide whether both the trial's features and the estimated output label should be added to the existing knowledge base, i.e., if $\max(CR_{\omega_1}, CR_{\omega_2}) > CR_{Thres}$, then the couple (EEG signal corresponding to the trial, estimated output label) is added into $KB_{New}$; otherwise it is discarded. At step 7, this $KB_{New}$ is then merged into the existing KB. Based on the updated KB, the inductive classifier function is updated, and a new classifier F is obtained at step 8. Every time a new $KB_{New}$ is created, the classifier F is updated, and this process is repeated until all the M points in the testing phase are classified.
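The transductive selection step can be sketched as below. The sketch assumes a Gaussian RBF weight exp(−d²/(2σ²)) with a hypothetical bandwidth σ, and a confidence ratio defined as the share of the total neighbor weight carried by each class; these choices are illustrative assumptions that satisfy the properties stated above (weights in [0, 1], ratios summing to 1), not necessarily the exact forms of Eqs. (2) to (4b).

```python
import math

def confidence_ratios(x_p, kb, k=3, sigma=1.0):
    """kb: list of (feature_vector, label) pairs with labels 1 or 2.

    Returns (CR_1, CR_2), which sum to 1.
    """
    # Euclidean distance from the unlabeled point to every labeled point
    dists = [(math.dist(x_p, x_q), y_q) for x_q, y_q in kb]
    neighbours = sorted(dists, key=lambda t: t[0])[:k]
    # RBF weight in [0, 1]; the bandwidth sigma is an assumed parameter
    weights = [(math.exp(-d * d / (2.0 * sigma ** 2)), y) for d, y in neighbours]
    total = sum(w for w, _ in weights)
    if total == 0.0:                     # all weights underflowed: no evidence
        return 0.5, 0.5
    cr1 = sum(w for w, y in weights if y == 1) / total
    return cr1, 1.0 - cr1

def select_for_kb(trials, kb, cr_thres=0.7, k=3):
    """Keep (trial, estimated label) pairs whose confidence exceeds cr_thres."""
    kb_new = []
    for x in trials:
        cr1, cr2 = confidence_ratios(x, kb, k=k)
        if max(cr1, cr2) > cr_thres:
            kb_new.append((x, 1 if cr1 >= cr2 else 2))
    return kb_new
```

Trials that sit between the two classes receive a ratio close to 0.5 and are discarded, which is the filtering role played by $CR_{Thres}$.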

Adaptive learning with CSD (ALCSD)
In ALCSD, initially at step 1 of Algorithm 1, an inductive classifier F is trained with the initial KB of N labeled trials. Using the KB at step 2, the parameter λ is obtained for the CSD test, and the control limit multiplier for the CSD test is set to L = 2. Then, the evaluation phase starts at step 4, and unlabeled features from $X_{Ts}$ are processed sequentially for classification. At step 6, the CSD test is used to monitor the covariate shift.
Once the covariate shift is detected, it acts as an alarm to update the classifier. To update the classifier, new knowledge from the data is required. In order to obtain $KB_{New}$, it is assumed that the true label is available after each trial, and among all predicted labels only those correctly predicted by the inductive classifier are added into $KB_{New}$. The KB is updated with the content of $KB_{New}$ at step 7. The KB is used to retrain the classifier at step 8, and at step 10, this updated classifier is used to classify the upcoming data. On each CSD, the KB is updated, and a new classifier is created.
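The ALCSD update loop can be sketched as follows. The callables `train`, `predict`, and `shift_confirmed` are hypothetical placeholders for the classifier training routine, the resulting classifier, and the two-stage CSD test; the sketch also assumes that the trials considered at each update are those seen since the previous update, which the text does not state explicitly.

```python
def alcsd_update(kb, buffered_trials, predict):
    """Merge correctly predicted trials into the knowledge base.

    kb: list of (features, label) pairs used for training.
    buffered_trials: list of (features, true_label) seen since the last update.
    predict: callable mapping features to a predicted label.
    """
    kb_new = [(x, y) for x, y in buffered_trials if predict(x) == y]
    return kb + kb_new

def run_alcsd(kb, stream, train, shift_confirmed):
    """stream yields (features, true_label); returns predictions and the final KB."""
    predict = train(kb)
    buffer, predictions = [], []
    for i, (x, y) in enumerate(stream):
        predictions.append(predict(x))
        buffer.append((x, y))          # true label assumed available after the trial
        if shift_confirmed(i, x):      # stands in for the two-stage CSD test
            kb = alcsd_update(kb, buffer, predict)
            predict = train(kb)        # retrain on the updated KB
            buffer = []
    return predictions, kb
```

Any classifier with a train/predict interface could be plugged in; the experiments reported here use a linear SVM.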

Application to brain-computer interface
Data description

BCI Competition IV dataset 2A
The BCI competition IV dataset 2A (Tangermann et al. 2012) comprises EEG data collected from nine subjects (A01-A09), recorded during two sessions on separate days for each subject. The data consist of 25 channels, which include 22 EEG channels and 3 monopolar EOG channels. Among the 22 EEG channels, the 10 channels that capture most of the motor imagery activity are selected for this study. The selected channels are presented in Fig. 3a. The data were collected on four different motor imagery tasks: left hand (class 1), right hand (class 2), both feet (class 3), and tongue (class 4). Each session consists of six runs separated by short breaks, and each run comprises 48 trials (12 for each class), giving a total of 288 trials per session. Only class 1 and class 2 (left hand and right hand) were considered in this study (i.e., 144 trials). For more details about the dataset, refer to Tangermann et al. (2012). The motor imagery data from session-I were used to train the classifiers, and the motor imagery data from session-II were used as the test dataset.

BCI competition IV dataset 2B
BCI competition 2008-Graz dataset B (Tangermann et al. 2012) is a dataset consisting of EEG data from 9 subjects (B01-B09). Three-channel bipolar recordings (C3, Cz, and C4) were acquired with a sampling frequency of 250 Hz; the montage is depicted in Fig. 3b. All signals were recorded monopolarly with the left mastoid serving as a reference and the right mastoid as a ground. For each subject, five sessions are provided. The motor imagery data from session-I and -II were used to train the classifiers, the data from session-III were used to obtain the hyperparameters (i.e., K and $CR_{Thres}$), and the motor imagery data from session-IV and -V were used to evaluate the performance of the test. Session-IV and -V consist of 160 trials each. Each trial is a complete paradigm of 8 s; for more details refer to Tangermann et al. (2012).

Temporal filtering
The second stage of the MI-based BCI block diagram (see Fig. 4) employs two filters that decompose the EEG signals into two different frequency bands. Two band-pass filters are used, namely 8-12 Hz (μ band) and 14-30 Hz (β band). These frequency ranges are used because they cover a stable frequency response related to the MI-associated phenomena of event-related synchronization and desynchronization (ERS/ERD). In the next sections, we consider a time segment of 3 s after the cue onset for both datasets.

Spatial filtering
The third stage employs a spatial filter that maximizes the variance of spatially filtered signals under one condition, while minimizing it for the other condition. Raw EEG scalp potentials are known to have poor spatial resolution due to volume conduction. If the signal of interest is weak while other sources produce strong signals in the same frequency range, then it is difficult to classify two classes of EEG measurements (Blankertz et al. 2008). The neurophysiological background of motor imagery-based BCIs is that motor activity, both actual and imagined, causes an attenuation or increase of localized neural rhythmic activity, called event-related desynchronization (ERD) or event-related synchronization (ERS). The common spatial pattern (CSP) algorithm is highly successful in calculating spatial filters for detecting ERD/ERS (Ang et al. 2008, 2012). The objective of the CSP algorithm is to compute features whose variances are optimal for discriminating two classes of brain-evoked responses in the EEG signal.
A pair of band-pass and spatial filters in the second and third stages performs spatial filtering of EEG signals that have been band-pass filtered in a specific frequency range. Thus, each pair of band-pass and spatial filters computes the CSP features that are specific to the band-pass frequency range. CSP is a technique to analyze multichannel data based on recordings from two classes (Blankertz et al. 2008). It is a data-driven supervised decomposition of signals parameterized by a matrix $W \in R^{C \times C}$ (C: number of selected channels) that projects the single-trial EEG signal E in the original sensor space to Z, which lives in the surrogate sensor space, as follows:
$$Z = WE \quad (5)$$
where $E \in R^{C \times T}$ is the EEG measurement data of a single trial, C is the number of channels, T is the number of samples per channel, and W is the CSP projection matrix. The rows of W are the spatial filters, and the columns of W are the common spatial patterns. The spatially filtered signal Z given in Eq. (5) maximizes the difference in the variance of the two classes of EEG measurements. A CSP analysis is applied to obtain an effective discrimination of mental states that are characterized by ERD/ERS effects. However, the variances of only a small number (m) of the spatially filtered signals are generally used as features for classification. The m first and last rows of Z, i.e., $Z_t$, $t \in \{1, \ldots, 2m\}$, form the feature vector $x_t$ given by
$$x_t = \log\left(\frac{\mathrm{var}(Z_t)}{\sum_{i=1}^{2m} \mathrm{var}(Z_i)}\right) \quad (6)$$
Here, m = 1. The CSP features from both frequency bands are combined to form the input features for training a classifier. Figure 5 shows the covariate shift in the CSP features for both training and test datasets for subject A03 over the two frequency bands mu (μ, 8-12 Hz) and beta (β, 14-30 Hz). The blue crosses and red circles denote the features of the left hand and right hand motor imagery, respectively. The black line and red line represent the separation planes between the features of two classes obtained from two frequency bands as training and testing features, respectively. The separation planes are plotted for illustration purposes only.
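A minimal NumPy sketch of CSP filter estimation and feature extraction is given below. It assumes the classical formulation with trace-normalized trial covariances, whitening of the composite covariance, and log-normalized variance features; the function names are hypothetical, and the pipeline used in the experiments may differ in details such as covariance estimation or regularization.

```python
import numpy as np

def csp_filters(trials_a, trials_b):
    """trials_*: arrays of shape (n_trials, C, T). Returns W with filters as rows."""
    def mean_cov(trials):
        # Trace-normalised spatial covariance, averaged over trials
        covs = [e @ e.T / np.trace(e @ e.T) for e in trials]
        return np.mean(covs, axis=0)
    c_a, c_b = mean_cov(trials_a), mean_cov(trials_b)
    # Whiten the composite covariance, then diagonalise class A in that space
    evals, evecs = np.linalg.eigh(c_a + c_b)
    whitener = np.diag(evals ** -0.5) @ evecs.T
    d, u = np.linalg.eigh(whitener @ c_a @ whitener.T)
    order = np.argsort(d)[::-1]           # largest class-A variance first
    return u[:, order].T @ whitener

def csp_features(w, e, m=1):
    """Log-normalised variances of the m first and m last rows of Z = W E."""
    z = w @ e
    z = np.vstack([z[:m], z[-m:]])
    var = z.var(axis=1)
    return np.log(var / var.sum())
```

With m = 1 this yields two features per frequency band, so combining the μ and β bands gives a four-dimensional feature vector per trial.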

Covariate shift-detection (CSD)
The fourth stage uses the CSD test on the CSP features. In both datasets, the data are generated from multiple channels, and for each channel two features are produced from each frequency band. To use CSD-EWMA, PCA is used to reduce the number of features, and a single component is used to detect the covariate shift. To execute the CSD test, the smoothing constant λ is selected for each subject by minimizing the sum of squares of the 1-step-ahead prediction errors, and the control limit multiplier is set to L = 2. The choice of L has a major impact on the performance of the CSD test: a small value of L makes it more sensitive to minor shifts in the data. The CSD test in the operational stage detects the shifts and validates them through its two-stage structure. If the CSD test is positive, then the classifier is retrained on the KB.
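The selection of λ described above can be sketched as a simple grid search. The grid of candidate values and the initialization of the forecast with the first sample are assumptions made for illustration.

```python
def one_step_sse(series, lam):
    """Sum of squared 1-step-ahead EWMA prediction errors."""
    z = series[0]                 # initialise the forecast with the first sample
    sse = 0.0
    for x in series[1:]:
        sse += (x - z) ** 2       # z is the forecast of x made one step earlier
        z = lam * x + (1.0 - lam) * z
    return sse

def select_lambda(series, grid=None):
    """Pick the lambda on a grid that minimises the 1-step-ahead SSE."""
    grid = grid or [i / 100 for i in range(1, 101)]
    return min(grid, key=lambda lam: one_step_sse(series, lam))
```

For a steadily drifting series the search favors a λ close to 1, since the forecast must track recent samples closely, whereas a noisy stationary series favors smaller values.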

Experimental setup and classification evaluation metrics
In order to evaluate the performance of the system, we have considered the classification accuracy (in %) as the performance measure. The experiments are performed using a linear support vector machine (SVM) pattern classifier F. In the CSD tests, the percentage (%) of covariate shifts detected and validated is computed as given below:
$$\%\ \text{of shifts detected/validated} = \frac{\#\ \text{shifts detected/validated}}{\text{total number of trials}} \times 100 \quad (7)$$
The hyperparameters K and $CR_{Thres}$ need to be carefully selected. Two variants of the proposed learning method, namely TLCSD$_1$ and TLCSD$_2$, are, therefore, presented. In TLCSD$_1$, the hyperparameters are selected based on a grid search to maximize the mean accuracy across subjects, with $K \in \{6, 12, 18\}$ and $CR_{Thres}$ in the range 0.50-1. In TLCSD$_2$, the hyperparameters are determined for each subject, based on a grid search to maximize the accuracy of each subject (subject-dependent). In dataset 2A, session-I is divided into two parts; the first 80 % is used for training the pattern classifier while the remaining 20 % is used to determine the hyperparameters. The evaluation is then performed on the data from session-II. In dataset 2B, sessions I and II (240 trials) are used for training the pattern classifier, session III (160 trials) is used to obtain the hyperparameters, and sessions IV and V (320 trials) are used to evaluate the performance of the classifier. For each dataset, the accuracy corresponding to a tenfold cross-validation (10-CV) on the training data is provided. Moreover, the two variants of the proposed methods are evaluated and compared with a baseline method and a label propagation-based semi-supervised learning (SSL) algorithm. An upper bound (UB) is also provided. It is obtained by training the classifier (F) on both the training and the test datasets, with an evaluation on the test data. The baseline method uses an inductive learning classifier with CSP features (Ramoser et al. 2000), but it does not adapt/retrain its pattern classifier over time. A graph-based SSL label propagation method (Zhu and Ghahramani 2002) has been considered for comparison. To compare classifier performance with the baseline method, a two-sided Wilcoxon signed rank test is used to assess the statistical significance of the pairwise comparison at a significance level of 0.05.

Results for dataset 2A
The results corresponding to the choice of the smoothing constant λ and the CSD are presented in Table 1. The value of λ is obtained by minimizing the sum of squares of the 1-step-ahead prediction errors. In the data of subject A05, a shift was detected 15 times (i.e., 10.42 % CSD), whereas it was detected only 7 times for subject A03 (i.e., 4.86 % CSD). For subject A05, the CSD decreased from 10.42 to 4.17 % after the covariate shift-validation stage, and for subject A03, the CSD decreased from 4.86 to 1.39 %. The validation stage (stage II) thus helps to decrease the rate of false positives; consequently, the effort spent on unnecessary retraining of the classifier is also reduced. The classification accuracies on dataset 2A, for the different methods and for each subject, are given in Table 2. For TLCSD$_2$, all the subjects have shown an improvement, except for subject A08. The average accuracy of TLCSD$_2$ is 74.92 ± 15.43 %, which represents a significant improvement compared to the baseline method (p value = 0.0126). In ALCSD, the results have shown a minor improvement in the performance against the baseline method with a mean accuracy of 73.84 ± 15.93 %; only subjects A01, A02, A03, and A07 have shown improvement. The accuracy of UB is 76.70 ± 15.33 %, and it represents the performance that can be achieved if all the data are available for training, showing that the knowledge of the test data points in the evaluation of the classifier can improve the performance by only 3.23 %.
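The detection percentages quoted above follow directly from Eq. (7) with the 144 two-class trials of the evaluation session. The short check below uses the reported detection counts; the function name is hypothetical. By the same formula, the validated rates of 4.17 % and 1.39 % would correspond to 6 and 2 validated shifts, which is an inference from the percentages rather than a count stated in the text.

```python
def percent_detected(n_shifts, n_trials):
    """Eq. (7): share of trials flagged by the CSD test, in percent."""
    return 100.0 * n_shifts / n_trials

print(round(percent_detected(15, 144), 2))  # subject A05, stage I -> 10.42
print(round(percent_detected(7, 144), 2))   # subject A03, stage I -> 4.86
```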

Results for dataset 2B
The results for the choice of λ and the CSD are presented in Table 3. In this dataset, sessions IV and V are used for the evaluation phase; hence, the CSD test is performed independently for each session. In session IV, subject B01 has the maximum number of CSD (10 %), and subject B04 has the minimum number of CSD (1.88 %). After the covariate shift-validation stage, the number of CSD decreased from 10 to 4.38 % for subject B01, and from 1.88 to 0.63 % for subject B04. Moreover, in session V, subjects B06 and B08 have the maximum number of CSD (10 %), and subject B04 has the minimum number of CSD (3.75 %).
After the covariate shift-validation stage, the number of CSD decreased from 10 to 6.88 % for subject B06, and the number of CSD decreased from 10 to 5 % for subject B08.
The classification accuracies on dataset 2B, for the different methods and for each subject, are given in Table 4.The average accuracy with 10-CV is 70.71 ± 10.78 %, with subject B04 obtaining the maximum performance of 88.85 %.The baseline method gives 65.23 ± 13.98 % of average accuracy and subject B04 has the maximum accuracy of 93.13 %.The SSL-based label propagation method gives 62.74 ± 11.89 % average accuracy, which is below the baseline method accuracy.In TLCSD 1 , the parameters K and CR Thres have been fixed to K = 18 and CR Thres = 0.70, and the classification accuracy has slightly improved from 65.23 ± 13.98 to 66.15 ± 13.64 %.Next, for TLCSD 2 , all the subjects have shown an improvement.The average accuracy for TLCSD 2 is 69.72 ± 14.05 %, being statistically significantly better (p value = 0.00039) than the baseline method.
In ALCSD, the results have shown a considerable improvement in the performance against the baseline method, with a mean accuracy of 67.88 ± 14.16 %, which is statistically significantly better than the baseline method (p value = 0.0039). Moreover, for ALCSD, all the subjects have shown an improvement. The UB method reaches an accuracy of 73.33 ± 14.67 %. Figure 6 presents the average classification accuracy across subjects for both datasets (2A and 2B).

Discussion
The proposed TLCSD and ALCSD methods for the EEG-based BCI are based on a covariate shift-detection and adaptation framework. An EWMA-CSD test is used to detect the covariate shift. Once the shift is detected, an appropriate adaptive action is initiated to address the effect of the covariate shift. In TLCSD, the new information/knowledge obtained through transduction is used to update the KB (i.e., training data) of the inductive classifier. However, the main classification function is still inductive because the transductive knowledge is only used to add more information into the KB.
An important issue in the CSD is the choice of the control limit multiplier L. Choosing a small limit, L = 2, means focusing on minor shifts, such as muscular artifacts arising during trial-to-trial transfer. Long-term non-stationarities, such as those arising during session-to-session or run-to-run transfer, may instead be accounted for by a larger value, L = 3. We have selected the small value L = 2, as our aim is to detect the covariate shift that arises during trial-to-trial transfers. The proposed learning techniques make use of the CSD to detect the shift and then adapt to non-stationarities in the streaming EEG.
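The effect of the control limit multiplier L can be illustrated with a simplified EWMA control chart. The sketch below is an assumption-laden simplification rather than the exact test used in this work: the error standard deviation is estimated once from a training segment, the limits are placed at the current EWMA value ± L times that deviation, and the function name `sd_ewma_detect`, the value λ = 0.3, and the toy data are hypothetical.

```python
import math


def sd_ewma_detect(train, test, lam=0.3, L=2.0):
    """Flag test points lying outside z +/- L * sigma_err (simplified EWMA chart)."""
    # Training phase: run the EWMA and collect squared 1-step-ahead errors.
    z = train[0]
    sq_errors = []
    for value in train[1:]:
        sq_errors.append((value - z) ** 2)
        z = lam * value + (1.0 - lam) * z
    sigma_err = math.sqrt(sum(sq_errors) / len(sq_errors))

    # Testing phase: compare each new point with the control limits.
    flags = []
    for value in test:
        ucl = z + L * sigma_err  # upper control limit
        lcl = z - L * sigma_err  # lower control limit
        flags.append(value > ucl or value < lcl)
        z = lam * value + (1.0 - lam) * z  # keep tracking the process
    return flags


# Toy data (illustrative only): a stable training segment, then a test
# segment containing one moderate and one large deviation.
train = [0.0, 0.2, -0.2, 0.1, -0.1, 0.0, 0.2, -0.2, 0.1, -0.1]
test = [0.05, 0.45, -0.1, 1.5]

flags_l2 = sd_ewma_detect(train, test, L=2.0)
flags_l3 = sd_ewma_detect(train, test, L=3.0)
print("L = 2:", flags_l2)
print("L = 3:", flags_l3)
```

Because the EWMA trajectory in this sketch does not depend on L, the limits for L = 3 always enclose those for L = 2, so every point flagged with L = 3 is also flagged with L = 2. The smaller multiplier therefore reacts to smaller deviations, at the cost of more detections to validate.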
The parameter CR Thres is used to decide whether the information in hand is useful or not. If the information is useful, then it is added to the existing KB. The discarded information may come from a different distribution, or it may not have provided enough confidence to be added to the KB. The values of CR Thres and K are important and must be carefully selected to achieve superior performance. For instance, for the method TLCSD 1, the value of CR Thres is empirically selected in the range (0.50-1). In TLCSD 2, the parameters are selected by a grid search, and the accuracy is superior for both datasets. This implies that the performance of the proposed method depends upon the optimal choice of CR Thres.
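The role of K and CR Thres in deciding whether a new trial enters the KB can be sketched as follows. This passage does not specify how the confidence ratio is computed, so the sketch assumes, purely for illustration, that it is the fraction of the K nearest KB neighbours that agree with the majority label; the function names `knn_confidence` and `maybe_update_kb`, the Euclidean distance, and the toy features are hypothetical.

```python
import math
from collections import Counter


def knn_confidence(kb_features, kb_labels, x, k=18):
    """Return (majority label, fraction of the k nearest neighbours voting for it)."""
    order = sorted(range(len(kb_features)),
                   key=lambda i: math.dist(kb_features[i], x))
    nearest = order[:k]  # at most k neighbours if the KB is small
    votes = Counter(kb_labels[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def maybe_update_kb(kb_features, kb_labels, x, k=18, cr_thres=0.70):
    """Add x with its predicted label to the KB only if the confidence exceeds cr_thres."""
    label, cr = knn_confidence(kb_features, kb_labels, x, k)
    if cr > cr_thres:
        kb_features.append(x)
        kb_labels.append(label)
        return True
    return False  # discarded: not confident enough to extend the KB


# Toy two-class KB (illustrative only).
kb_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 0.8), (0.8, 0.9)]
kb_y = ["left", "left", "left", "right", "right", "right"]

print(maybe_update_kb(kb_x, kb_y, (0.1, 0.0), k=3))  # close to the "left" cluster
print(maybe_update_kb(kb_x, kb_y, (0.5, 0.5), k=3))  # between the two clusters
print(len(kb_x))
```

Under this assumed definition, a higher CR Thres admits fewer but more reliable trials, while a lower one grows the KB faster at the risk of adding mislabelled trials, which is consistent with the sensitivity to CR Thres noted above.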
The experimental results demonstrated the effectiveness of the proposed covariate shift-detection and adaptation learning strategy. The results showed that the proposed method with CSP filters and optimized parameters is significantly better than the traditional learning methods and SSL with CSP filters. The combination of EWMA-based covariate shift-detection and adaptive learning is thus a good choice for learning in non-stationary environments. The robustness of the CSD test plays an important role in initiating a correct adaptive action.

Conclusion
The proposed methodology is a flexible tool for adaptive learning in non-stationary environments and effectively accounts for the effect of the covariate shifts. In this paper, two methods (TLCSD and ALCSD) were proposed for covariate shift-adaptation using a two-stage covariate shift-detection test. The CSD test uses the SD-EWMA test in the first stage and the multivariate Hotelling's T-square statistical hypothesis test in the second stage. The CSD test was found to be very effective in detecting the covariate shifts in the data in real time. Based on the detected significant shifts, the algorithm initiates an adaptive corrective action. The performance of the proposed methods was evaluated on a multivariate cognitive task detection problem in EEG-based BCIs simulated with BCI competition IV datasets 2A and 2B, and a superior classification accuracy was obtained, as both TLCSD and ALCSD have shown statistically significant improvement. This work is planned to be extended further by employing the CSD in the task of fault monitoring.
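For readers unfamiliar with the second-stage test, a minimal sketch of a two-sample Hotelling's T-square statistic is given below. It assumes the common pooled-covariance form of the test and compares two sets of feature vectors; how the two samples are formed from the EEG stream is not specified in this passage, so the toy data, the function names, and the small Gaussian-elimination solver are illustrative assumptions.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter_matrix(rows, mean):
    """Sum of outer products of the deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(x1, x2):
    """Return (T^2, F statistic, (df1, df2)) for two samples of feature vectors."""
    n1, n2, p = len(x1), len(x2), len(x1[0])
    m1, m2 = mean_vector(x1), mean_vector(x2)
    s1, s2 = scatter_matrix(x1, m1), scatter_matrix(x2, m2)
    pooled = [[(s1[a][b] + s2[a][b]) / (n1 + n2 - 2) for b in range(p)]
              for a in range(p)]
    diff = [m1[j] - m2[j] for j in range(p)]
    w = solve(pooled, diff)  # pooled^{-1} * diff without an explicit inverse
    t2 = (n1 * n2) / (n1 + n2) * sum(d * wi for d, wi in zip(diff, w))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat, (p, n1 + n2 - p - 1)


# Toy 2-D feature sets (illustrative only): the second is a shifted copy.
before = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
after = [(a + 5.0, b + 5.0) for a, b in before]
print(hotelling_t2(before, before))
print(hotelling_t2(before, after))
```

Under the null hypothesis of equal means, the returned F statistic follows an F distribution with the returned degrees of freedom, so a detected shift would be confirmed when the statistic exceeds the critical value at the chosen significance level.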

Fig. 1 Architecture of the adaptive learning design methodology

Fig. 3 Electrode montage corresponding to the international 10-20 system. (a) Dataset 2A: among all 22 EEG channels, only ten channels are selected, as shown by the black-filled circles. (b) Dataset 2B: all channels are selected

Fig. 5 Covariate shift in the EEG dataset 2A, subject A03, between training and testing input distributions for different frequency bands: (a) mu band (8-12 Hz) and (b) beta band (14-30 Hz). The red circles denote the features of the left hand motor imagery, and blue crosses denote the

Fig. 6 Comparison of the mean accuracies for the proposed methods against the baseline, SSL, and UB on (a) dataset 2A and (b) dataset 2B. The box plots represent the standard deviation across subjects

Table 1 Results for shift-detection and validation on dataset 2A

For the baseline results, an inductive classifier is used for the classification on the test data without any adaptation on the CSP features. The baseline method gives an average accuracy of 73.46 ± 15.94 %, and subject A03, who has the smallest number of shifts, has the highest accuracy (92.36 %). The SSL label propagation method gives an average accuracy of 69.91 ± 18.22 %, which is inferior to the baseline method. In TLCSD 1, the parameters K and CR Thres have been set to K = 18 and CR Thres = 0.70, and the classification accuracy has improved slightly from 73.46 ± 15.94 to 74.07 ± 15.21 %.

Table 2
A two-sided Wilcoxon signed-rank test is used to assess the statistical significance of the improvement at a significance level of 0.05; the p value is that of the Wilcoxon signed-rank test

Table 3
Results for shift-detection and validation on dataset 2B

Table 4
A two-sided Wilcoxon signed-rank test is used to assess the statistical significance of the improvement at a significance level of 0.05; the p value is that of the Wilcoxon signed-rank test