Covariate shift adaption by normalized principal components
In this section we propose a method for covariate shift adaption that uses Principal Component Analysis (PCA) to extract the most important principal components and normalizes these components over a sliding window to reduce the effect of non-stationarity. The normalization is similar to the method described in [8, 9], but it normalizes each feature individually instead of normalizing a linear combination of all features.
PCA is a method that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of uncorrelated variables. These uncorrelated variables are called principal components and are sorted by the amount of variance that the principal components account for in the original data. The first principal component accounts for the highest proportion of variance in the original data.
The proposed method is applied after feature extraction, once the power spectrum for each channel has been estimated. Given n trials of training data, the dataset is a matrix D of dimension n×p, with the number of features p = channels·bins and D(i,j) being the value of feature j in trial i. For the covariate shift adaption, PCA is first applied for dimensionality reduction and extraction of non-stationary components. Next, the m principal components with the highest variance are selected, resulting in a p×m transformation matrix W and a matrix P = D·W that represents the m principal components. For the data presented here, m = 100 was used, which seems to be a robust value giving good results. Finally, a rectangular window of length w is defined and shifted over the data, and each value P(i,j) is normalized by the preceding w trials with

$$P_{\mathrm{norm}}(i,j) = P(i,j) - \frac{1}{w}\sum_{k=i-w}^{i-1} P(k,j)$$
For all with i≤w the window is used. We also experimented with other types of windows, e.g. a half Hanning window, but found that the rectangular window gave the best results.
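The offline procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `fit_pca` and `window_normalize` are hypothetical, "normalized" is taken to mean subtracting the mean of the preceding trials, and for the first trials, where fewer than w preceding trials exist, the sketch simply uses as many as are available and leaves the very first trial unchanged.

```python
import numpy as np

def fit_pca(D, m):
    """Return a p x m matrix W whose columns are the m leading principal axes of D."""
    # SVD of the centered data: rows of Vt are the principal axes,
    # already sorted by decreasing explained variance.
    _, _, Vt = np.linalg.svd(D - D.mean(axis=0), full_matrices=False)
    return Vt[:m].T

def window_normalize(P, w):
    """Subtract from each trial the mean of the (up to) w preceding trials."""
    P_norm = P.astype(float).copy()
    for i in range(1, P.shape[0]):
        start = max(0, i - w)  # assumption: shortened window for early trials
        P_norm[i] = P[i] - P[start:i].mean(axis=0)
    return P_norm

# Toy data: 30 trials with 8 features, reduced to 3 components, window of 5 trials.
rng = np.random.default_rng(0)
D = rng.normal(size=(30, 8))
W = fit_pca(D, m=3)
P = D @ W                      # principal components, as in P = D·W
P_norm = window_normalize(P, w=5)
```

Because each component is normalized separately, a slow drift in one component does not affect the correction applied to the others.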
When using the method online, the last w trials (P_{t−w}, …, P_{t−1}) are kept in a buffer B, and the principal components for a new trial D_t are calculated by

$$P_t = D_t \cdot W$$

P_t is normalized by the mean of the buffer:

$$P_{\mathrm{norm},t} = P_t - \frac{1}{w}\sum_{k=t-w}^{t-1} P_k$$

P_t is then added to the end of B and the first trial in B is removed, so that B always holds the latest w trials. P_{norm,t} can then be used for classification.
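The online update can be illustrated with a fixed-length buffer. The following is a minimal sketch, assuming that normalization is a subtraction of the buffer mean and that the buffer is initialized with the projected components of the last training trials; the class name `OnlineNormalizer` and its arguments are hypothetical.

```python
from collections import deque
import numpy as np

class OnlineNormalizer:
    """Keeps the latest w projected trials and normalizes each new trial by their mean."""

    def __init__(self, W, initial_trials, w):
        self.W = W
        # A deque with maxlen=w drops its oldest entry automatically on append.
        self.buffer = deque(list(initial_trials[-w:]), maxlen=w)

    def process(self, d_t):
        p_t = d_t @ self.W                              # project the new trial
        p_norm = p_t - np.stack(list(self.buffer)).mean(axis=0)
        self.buffer.append(p_t)                         # store the unnormalized components
        return p_norm                                   # pass this to the classifier

# Toy example with an identity projection and a window of 2 trials.
W_demo = np.eye(2)
initial = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]])
normalizer = OnlineNormalizer(W_demo, initial, w=2)
first = normalizer.process(np.array([10.0, 10.0]))
second = normalizer.process(np.array([0.0, 0.0]))
```

Storing the unnormalized components in the buffer keeps the window mean an estimate of the current offset of the raw components.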
Covariate shift adaption methods
In the following, we give an overview of the different methods for covariate shift adaption that are tested in this paper. The covariate shift adaption methods were applied after the signal processing, which will be explained in the next section.
Satti et al.: Satti et al. proposed a method for covariate shift adaption that uses a polynomial function to estimate the covariate shift of the next trial and adapts the data accordingly. In the following, we used a polynomial of order 3, as stated in their work, and used the previous w trials to fit the polynomial to the data.
baseline: as a reference method we use results without covariate shift adaption.
pcanorm: this is the method for covariate shift adaption by normalized principal components as proposed in this paper in the previous section.
pcapoly: with this method we propose a slightly different approach than the one presented in the previous section. PCA is also used, but instead of shifting a window over the data and normalizing by the last w trials, a method similar to that of Satti et al. is used: a polynomial is fitted to the content of the window and the next trial is adjusted by the value that the polynomial function would estimate. Again, a polynomial of order 3 is used.
pcaonly: for this method, PCA was used for extraction of non-stationarities without any further covariate shift adaption.
Although different values of w were tested, w was kept constant at w = 15 for the reported results to provide a fair comparison between the methods.
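The polynomial-based variants in the list above share one core step: fit a polynomial to the window and correct the next trial by the extrapolated value. The sketch below illustrates that step only. It assumes that "adjusted" means subtracting the predicted value and that the polynomial is fitted over the trial index separately for each feature or component; it does not reproduce the exact procedure of Satti et al., and the function name `poly_adjust` is hypothetical.

```python
import numpy as np

def poly_adjust(window, new_trial, order=3):
    """Fit one polynomial per column of `window` and subtract its forecast from `new_trial`."""
    w = window.shape[0]
    x = np.arange(w)                                 # trial index inside the window
    coeffs = np.polyfit(x, window, order)            # shape (order + 1, n_columns)
    predicted = np.vander([w], order + 1) @ coeffs   # extrapolate to the next index
    return new_trial - predicted[0]

# Toy example with w = 15: two drifting components and a new trial lying 0.5 above the trend.
x = np.arange(15)
window = np.column_stack([0.01 * x**3 - 2 * x + 1, 0.5 * x + 3])
trend_next = np.array([0.01 * 15**3 - 2 * 15 + 1, 0.5 * 15 + 3])
adjusted = poly_adjust(window, trend_next + 0.5)
```

In this toy case the drift is removed and only the deviation from the fitted trend remains in `adjusted`.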
To evaluate the advantage of the different covariate shift adaption methods, we performed an offline analysis on data recorded for another study. In this study, 10 subjects performed motor imagery of right hand movement and a subtraction task. In the subtraction task, the subject had to choose a random number (around 100) and subtract 7. The result was not to be communicated; instead, the subject continued subtracting 7 from each result until the end of the trial. Two sessions were recorded on different days, with 51 trials per task and session. Recording was done with a 275-channel whole-head MEG system (VSM MedTech Ltd.) at a sampling rate of 586 Hz. Each trial lasted 4.05 seconds, with a break of about 6 seconds between trials. Instructions were given on a screen, and a fixation cross was displayed during trials to minimize eye movement.
Signal processing and classification
The signals were filtered and resampled to 200 Hz. For spatial filtering, a small Laplacian derivation was applied. To reduce the number of channels, we used only the 185 inner channels, which should also reduce the influence of possible artifacts, since these are most prominent on the outer channels. After preprocessing, the power spectrum was estimated with an autoregressive model computed by the Burg method, as used in a previous MEG-BCI. A model order of 16 was used, since it gave the best results in previous MEG-BCI experiments. We used the frequency range from 1 to 40 Hz with a bin width of 2 Hz. The logarithm was applied to each value. Before classification, we used r²-ranking for feature selection. The number of features was not estimated individually on the training data, because in our experience this would have introduced overfitting. Instead, a fixed number of 1000 features was used, which on average gave the best results when evaluated by cross-validation. Each feature was normalized to zero mean and unit variance on the training dataset. The test dataset was scaled using the mean and standard deviation of the training dataset.
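The last two steps of this pipeline, feature ranking and scaling with training statistics, can be sketched as follows. This is an illustration under assumptions: r² is taken to be the squared Pearson correlation between a feature and the class label, features are assumed to be non-constant so that no division by zero occurs, and the function names are hypothetical.

```python
import numpy as np

def r2_scores(X, y):
    """Squared Pearson correlation between each feature column and the class label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc * yc[:, None]).sum(axis=0) ** 2
    den = (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    return num / den

def select_and_scale(X_train, y_train, X_test, k):
    """Keep the k best-ranked features and z-score both sets with training statistics."""
    idx = np.argsort(r2_scores(X_train, y_train))[::-1][:k]
    mu = X_train[:, idx].mean(axis=0)
    sd = X_train[:, idx].std(axis=0)
    # The test data reuse mu and sd from training, so no test information leaks in.
    return (X_train[:, idx] - mu) / sd, (X_test[:, idx] - mu) / sd, idx

# Toy data: 40 training trials, 5 features, only feature 2 carries class information.
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 5))
X[:, 2] += 5 * y
X_new = rng.normal(size=(10, 5))
train_z, test_z, selected = select_and_scale(X, y, X_new, k=2)
```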
For classification we used LibSVM with C = 1 and a linear kernel. We decided against parameter estimation by grid search and cross-validation because it had introduced overfitting in previous experiments.
Offline accuracy evaluation
To evaluate the performance of the pcanorm method proposed in this paper and to compare it with previously described covariate shift adaption methods, we applied the respective covariate shift adaption method, trained the classifier on session 1 of the data, and tested it on session 2. This procedure is referred to as S1S2-validation below. It specifically addresses the benefit of the covariate shift adaption methods in the context of the session-transfer problem.
For comparison, we also performed a 5×10-fold cross-validation on all data, in which the data were permuted and partitioned into 10 blocks of equal size. In each fold, 9 blocks were used for training the classifier (including feature selection and PCA), and the classifier was tested on the remaining block. Each block was used for testing once. This procedure was repeated 5 times, and the accuracy was averaged over all folds.
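The partitioning scheme can be sketched as a small generator. This is a minimal illustration with a hypothetical function name; since 2 sessions with 51 trials for each of 2 tasks give 204 trials, which is not divisible by 10, the sketch assumes blocks of nearly equal size.

```python
import numpy as np

def repeated_kfold(n_trials, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a repeated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_trials)          # new random order for every repetition
        for test_idx in np.array_split(perm, n_folds):
            train_idx = np.setdiff1d(perm, test_idx)
            yield train_idx, test_idx

# 204 trials per subject: 2 sessions x 2 tasks x 51 trials.
splits = list(repeated_kfold(204))
```

Feature selection and PCA would have to be refitted on `train_idx` inside every fold, as the text states, so that the test block stays unseen.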
While non-stationarities have a great effect in the S1S2-validation, since the test set has an unknown data distribution, this effect should be minimized in a cross-validation (CV), because the data are permuted and the classifier therefore knows the data distribution of both sessions. Using both validation methods allows a direct estimate of how much the covariate shift adaption methods alleviate non-stationarities. To specifically address this issue, and because the proposed method would not make sense on permuted data, no covariate shift adaption was performed for the cross-validation. For a fair comparison, we used not only the baseline method but also the baseline method combined with PCA for dimensionality reduction to the m = 100 principal components with the highest variance. Although the number of features used for the PCA-based methods differs from the number used for the baseline method, we always used the number of features that gave the best results in a cross-validation for the specific method.
To confirm the results of the offline analysis, we integrated a test of the proposed method into an ongoing online experiment with 10 subjects, who had to perform motor imagery and mental subtraction. To explicitly evaluate the covariate shift adaption in the context of the session-transfer problem, each subject participated in two sessions. In the first session, 200 trials of training data were recorded. In the second session, the classifier was trained on the training data from the first session, and the proposed method was tested with online feedback in 200 trials. Each session was separated into runs of 40 trials, with a short break after each run.
Recording was again done with a 275-channel whole-head MEG system (VSM MedTech Ltd.) at a sampling rate of 586 Hz. During measurement, the head position was continuously recorded. A notebook with an Intel Core i7 720QM and 4 GB of memory running BCI2000 was used for signal acquisition, signal processing, feedback presentation and classification. The design of the paradigm and the corresponding time intervals are shown in Figure 1. During the test phase, feedback indicating the result of the classifier was given after every trial. Since the online test of the proposed method was integrated into an ongoing experiment, it should be noted that the first 4 subjects (B01, B03, B07, B08) received feedback without the covariate shift adaption method, and the results of these 4 subjects shown below are from a simulated online experiment. The other 6 subjects received online feedback with the covariate shift adaption method proposed in this paper.
To test how the performance deterioration during a session is affected by the proposed covariate shift adaption method, we performed a linear regression (least-squares regression using MATLAB's polyfit function) on the accuracy over the course of a session and used the slope of the regression line as a measure of performance deterioration.
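The slope measure can be illustrated with NumPy's `polyfit`, which performs the same kind of least-squares fit as the MATLAB function named above. The text does not state how accuracy is sampled within a session, so this sketch assumes one accuracy value per run of 40 trials; the function name `deterioration_slope` is hypothetical.

```python
import numpy as np

def deterioration_slope(correct, block_size=40):
    """Slope of a least-squares line through the per-block accuracies of one session."""
    correct = np.asarray(correct, dtype=float)
    n_blocks = len(correct) // block_size
    acc = correct[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    slope, _intercept = np.polyfit(np.arange(n_blocks), acc, 1)
    return slope

# Toy session: 5 runs of 40 trials with accuracies 1.0, 0.9, 0.8, 0.7 and 0.6.
runs = [np.r_[np.ones(40 - k), np.zeros(k)] for k in (0, 4, 8, 12, 16)]
slope = deterioration_slope(np.concatenate(runs))
```

A negative slope indicates that accuracy falls over the session; in the toy data it drops by 0.1 per run.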
To compare the online results of the pcanorm method with the baseline method, the baseline method was applied offline to simulate the online experiment with the same data and the same parameters but a different covariate shift adaption method.