Improving P300 Speller performance by means of optimization and machine learning

Brain-Computer Interfaces (BCIs) are systems that allow people to interact with the environment by bypassing the natural neuromuscular and hormonal outputs of the peripheral nervous system (PNS). These interfaces record a user's brain activity and translate it into control commands for external devices, thus providing the PNS with additional artificial outputs. In this framework, BCIs based on the P300 Event-Related Potential (ERP), an electrical response recorded from the brain after specific events or stimuli, have proven to be particularly successful and robust. The presence or absence of a P300 evoked potential within the EEG features is determined through a classification algorithm. Linear classifiers such as SWLDA and SVM are the most widely used for ERP classification. Due to the low signal-to-noise ratio (SNR) of EEG signals, multiple stimulation sequences (a.k.a. iterations) are carried out and averaged before the signals are classified. However, while increasing the number of iterations improves the SNR, it also slows down the process. In early studies, the number of iterations was fixed (no stopping), but recently several early stopping strategies have been proposed in the literature to dynamically interrupt the stimulation sequence when a certain criterion is met, in order to enhance the communication rate. In this work, we explore how to improve classification performance in P300-based BCIs by combining optimization and machine learning. First, we propose a new decision function that aims at improving classification performance in terms of accuracy and Information Transfer Rate, in both a no stopping and an early stopping setting. Then, we propose a new SVM training problem that aims to facilitate the target-detection process. Our approach proves to be effective on several publicly available datasets.


Introduction
A Brain-Computer Interface (BCI) is a system that records a user's brain activity and allows them to interact with the environment by exploiting both signal processing and machine learning algorithms. In most cases, the recorded signals are noisy, so filtering or averaging techniques are used to improve the signal-to-noise ratio (SNR). The pieces of information embedded in the signals that are relevant to characterize the user's mental states are then selected during a feature extraction procedure, before being classified and translated into artificial outputs, i.e. into control commands for an output device such as a pointer, a keyboard or a robotic arm [18,19,21,23,35]. BCIs use electrical, magnetic or metabolic signals [35] recorded with methods such as electroencephalography (EEG), electrocorticography (ECoG), magnetoencephalography (MEG), functional Near Infra-Red Spectroscopy (fNIRS) and functional Magnetic Resonance Imaging (fMRI). In this framework, BCIs based on event-related potentials (ERPs) have proven to be particularly successful and robust [26]. ERPs represent the electrical responses recorded from the brain through EEG techniques after specific events or stimuli. The ERPs are embedded within the general EEG activity [29], and are time-locked to the processing of a specific stimulus. As their amplitude is lower than that of the ongoing EEG activity, averaging techniques are employed to increase the SNR: in principle, averaging background noise which is not correlated to an event, such as the ongoing EEG activity, tends to reduce its contribution to a small offset, which can be easily filtered out, while the evoked responses, supposed to be the same after each stimulus, are left unmodified. An ERP-based BCI attempts to detect ERP components to infer the stimulus that the user intended to choose, i.e. the stimulus eliciting the ERP components [31]. In 1988, the P300 ERP was first used by Farwell and Donchin within a BCI system [7].
Their P300 Speller consists of 36 alpha-numeric characters arranged within the rows and columns of a 6 × 6 matrix. The user's task is to focus the attention on a specific character, i.e. on one of the cells of the matrix. Each of the 6 rows and 6 columns then flashes for a fraction of a second in a random sequence. A sequence of 12 different flashes (the 6 rows and the 6 columns) is called an iteration. It constitutes the basis of an oddball paradigm in which two classes of stimuli, namely the target (or rare) and the non-target (or frequent), occur with different probabilities (0.166 and 0.833 in this case) and elicit two different brain responses. In particular, the target (rare) stimuli should elicit the P300 response, which is not evoked after non-target (frequent) stimuli. In our case the row and the column containing the attended character represent the target stimuli, while the other ten are the non-target ones. Brain responses to the target and non-target stimuli are distinguished using a classification algorithm. The correct identification of the target row and column allows the selection of the desired character, which is located at their intersection [13,14,28].
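To make the selection mechanism concrete, the following minimal Python sketch builds a 6 × 6 speller grid and selects the character at the intersection of a predicted target row and column. The exact character layout and the helper names are illustrative assumptions, not taken from [7]:

```python
import string

# A 6x6 alphanumeric grid in the spirit of Farwell and Donchin's speller
# (the exact character layout here is an illustrative assumption).
CHARS = string.ascii_uppercase + "0123456789"          # 36 symbols
MATRIX = [list(CHARS[6 * r:6 * r + 6]) for r in range(6)]

def select_character(target_row: int, target_col: int) -> str:
    """The spelled character lies at the intersection of the predicted
    target row and the predicted target column."""
    return MATRIX[target_row][target_col]

def stimulus_probabilities(n_rows: int = 6, n_cols: int = 6):
    """Per-flash probability that a flash is a target (the row or the
    column of the attended cell) in one 12-flash iteration."""
    n_flashes = n_rows + n_cols
    p_target = 2 / n_flashes        # one target row + one target column
    return p_target, 1 - p_target
```

With the default 6 × 6 grid, `stimulus_probabilities()` reproduces the target/non-target probabilities of roughly 0.166 and 0.833 mentioned above.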
Later on in the literature, different variations of the original P300 paradigm have been developed in order to improve the speller framework. For instance, in [25,27,32] the authors proposed gaze-independent spellers, i.e. communication systems that can be used by subjects who have impairments that prevent them from moving their eyes. In all speller paradigms, given a sentence/run to copy-spell, the EEG data are organized in terms of trials, iterations, and sub-trials. A single character selection step is here referred to as a trial. Each trial consists of several iterations/stimulation sequences, during which all the stimuli are intensified once in a pseudo-random order. A single stimulus intensification is here referred to as a sub-trial. The trial's selection process can involve one or two levels. In the former case, symbols are typically presented successively, thus involving a single selection step. In the latter, the user has to select a group of symbols first and then the target symbol.
To use a BCI, two phases, namely training/calibration and test/online, are typically required. During the calibration phase, the user focuses his/her attention on a specific character. The acquired EEG signals are then preprocessed by filtering. A subset of EEG features is extracted to represent the signal in a compact form. The obtained EEG patterns are recognized using a classification algorithm, which is trained on the subset of identified features to determine the presence or the absence of a P300 evoked potential. In the online phase, new EEG patterns are classified using the trained model before being translated into a command for an application. As described above, in ERP-based BCIs, multiple iterations are carried out for each single selection step to improve the SNR. However, repeated stimulations increase the time necessary to detect the brain signals, thus reducing the communication rate. In this work, we explore how to improve the classification performance by combining optimization and machine learning.

Literature Review
As mentioned above, the presence or the absence of a P300 evoked potential within the EEG features is determined using a classification algorithm [13]. Formally, the detection of brain responses to the target and non-target stimuli can be translated into a binary classification problem. Let $TS$ be the training set, defined as

$$TS = \{(x_{k,r,t,f},\, y_{k,r,t,f}) : k \in [1 \dots n_k],\ r \in [1 \dots n_r],\ t \in T,\ f \in [1 \dots n_f]\}, \quad y_{k,r,t,f} \in \{+1, -1\}, \qquad (1)$$

where $n_k$ denotes the total number of trials in the training phase and $n_r$ denotes the number of iterations for each trial; the number of flashes $n_f$ and the set of levels $T$ together denote the set of possible stimuli that compose the stimulation sequence (i.e. $n_f = 6$ and $T = \{\text{row}, \text{column}\}$ for the P300 Speller paradigm, or $T = \{\text{outer}, \text{inner}\}$ for two-level paradigms).
During the calibration phase, a classification algorithm is trained over $TS$ to learn a discriminant function $f$ mapping each EEG pattern to its label, and this function is used in the online phase to spell words or sentences. In the BCI literature, several algorithms have been proposed for addressing this classification problem [19]. In particular, linear classifiers such as stepwise linear discriminant analysis (SWLDA) [5] and the support vector machine (SVM) [8] are still the most used discriminant algorithms for ERP classification [19]. These methods classify the brain responses by means of a separating hyperplane [13]. This discriminant function is built on the basis of the training data, and it is defined as

$$f(x) = w^T x + b, \qquad (3)$$

where $w$ is the vector containing the classification weights and $b$ is the bias term. Linear classifiers differ in the way they learn $w$ and $b$ [13]. In (3), the right-hand side is called the decision value. Its absolute value is proportional to the distance of the sample point $x$ from the separating hyperplane. In a standard binary classification problem, the class label of each instance is assigned based on the sign of the corresponding decision value. However, in a classical P300 Speller [7], based on the assumption that a P300 is elicited for exactly one of the six row/column stimuli, and observing that the P300 response is invariant to row/column stimulation, the target class is assigned to the stimuli matching the maximum decision values for both the rows and the columns [13]. In general, recalling the definition of $T$ and $n_f$ given in (1), we can identify the target stimulus for trial $k \in [1 \dots n_k]$ and iteration $r \in [1 \dots n_r]$ as

$$trg(k, r, t) = \arg\max_{f \in [1 \dots n_f]} \left( w^T x_{k,r,t,f} + b \right). \qquad (4)$$

The predicted character for trial $k \in [1 \dots n_k]$ and iteration $r \in [1 \dots n_r]$ is then identified by combining the predicted target stimuli found $\forall t \in T$ (i.e. a row target and a column target for the standard P300 paradigm). As mentioned in Section 1, for each character, data recorded from multiple iterations have to be integrated to improve the SNR.
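The maximum-decision-value assignment described above can be sketched in a few lines of Python (an illustrative fragment; the dictionary layout of the decision values is an assumption):

```python
# Target assignment by maximum decision value: for each level (row/column),
# the predicted target is the flash whose decision value w^T x + b is maximal.
def predict_target(decision_values):
    """decision_values[t][f]: decision value of flash f at level t.
    Returns, for each level, the index of the flash with maximal value."""
    return {t: max(range(len(dv)), key=dv.__getitem__)
            for t, dv in decision_values.items()}

dv = {"row":    [-1.2, 0.8, -0.3, -0.9, -1.1, -0.5],
      "column": [-0.4, -0.7, -1.0, 1.5, -0.2, -0.6]}
# predicted target: row index 1, column index 3
```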
To the best of our knowledge, there exist two main iteration-averaging strategies in the literature: (i) ERP avg: for each character, brain responses to target and non-target stimuli are averaged across the iterations before being classified; and (ii) DV med: for each character, the decision values of each target and non-target stimulus are averaged across the iterations before assigning the target class. Recently, in [3], a new classification function named the score-based function (SBF) has been introduced for integrating brain responses recorded from multiple iterations. For each character, the SBF exploits a set of heuristically-determined scores to weight each stimulus according to its decision value. For each stimulus, the assigned scores are summed up iteration by iteration. The target class (one for the row and one for the column) is assigned to the stimulus having the highest total score at the last available iteration. The SBF has been introduced for developing an early stopping method (ESM), i.e. an automatic method that interrupts the stimulation at any point in a trial when a certain criterion, based on the ongoing classification results, is satisfied (see for instance [9,11,12,16,17,20,26,27,30,33,34,37]). The proposed ESM outperformed the state-of-the-art early stopping methods.
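Strategy (ii), averaging the decision values of each stimulus across iterations before assigning the target class, can be sketched as follows (names and data layout are illustrative assumptions):

```python
def average_decision_values(dv_per_iteration):
    """dv_per_iteration[r][f]: decision value of flash f at iteration r.
    Returns the per-flash mean across iterations and the predicted target
    flash (index of the maximum averaged decision value)."""
    n_r = len(dv_per_iteration)
    n_f = len(dv_per_iteration[0])
    avg = [sum(dv_per_iteration[r][f] for r in range(n_r)) / n_r
           for f in range(n_f)]
    return avg, max(range(n_f), key=avg.__getitem__)
```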
In this paper, we follow the same line of research of [3], making some further steps to include the information on the protocol in the classification phase. Indeed, the novelty of our approach consists of three points: 1. determine the optimal scores for each participant by solving an optimization problem on her/his training data; 2. solve a modified version of the optimization problem in order to implement an efficient early stopping method; 3. include the information on the decision function (the target is the stimulus having the maximum decision value) in the training problem. The great advantage of our method is that the calibration phase (different for each participant) becomes completely automatic and does not need any cross-validation phase or manual parameter tuning. The paper is structured as follows: in Section 2, we introduce our new decision function, defining the optimization problems to be solved both in the no stopping and in the early stopping scenario. In Section 3, we introduce a new training problem that explicitly takes into account the target assignment in BCI, and in Section 4 we derive its Wolfe dual. In Section 5 we report the behavior of our new approaches on several datasets, and finally we draw some conclusions in Section 6.

An optimized score based decision function
In [3], a set of heuristically-determined scores has been used to weight and combine the decision values of multiple iterations within an early stopping setting. In this work, we modify the approach by using a set of scores automatically determined by solving a mixed integer linear programming (MILP) problem for each participant. Each stimulus receives a weight according to its decision value: five zones are defined, and each zone gets a different score a, b, c, d, e. In particular, the scores are related to the confidence in the classification of the given stimulus as target: the score a is assigned to the stimulus that is most likely to be the target, whereas the stimuli that are highly unlikely to be the target get score e. All the stimuli in between get decreasing scores according to the distribution of the decision values. The zones are identified by considering the decision values of all iterations for all stimuli in the training set and computing the corresponding quartiles Q1, Q2 and Q3. The idea is to produce scores that reflect the distribution of the data. Figure 1 shows how the scores are assigned depending on the distribution of the quartiles of the decision values. The maximum score a is assigned only if the confidence in the current classification is extremely high, i.e. if the decision value is positive and higher than all the other decision values of the current iteration.
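A simplified sketch of the quartile-based score assignment follows. The concrete integer values chosen for the scores a, ..., e and the exact zone boundaries are illustrative assumptions (the paper's zone layout is the one summarised in Figure 1):

```python
import statistics

def fit_zones(training_decision_values):
    """Compute Q1, Q2, Q3 of the decision values of all iterations for
    all stimuli in the training set."""
    q1, q2, q3 = statistics.quantiles(training_decision_values, n=4)
    return q1, q2, q3

def assign_scores(iteration_dvs, zones, scores=(5, 4, 3, 2, 1)):
    """Assign a score to each flash of one iteration.  The top score `a`
    is reserved for a flash whose decision value is positive and strictly
    larger than every other decision value of the current iteration;
    the remaining zones are delimited by the training quartiles."""
    q1, q2, q3 = zones
    a, b, c, d, e = scores
    best = max(iteration_dvs)
    out = []
    for v in iteration_dvs:
        if v == best and v > 0 and iteration_dvs.count(best) == 1:
            out.append(a)
        elif v >= q3:
            out.append(b)
        elif v >= q2:
            out.append(c)
        elif v >= q1:
            out.append(d)
        else:
            out.append(e)
    return out
```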
Note that, given the separating hyperplane, the score assignment for each stimulus of each character is known: it is therefore possible to build binary vectors that represent the score vector assignment $z$ for each stimulus of each character in a compact form,

$$z_{t,f}^{k,r,s} = \begin{cases} 1 & \text{if stimulus } f \text{ of level } t \text{ for character } k \text{ receives score } s \text{ at iteration } r,\\ 0 & \text{otherwise,} \end{cases}$$

where $f = 1 \dots n_f$ and $t \in T$ identify the stimulus, $k = 1 \dots n_k$ identifies the character, $r = 1 \dots n_r$ identifies the iteration and, finally, $s = a, \dots, e$ identifies the score. The score assignment depends on the primary aim of the BCI:
(i) if the main focus is accuracy, the idea is to use all the available iterations for spelling a character (no stopping protocol), also in the online phase;
(ii) if the idea is to speed up the communication, then the performance measure to be maximized is the transmission rate, trying to reduce the number of iterations needed to spell a character in the online phase (early stopping).
In the next two subsections, we describe the Mixed Integer Linear Programming (MILP) Problems we define in order to find the scores in the two different settings.

No stopping OSBF
First, we propose a strategy to choose the scores when all the iterations are exploited and the primary focus is to increase the classification accuracy. In this setting, we aim at a reliable classification, and we do so by imposing the following constraints:
1. at the last iteration we require, if possible, that the score obtained by the target stimulus is larger (with some margin if possible, which implies robustness of the classification) than the score of any non-target stimulus; this means that we ask not to fail in the classification after the last available iteration. If this is not possible, a suitable binary variable representing the failure on that stimulus is set to one;
2. to make the classification more robust on the test set, we require that in as many iterations as possible the score of the target is larger than that of the non-target stimuli;
3. as an objective, we maximize the accuracy on the training set together with the number of iterations where the classification is robust.
Our main variable in the optimization problem is the vector of scores $s = (a\ b\ c\ d\ e)^T$. We add an auxiliary variable, called $\Delta$, to impose some distance between the score of the target stimulus and the scores of the non-target stimuli; it represents a measure of reliability of the classification. Further, we add some binary variables: • $x_{t}^{k,r}$: binary variable that is equal to 1 if the target of character $k$ for level $t$ has, at iteration $r$, a score larger than the score of any non-target stimulus plus $\Delta$; • $err_{t}^{k}$: binary variable that is equal to 1 if the target is not correctly classified for character $k$ at level $t$, i.e. if at the last iteration the target score is lower than or equal to the score of some non-target stimulus. The MILP problem to be solved is then the following: where $l$ and $u$ are chosen bounds on the possible values of the scores, and $M$ is large enough to make the constraints trivially satisfied when the corresponding binary variable $x_{t}^{k,r}$ is zero. The objective function, to be maximized, is composed of two terms: the percentage of success on the training set, and the average number of iterations where the classification is robust and reliable. We then have the following constraints: (i) constraints (6), (7) and (9) impose that the scores are bounded, sorted in decreasing order, and differ by at least one, whereas constraint (8) imposes that the first three scores are nonnegative;
(ii) constraint (10) imposes a lower bound on the threshold to ensure reliability of the classification. Indeed, this lower bound guarantees that the threshold has a minimum value depending on the scores: in particular, $s_1 - s_5 + 1$ represents the maximum difference in score that can be assigned to different flashes in a single iteration. Therefore, even in the worst possible scenario, where two flashes get the same score, there must be at least one iteration where one gets the maximum score and the other the minimum score to break the tie.
(iii) constraints (11) impose that variable $err_{t}^{k}$ is 1 if and only if $x_{t}^{k,n_r} = 0$, i.e. it represents an unreliable classification at the last iteration.
(iv) constraints (12) impose that if at iteration $r$ the classification is reliable for the target $trg(k, t)$ of character $k$ at level $t$, then the corresponding binary variable $x_{t}^{k,r}$ is set to 1.
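For tiny instances, the effect of the score-selection problem can be illustrated without a MILP solver by exhaustively enumerating ordered integer score vectors. This brute-force sketch only mimics the accuracy term of the objective, ignoring $\Delta$ and the robustness term, and all names are assumptions:

```python
from itertools import combinations

def training_accuracy(scores, zone_labels, targets):
    """zone_labels[k][r][f] in {0..4}: zone index (a..e) of flash f of
    character k at iteration r; targets[k]: index of the true target flash.
    Classify each character by the flash with the highest summed score."""
    correct = 0
    for k, trials in enumerate(zone_labels):
        totals = [0] * len(trials[0])
        for iteration in trials:
            for f, z in enumerate(iteration):
                totals[f] += scores[z]
        if max(range(len(totals)), key=totals.__getitem__) == targets[k]:
            correct += 1
    return correct / len(zone_labels)

def best_scores(zone_labels, targets, l=0, u=10):
    """Exhaustively search score vectors s1 > s2 > ... > s5 in [l, u]
    and keep the one maximising training accuracy (toy substitute for
    the MILP of problem (5))."""
    best, best_acc = None, -1.0
    # combinations over a decreasing range yields strictly decreasing tuples
    for combo in combinations(range(u, l - 1, -1), 5):
        acc = training_accuracy(combo, zone_labels, targets)
        if acc > best_acc:
            best, best_acc = combo, acc
    return best, best_acc
```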

Early Stopping OSBF
Problem (5) can be modified in order to improve the system performance in terms of speed, implementing an automatic Early Stopping Method, similarly to [3].
The idea is again to use the scores s and the threshold ∆ at each iteration of the test phase to verify an early stopping condition: during the test phase, the stimuli are ordered according to the sum of their scores and, if the difference in score between the first and second stimulus is greater than the threshold ∆, the method classifies the target character and the remaining iterations are not performed.
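The early stopping check performed at each iteration of the test phase can be sketched as follows (an illustrative fragment; names are assumptions):

```python
def early_stop(cumulative_scores, delta):
    """Return the index of the winning stimulus if the early stopping
    condition holds (the best total score exceeds the runner-up by more
    than delta); otherwise return None and stimulation continues."""
    ranked = sorted(range(len(cumulative_scores)),
                    key=cumulative_scores.__getitem__, reverse=True)
    first, second = ranked[0], ranked[1]
    if cumulative_scores[first] - cumulative_scores[second] > delta:
        return first
    return None
```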
In order to adapt problem (5) to the early stopping setting, we introduce some further constraints and modify the meaning of some binary variables. In this case, the objective function takes into account both the percentage of success (to be maximized) and the time needed for classification (to be minimized). Note that the second term (which represents the trial duration in minutes) is multiplied by a factor of 100 to make the two terms of the objective function comparable. We then have some further constraints, since in this case we are interested in the first iteration $\bar r$ where the early stopping condition (19) is met. In this model, we set the binary variables $x_{t}^{k,r}$ in such a way that $x_{t}^{k,r} = 1$ if and only if the early stopping condition (19) is verified for the first time on the target at iteration $r$, and it is not satisfied by any non-target stimulus earlier. This is imposed by the combination of constraints (12), (16) and (17).
We stress that in both the no stopping and the early stopping scenarios, the MILP problem is solved using the training set data (the same used to build the hyperplane), whereas the score efficiency is evaluated on the test set.

A new training problem
As already pointed out in the introduction, in order to achieve a good classification accuracy it is fundamental to exploit the information that at each iteration there is exactly one target stimulus for each level, and then assign the target class to the stimulus having the maximum decision value. Our idea is to add this protocol knowledge directly into the training problem.
Given the definition of the training set in (1), the standard training problem to solve in order to find a separating hyperplane according to the SVM approach is the following [22]:

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \geq 0} \ \frac{1}{2}\|w\|^2 + C \sum_{i \in TS} \xi_i \quad \text{s.t.} \quad y_i \left( w^T x_i + b \right) \geq 1 - \xi_i \quad \forall i \in TS.$$

In this work, we modify the training problem by including the information that the target stimuli should receive the maximum decision value among all the other flashes. Let us denote by $trg_i$ the target stimulus of the stimulation sequence to which stimulus $i$ belongs: in particular, if $i = (k, r, t, f)$ we will have: Then, we want to impose: From now on, in order to simplify the notation, we will write constraints (23) in the following more compact form: and we add slack variables to avoid infeasibility, obtaining the following set of constraints: Now we simply plug these constraints into the primal problem, obtaining the new training problem based on the maximum decision function: $\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}}$ where the vector $z$ is defined as:

Wolfe Dual of the new training problem
In order to build the Wolfe dual of the quadratic optimization problem (27)-(31), it is necessary to introduce the dual multipliers of the constraints: • $\lambda_i$ $\forall i \in TS$: the multiplier associated to constraints (28); • $\rho_i$ $\forall i \in TS : y_i = 1$: the multiplier associated to constraints (29); • $\mu_i$ $\forall i \in TS$: the multiplier associated to constraints (30); • $\theta_i$ $\forall i \in TS : y_i = 1$: the multiplier associated to constraints (31). Let us define the vectors $\lambda$ and $\rho$ as the vectors of size $l_1$ and $l_2$, respectively, containing $\lambda_i$ ($\forall i \in TS$) and $\rho_i$ ($\forall i \in TS : y_i = 1$). Then we define the following matrix $\Sigma \in \mathbb{R}^{(l_1+l_2) \times n}$: The following proposition holds: the dual problem of problem (27)-(31) is problem (36). Proof. The Wolfe dual of problem (27)-(31) is given by: where $L(w, b, \xi, \eta, \lambda, \rho, \mu, \theta)$ is the Lagrangian of optimization problem (27)-(31), which can be expressed as follows: By rearranging terms, equation (42) can be rewritten as: The constraints of the Wolfe dual (equations (37)-(40)) can now be computed from the Lagrangian function in equation (43). The equation $\nabla_w L(w, b, \xi, \eta, \lambda, \rho, \mu, \theta) = 0$ leads to an expression for $w$; $\partial L / \partial \xi_i = 0$ allows us to derive $\mu_i$ as a function of $\lambda_i$, whereas $\partial L / \partial \eta_i = 0$ yields an expression of $\theta_i$ as a function of $\rho_i$. Non-negativity of the multipliers $\lambda, \rho, \mu, \theta$, combined with equations (46) and (47), results in the following set of constraints: We can plug equations (46) and (47) into the objective function, obtaining: The Wolfe dual of problem (27)-(31) can then be expressed by using equation (50) as objective and equations (44), (45), (48), (49) as constraints.
Note that:

Figure 2: Graphical representation of the AMUSE paradigm, in which six speakers are placed around the subject. In the first level, each speaker is used to represent a set of characters, while at the second level each speaker is used to represent a single character from the previously selected set.
Using the vectors $\lambda$ and $\rho$ and the matrix $\Sigma$ defined above, the dual problem can then be rewritten in matrix form, and it is still a quadratic convex programming problem.

Datasets
We tested our approaches on six different datasets: AMUSE The protocol is based on auditory stimuli delivered by means of spatially located speakers; it has two levels, 15 rounds, and six classes for each level, see Fig. 2 [27]. It was performed on healthy subjects, and the data are downloadable from the BNCI Horizon website [1].

P300 Speller
The protocol is the classical P300 Speller [7], performed on 10 healthy subjects.

ALS P300 Speller
The protocol is the classical P300 Speller [7], performed on 8 patients suffering from Amyotrophic Lateral Sclerosis (ALS).
MVEP It is a visual protocol in which a moving pattern generates a movement-onset visual evoked potential that is used to recognize the user's choice. This protocol is based on modifications of the Cake Speller protocol [32]. Sixteen healthy subjects were involved in the study.
Center Speller It is a visual protocol in which the choice is elicited by means of three different visual stimuli; it has two levels, 10 rounds, and six classes for each level [32]. It was performed on 13 healthy subjects.
Akimpech It is a P300 Speller performed on 27 healthy subjects; the number of characters is 16, with 15 iterations for each character in the calibration phase, whereas in the online phase the number of iterations changes depending on the subject.
All EEG signals were pre-processed and features were extracted with the NPXLab Suite [4]. Details of the datasets are reported in Table 1. Please note that we evaluated our strategy on EEG data recorded from 95 subjects, thus assessing its generalization capabilities. Two principal pre-processing operations were applied: • Electrode selection: for the Center Speller, MVEP, and AMUSE datasets we kept the electrodes belonging to the 10-20 EEG placement. This strategy allows us to reduce both the dimension of the dataset and the risk of overfitting; • k-decimation: this technique was applied to all datasets in order to reduce overfitting. In this case, we down-sampled the EEG signal from every electrode by replacing each k consecutive samples with their average value.
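The k-decimation step can be sketched as follows (a pure-Python illustration; the handling of a trailing remainder shorter than k is an assumption):

```python
def decimate(signal, k):
    """Down-sample a signal by replacing each run of k consecutive
    samples with their average; a trailing remainder shorter than k
    is averaged as well."""
    return [sum(signal[i:i + k]) / len(signal[i:i + k])
            for i in range(0, len(signal), k)]
```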
Furthermore, recall that the OSBF strategy requires computing the quartiles of the training-set decision values in order to assign scores to stimuli. In this scenario, we stress that, for the standard P300 Speller paradigm, stimuli corresponding to the intensification of rows and columns are considered separately; in fact, we observed that the distribution of the decision values was different for row and column stimuli. The other paradigms we considered are based on two levels of selection: in this case, we considered stimuli of the outer and inner level together when computing the quartiles, since we observed similar distributions of the decision values.

No stopping scenario
As a first step, we evaluate the impact of choosing the scores by solving problem (5). We compare our strategy with both the classical DV med approach and the SBF decision function [3], where we sum up the heuristically determined scores over all the available iterations (i.e., we use it in a no stopping fashion). We build the separating hyperplane by training a linear SVM with the package Liblinear [6]. We try both the L1 and the L2 loss and, since there is no clear winner, we report the results obtained with both losses. Table 2 shows the accuracy, i.e. the percentage of correctly classified characters, obtained by the different approaches. Findings in Table 2 show that the OSBF outperforms the other two approaches, since it reaches the highest accuracy on all the datasets. Please note that the OSBF is computationally cheap, since the solution of problem (5) is extremely fast and does not require any cross-validation phase. In order to further improve the accuracy, we build the hyperplane by solving the dual problem (36). We call this approach M-SVM. In order to solve problem (36), we apply a modification of the dual coordinate descent algorithm, as described in Appendix A. The results obtained by the OSBF applied to the M-SVM improve only on some datasets, as shown in Table 3, with a significant improvement on the two most difficult ones: the one containing ALS patients and AMUSE. The intuition is that the new training problem helps only when the standard SVM is not "good enough". In order to better understand its contribution, we look at the single-subject results, dividing the participants (across all the datasets) into two classes: Class 1, subjects for which the standard SVM problem performs better than the new M-SVM; Class 2, subjects for which the standard SVM problem performs worse than the new M-SVM.
In Table 3, we report the average accuracy on both classes, and it is quite evident that the new training problem helps whenever the starting accuracy is not too high. When the starting accuracy is high, the performance does not change or gets worse, probably due to overfitting. Interestingly, adding the constraints on the maximum decision value can be interpreted as a form of data augmentation. Indeed, if we include the bias b into the vector w, augmenting each data point in the training set with a last component equal to 1, we can reinterpret the constraints (25) as standard sign constraints imposed on the additional difference points.
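This data-augmentation reading can be sketched by explicitly building the difference vectors between each target and the non-target stimuli of its stimulation sequence (names and data layout are illustrative assumptions; the bias is folded into w via a constant last component):

```python
def augment_with_differences(X, y, sequences):
    """X: list of feature vectors (last component 1.0 for the bias);
    y: labels in {+1, -1}; sequences: lists of sample indices, one per
    stimulation sequence, each containing exactly one target (y = +1).
    Returns the difference vectors x_target - x_nontarget, which can be
    treated as extra points subject to sign constraints."""
    Z = []
    for seq in sequences:
        trg = next(i for i in seq if y[i] == +1)
        for i in seq:
            if i != trg:
                Z.append([a - b for a, b in zip(X[trg], X[i])])
    return Z
```

Note that the bias components cancel in each difference, consistent with the observation that the constraints involve only w.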

Early stopping scenario
As a second step, we consider the early stopping version of both the SBF (the current state of the art for early stopping methods) and the OSBF. In order to evaluate the performance of the proposed method with respect to the number of iterations needed for an accurate classification, the theoretical Information Transfer Rate (ITR, bit/min) has also been computed. The ITR is a communication measure based on Shannon's channel theory with some simplifying assumptions. It is computed by dividing the number of bits transmitted per selection,

$$B = \log_2 N + P \log_2 P + (1 - P) \log_2 \frac{1 - P}{N - 1}, \qquad (60)$$

where $N$ is the number of possible symbols in the speller grid and $P$ is the probability that the target symbol is accurately classified at the end of a trial, by the mean trial duration. From (60) the ITR is computed as: In (62), SOA refers to the stimulus-onset asynchrony, $f_s$ represents the number of stimuli in each stimulation sequence, and $\bar i$ is the mean number of iterations used to select a symbol. Tables 5, 6, 7 and 8 show the results obtained in the early stopping setting. Findings in Table 5 further corroborate the potential of the OSBF, since it outperforms the SBF no matter which hyperplane is used. In Table 6 we compare the early stopping results in terms of accuracy obtained with the OSBF and the different hyperplanes (L1-SVM, L2-SVM and M-SVM): we can notice that, in this case, the M-SVM reaches a higher level of accuracy than the other methods on almost all datasets. Tables 7 and 8 show the results in terms of theoretical ITR. In this case, we can see that all the strategies reach comparable results and there is no clear winner.
We can then conclude that the OSBF strategy is more conservative than the SBF, since it manages to keep a high level of accuracy while preserving the communication speed.
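A sketch of the theoretical ITR computation: the bits-per-selection term is the classical Wolpaw formula under the standard simplifying assumptions, while the trial-duration term (SOA × stimuli per sequence × mean iterations, converted to minutes) is an assumption consistent with the quantities named above, not a verbatim reproduction of the paper's formula:

```python
import math

def bits_per_trial(N, P):
    """Wolpaw bits-per-selection: N symbols, P probability of correct
    classification, uniform error distribution over the N-1 wrong symbols."""
    if P >= 1.0:
        return math.log2(N)
    if P <= 0.0:
        return 0.0  # degenerate corner, avoids log(0)
    return (math.log2(N) + P * math.log2(P)
            + (1 - P) * math.log2((1 - P) / (N - 1)))

def itr_bit_per_min(N, P, soa_s, n_stimuli, mean_iterations):
    """Theoretical ITR in bit/min: bits per selection divided by the
    assumed mean trial duration SOA * f_s * mean iterations (in minutes)."""
    trial_minutes = soa_s * n_stimuli * mean_iterations / 60.0
    return bits_per_trial(N, P) / trial_minutes
```

For a 36-symbol grid spelled with perfect accuracy, `bits_per_trial(36, 1.0)` reduces to log2(36), a little over 5 bits per selection.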

Conclusions
This paper focuses on the classification problem that arises in many BCI protocols. The idea was to exploit knowledge of the protocol in order to improve classification performance through: (i) the use of a MILP problem to assign a "reliability score" to the classification of each stimulus in every iteration; (ii) the definition of a new training problem that takes into account that the target class is assigned to the stimulus having the maximum decision value.
Both novel elements have been applied in two different scenarios: a first one where accuracy was the main focus and all the iterations available for each subject were used both in the calibration and in the online phase; and a second one where the focus was on improving the communication speed, and hence an early stopping strategy was implemented in the online phase. In order to evaluate the approaches, we conducted an extensive experimentation on datasets coming from different protocols, including both healthy subjects and ALS patients.
The results show that we were able to improve accuracy and ITR on all the datasets, proving once more that combining machine learning tools with problem knowledge can significantly improve performance.
[32] Matthias Sebastian Treder, Nico Maurice Schmidt, and Benjamin Blankertz. Gaze-independent brain-computer interfaces based on covert attention and feature attention.

A Dual Coordinate Descent Algorithm
In this section we describe how we modified the dual coordinate descent algorithm proposed in [10] in order to find the separating hyperplane for problem (27)-(31). The dual coordinate descent algorithm solves the dual problem by applying a Gauss-Seidel decomposition method in which each variable constitutes a block, and the subproblem with respect to a single variable is solved globally and analytically. We adapt the algorithm by modifying the following points: • how the gradient of the objective function is computed; • how the hyperplane is updated.
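For reference, a minimal sketch of the unmodified dual coordinate descent step of [10] for the standard L1-loss SVM dual follows; the adaptations described above (the modified gradient computation and hyperplane update) are not shown, and all names are illustrative:

```python
def dual_cd(X, y, C=1.0, epochs=50):
    """Plain dual coordinate descent for the L1-loss linear SVM dual.
    Each dual variable alpha_i is optimised analytically in turn while
    w is maintained incrementally, as in the base algorithm of [10]."""
    n = len(X[0])
    w = [0.0] * n
    alpha = [0.0] * len(X)
    qii = [sum(v * v for v in x) for x in X]   # diagonal of the Gram matrix
    for _ in range(epochs):
        for i, x in enumerate(X):
            if qii[i] == 0.0:
                continue
            # partial gradient: y_i w^T x_i - 1
            g = y[i] * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            # analytic solution of the one-variable subproblem, clipped to [0, C]
            new_a = min(max(alpha[i] - g / qii[i], 0.0), C)
            delta = new_a - alpha[i]
            if delta != 0.0:
                alpha[i] = new_a
                w = [wj + delta * y[i] * xj for wj, xj in zip(w, x)]
    return w, alpha
```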
In particular, we can write the objective function $f$ and the separating hyperplane $w$ as: Let us define the vector $\alpha = (\lambda^T\ \rho^T)^T \in \mathbb{R}^{l_1+l_2}$. Equations (63) and (64) can equivalently be expressed with respect to the vector $\alpha$. We can then express the $i$-th component of the gradient of $f(\alpha)$ as: which can be rewritten as:

B Detailed numerical results
As a supplement, we provide the detailed results obtained for all subjects across all considered datasets. In Table 9 we provide the results obtained in the no stopping setting, while in Tables 10 and 11 the results for the early stopping setting are reported.