Fall compensation detection from EEG using neuroevolution and genetic hyperparameter optimisation

Detecting fall compensatory behaviour from large EEG datasets poses a difficult problem in big data which can be alleviated by evolutionary computation-based machine learning strategies. In this article, hyperheuristic optimisation solutions via evolutionary optimisation of deep neural network topologies and genetic programming of machine learning pipelines will be investigated. Wavelet extractions from signals recorded during physical activities present a binary problem for detecting fall compensation. The earlier results show that a Gaussian process model achieves an accuracy of 86.48%. Following this, artificial neural networks are evolved through evolutionary algorithms and score similarly to most standard models; the hyperparameters chosen are well outside the bounds of batch or manual searches. Five iterations of genetic programming scored higher than all other approaches, at a mean 90.52% accuracy. The best pipeline extracted polynomial features and performed Principal Components Analysis, before machine learning through a randomised set of decision trees, and passing the class prediction probabilities to a 72-nearest-neighbour algorithm. The best genetic solution could infer data in 0.02 s, whereas the second best genetic programming solution (89.79%) could infer data in only 0.3 ms.

The scientific contributions of this work are as follows: Genetic programming of learning pipelines provides the strongest models for fall detection from EEG; the algorithm is executed for five individual iterations, leading to the five best overall results. Two methods of hyperheuristic optimisation are explored: (a) evolutionary optimisation of neural network hyperparameters (neuroevolution), and (b) genetic programming of ML pipelines. The results show that when neuroevolution is executed for fifty generations over five iterations, deep learning struggles with the available fall detection data. The best neuroevolved artificial neural network achieves 73.41% mean accuracy, which is worse than ten of the other algorithms explored and better than 15 others. Exploration of the Pareto frontiers of accuracy versus training time and accuracy versus inference time shows that the best model scores 90.52% mean accuracy and can infer brain activity in 0.019 s per prediction. The second-best model scores a slightly lower 89.79% mean accuracy but can infer data objects in only 0.3 ms. All results are made open source, with Python code provided that is compatible with scikit-learn.
The remainder of this article is organised as follows: Sect. 2 provides a background and review of the literature relevant to this study. Section 3 then describes the methodology followed by each of the experiments included in this work. Section 4 presents the results of all experiments, with Sects. 4.4 and 4.5 presenting the results for neural network neuroevolution and genetic programming of the ML pipelines, respectively. Finally, Sect. 5 concludes the findings of this study and presents suggestions for future work based on them.

Background and related work
Falls are most commonly caused by gait instability, confusion and agitation, urinary incontinence and frequency, and the use of prescription sedative and hypnotic drugs, according to a review by Oliver et al. [14]. Ageing, in general, leads to a decrease in balance [15]. Many injuries that occur commonly for people over 65 years of age are more severe and often preventable [16]. According to [17], 33% of adults considered to be elderly will experience a fall on average once per year. This risk is noted to rise later in life, with 50% of those over 80 suffering one or more falls per year. In the United States, there were more than 2.6 million fall-related injuries in the year 2000 [18], of which 10,300 were fatal. Most falls do not result in serious physical injury or death, but they do lead to a loss of confidence, social withdrawal, and a feeling of lost independence [19]. This feeling of lost independence may lead to a higher incidence of depression [20]. Fall detection is the use of technology to automatically recognise when someone has fallen, which can then lead to healthcare providers or family members being alerted without any human intervention required. Automatic fall detection alleviates problems after a fall related to situations in which an emergency call button or cord cannot be reached [21]. Studies have shown that a fall event can be detected through several proposed methods that include, but are not limited to, wireless networks [22], computer vision [23], thermal image processing [24], acoustic classification [25], and human activity recognition (HAR) through wearable sensors [26].
Adkin et al. [27] report that compensatory balance reactions are recognisable within the recorded EEG data. There exists a great overlap in the functions of each of the brain's lobes, but it is currently understood that much of the coordination involved in balance takes place in the cerebellum, since damage to this area of the brain can negatively affect balance and posture [28,29]. Vice versa, the volume of the cerebellum was found to be larger within a subject group of high-speed ice skaters [30]. In [31], researchers found that there were significant levels of brain activation during falls within the frontal lobe, specifically the prefrontal cortex, the dorsolateral prefrontal cortex, and the frontal eye field. Given that the cerebellum is found deep within the brain, partly obscured by the cortex, the frontal lobe provides much easier access by non-invasive EEG. Consumer-level technology is the goal of the conducted research in this article, therefore the frontal lobe is selected as the most promising candidate for detecting fall compensatory behaviour. All of the subjects in this study were both healthy and conscious, and thus exhibited normal frontal lobe activity.
In Annese et al. [32], the authors proposed multimodality learning from both EEG and Electromyography (EMG) signals towards machine learning-based fall risk prediction within the design of a specialised digital processor. In this study, EEG focused on the motor cortex and EMG electrodes were placed on the leg muscles. Findings showed that a fall event could be detected 500ms prior to its occurrence because of the brain's ability to anticipate and compensate for such events. The results in the dataset were almost perfect. The authors note the computational expense of the approach, and it is also worth noting that placing EMG on the legs and EEG with a cap is inconvenient and, therefore, not suitable for everyday use.
A more consumer-ready solution was presented in [11]. Their study explored the use of a helmet with embedded EEG electrodes for the classification of fall events. The dataset collected by the authors was classified at around 98% by a random forest ensemble. The authors note the complexity of having such an exhaustive EEG array and that, in the future, there may be methods of increasing prediction efficiency from an array of fewer electrodes. In addition to the financial costs involved with the trade-off between clinical and consumer-level sensors, errors arising from signal noise (which are more common when operating cheaper sensors [33]) can also be a source of problems for activity recognition. Machine learning-based approaches have been shown to be promising in the removal of signal artefacts [34,35]. Low-cost sensors have been found to be prone to a variety of problems, and LaRocco et al. [36] argued for the need for algorithmic optimisation.
In the context of artificial intelligence, neuroevolution is a process in which evolutionary algorithms are implemented to generate the hyperparameters of an artificial neural network, given that their selection is a problem of combinatorial optimisation [37]. These hyperparameters can include the topology of the network, that is, how wide the hidden layers are and how deep the network structure is, alongside parameters such as their activation functions, learning rates, and momentum among others. It is, therefore, a form of Automated Machine Learning (AutoML), wherein complex sets of parameters present as the search space [38], with fitness of the solution derived from the ability of the neural network (e.g. from backpropagation on data).
The use of Neuroevolution has recently gained popularity in biological signal processing due to its promising ability to engineer appropriate models. In [39], researchers proposed the use of neuroevolution for the classification of surface electromyography signals towards recognising hand gestures. Through the application of NeuroEvolution of Augmenting Topologies (NEAT) [40], results noted a mean classification accuracy of 88.76% on signal windows of 150ms. Similarly in the EEG domain, neuroevolution was proposed for the selection of channels prior to learning [41]; in this study, 64 channels of signals posed a problem prior to machine learning, and were gathered from four trans-humeral (upper-arm) amputees. Results showed that a particle swarm optimisation algorithm outperformed other heuristics.
Inspired by the findings of the literature review, and given the noted research gaps, the goals of the experiments in this study are to employ meta-heuristics to explore hyperparameter optimisation through neuroevolution of deep neural networks and the genetic programming of machine learning pipelines (including individual hyperparameter sets and ensembles). The algorithms discovered by these approaches provide additional approaches to fall detection, and are made open source with the provision of Python source code in Appendix A for genetic programming pipelines, and Appendix B for neuroevolutionary neural network hyperparameters.
Fig. 1 The NeuroSky MindWave headset which was used to collect EEG fall data in [42]
Fig. 2 Lobes of the brain involved in fall prevention and compensation. The approximate placement of the EEG sensor is denoted by the circle
The NeuroSky EEG headset shown in Fig. 1 has a single electrode placed in the FP1 position within the 10-20 EEG electrode placement system. The NeuroSky is most often worn in the position that can be seen in Fig. 2. Although many of the commercial applications of the device are based on concentration classification [43], the NeuroSky has proposed applications in fatigue detection [44], blink detection [45], and fall detection [42].

Method
This section describes the methodology followed by the experiments carried out in this article. Firstly, the dataset and data preprocessing are detailed before explaining the hyperparameter optimisation and learning approaches. The aim of these studies is to explore hyperheuristic techniques to improve the detection of falls via biological signal classification. Figure 3 shows the general approach used by the final outputs of this study. EEG signals are recorded in real-time from the sensor placement as detailed previously in Fig. 2. Following feature extraction, the model classifies whether or not fall compensation behaviour is occurring.
The diagram in Fig. 4 shows the optimisation of feature and model spaces, via genetic programming exercises that treat classification ability as fitness metrics. The goal of this algorithm is to improve the ability of fall compensation detection via EEG, a problem that persists due to the low quality of consumer-level EEG compared to clinical approaches. The neuroevolution approach takes place in the model space only, with hyperparameters of topology, activation, and loss function optimised.

Dataset and pre-processing
The initial dataset used for this study is the Preliminar Fall-UP Dataset [42]. This dataset, collected in 2019, comprises 11 physical activities performed by 4 subjects for three trials each. Of these activities, a fall event occurred in five and did not occur in six. Since fall events tend to happen quicker than non-fall events such as walking or standing, the dataset is imbalanced when considering binary classification. The activities can be observed in Table 1 alongside the binary class label applied for fall compensation detection. Only the NeuroSky EEG brainwave data is used from this dataset.
Feature extraction from the data is required since waves are temporal, i.e. information is presented over time rather than in one singular data object. Time-windowing is a suitable method to extract descriptive information on a per-data-object basis. Feature extraction is the process of extracting these statistical descriptions for classification, and its usefulness is noted in several studies [46][47][48][49][50]. Furthermore, wavelet characteristics have been identified as particularly good features to inform the description of EEG signals [51,52]. The feature extraction process for this work is as follows: the signals are initially divided into 0.5 s windows, and seven sets of features are extracted, leading to 39 individual features. The spectral entropy of each window is computed via Fourier transform, which is given as
$$H = -\sum_{f} P(f) \log P(f),$$
where P is the power spectrum of the input signal, normalised so that it forms a probability distribution. For each wavelet scale up to 8, several features are extracted following a continuous wavelet transform. They are the absolute mean, energy, entropy, standard deviation, and variance. All features are normalised via min-max scaling to the range [0, 1]. We first explore the relative entropy to observe how much information each of the attributes carries for prediction, as well as which features are particularly useful, if any. The application of the aforementioned algorithms to the data leads to single data objects that describe a temporal sequence. For example, a single reading from a signal would give one value at a given timestep. This reading holds no useful information to classify the signal, since this information is derived over time given the behaviour of the wave. Given that most machine learning algorithms do not take into account temporal sequences, the methodology of this work instead generates mathematical descriptions of time windows to provide input data.
In summary, one row of processed data describes half a second of EEG data and can be used as input to any machine learning approach, rather than being limited to those that are temporal.
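The windowing and feature extraction described above can be sketched as follows. This is a minimal illustration rather than the extraction code used in the study: the sampling rate `fs`, the Ricker mother wavelet, the base-2 logarithm, the reading of "absolute mean" as the mean of absolute coefficients, and all function names are assumptions. The sketch yields 41 columns (one spectral entropy plus five statistics on each of eight scales), so it does not reproduce the exact 39-feature set.

```python
import numpy as np

def spectral_entropy(window):
    """Shannon entropy (in bits, an assumption) of the normalised power spectrum."""
    power = np.abs(np.fft.rfft(window)) ** 2
    total = power.sum()
    if total == 0:
        return 0.0
    p = power / total
    p = p[p > 0]  # skip empty bins so that log2 is defined
    return float(-np.sum(p * np.log2(p)))

def ricker(points, scale):
    """Ricker (Mexican hat) wavelet; the choice of mother wavelet is an assumption."""
    t = np.arange(points) - (points - 1) / 2.0
    amplitude = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return amplitude * (1.0 - (t / scale) ** 2) * np.exp(-(t ** 2) / (2.0 * scale ** 2))

def wavelet_stats(window, scale):
    """Absolute mean, energy, entropy, standard deviation and variance at one scale."""
    points = min(10 * scale, len(window))
    coeffs = np.convolve(window, ricker(points, scale), mode="same")
    energy = float(np.sum(coeffs ** 2))
    p = coeffs ** 2 / energy if energy > 0 else np.zeros_like(coeffs)
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))
    return [float(np.mean(np.abs(coeffs))), energy, entropy,
            float(np.std(coeffs)), float(np.var(coeffs))]

def extract_features(signal, fs, window_s=0.5, scales=range(1, 9)):
    """One row of features per non-overlapping window of window_s seconds."""
    n = int(fs * window_s)
    rows = []
    for start in range(0, len(signal) - n + 1, n):
        window = np.asarray(signal[start:start + n], dtype=float)
        row = [spectral_entropy(window)]
        for scale in scales:
            row.extend(wavelet_stats(window, scale))
        rows.append(row)
    return np.array(rows)

def min_max_scale(X):
    """Scale every column to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # avoid division by zero
    return (X - lo) / span

# Synthetic four-second signal at a hypothetical sampling rate
fs = 512
rng = np.random.default_rng(1)
t = np.arange(4 * fs) / fs
signal = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)
features = min_max_scale(extract_features(signal, fs))
print(features.shape)  # (8, 41)
```

Each row of `features` corresponds to one half-second window and can be passed to any non-temporal classifier.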
Following preprocessing, it was observed that the classes were imbalanced at an approximate 5:1 ratio for non-falling and falling, respectively. Falling data was represented by 1102 data objects whereas the non-fall class had 5032, leading to a mean class label of 0.18 (where 0 is for non-fall and 1 is for fall) at a standard deviation of 0.384. For balancing, a simple random (seed = 1) undersample of 1102 non-fall instances is taken to provide a balanced dataset of 2204 data objects in total.
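As a rough illustration of the balancing step, the sketch below undersamples the majority class with NumPy. The function name and the use of `numpy.random.default_rng` are assumptions, so a seed of 1 here will not necessarily select the same rows as the original experiment.

```python
import numpy as np

def undersample(X, y, seed=1):
    """Randomly keep as many rows of each class as the minority class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy labels with the class counts reported in the text (0 = non-fall, 1 = fall)
y = np.array([0] * 5032 + [1] * 1102)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features
print(round(y.mean(), 2), round(y.std(), 3))  # 0.18 0.384

X_bal, y_bal = undersample(X, y)
print(len(y_bal), y_bal.mean())  # 2204 0.5
```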

Hyperparameter optimisation and learning
The hyperparameter optimisation processes for the three sets of experiments are explained in this section. Firstly, classical linear and batch searches of statistical model parameters are described. Secondly, the optimisation of deep neural network topologies, activation functions, and loss functions is detailed. Finally, the third set of experiments is described in which a tree-based genetic search is used to optimise statistical model pipelines (that is, those detailed in Fig. 4).
Hyperparameters for the K-Nearest Neighbour [53] (KNN) and Random Forest [54] models are initially explored through a simple linear search k = {10, 20, .., 90, 100} to discern whether hyperparameter tuning has a noticeable effect on predictive ability. KNN is an instance-based algorithm which classifies an unknown data object by its Euclidean distance to labelled points in n-dimensional space, where n is the number of attributes. Random Forests are an ensemble of Random Decision Trees (RDTs) voting on prediction, and RDTs classify data based on splitting to reduce entropy. Various ML algorithms are selected with a range of different statistical methods to provide a general overview of the classification ability using multiple methods (see Sect. 4.6 for more details). Following this, further tuning is performed via Adaptive Boosting [55] on all the selected models that are compatible with the algorithm.
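A linear search of this kind might look like the following scikit-learn sketch. The synthetic data from `make_classification` stands in for the EEG features, and the shuffled stratified folds are an assumption about how the 10-fold split was seeded; the same loop applies to the number of trees in a Random Forest by swapping the estimator.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data; the real inputs would be the windowed EEG features
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

results = {}
for k in range(10, 101, 10):  # k = 10, 20, ..., 100
    model = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 4))
```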
The second set of hyperparameter optimisation experiments involves the evolutionary optimisation of neural network parameters, and the genetic programming of machine learning pipelines. The simulations are executed five times, with a population size of 30 for 50 generations. Each initialisation of the two search algorithms is given random seeds equal to their iteration, one through five.
The controlled hyperparameter limits for the evolutionary search experiments can be found in Table 2. Larger ranges from the original studies were attempted manually prior to experimentation, revealing severely low classification metrics. Each of the networks is given 300 epochs to train at a batch size of 200.
Finally, a Genetic Programming (GP) approach is explored using a tree-based algorithm, as detailed in [56]. The GP tree is given access to all of the algorithms included with scikit-learn alongside the Extreme Gradient Boosting library [57]. The GP algorithm runs for a total of 50 generations, with a population size of 20. Mutation rate is set to 0.9, and there is a crossover rate of 0.1.
All algorithms in this work are trained by 10-fold cross-validation with a seed set to 1 for randomisation and are therefore directly comparable. For all heuristic searches, the population size was 25 simulated for 50 generations. The probability of crossover was selected as 0.8 and mutation at 0.1. The evolution strategy was (μ + λ). All algorithms were trained on an overclocked Intel Core i7-8700K CPU (4.3GHz) with scikit-learn [58], DEAP [59], and TPOT [56].
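For readers unfamiliar with the (μ + λ) strategy, the following plain-Python sketch shows its general shape: λ offspring are produced by crossover, mutation, or copying, and the best μ individuals from parents and offspring together survive. It is not the DEAP implementation used in the experiments. The value of λ, the OneMax toy problem, and all function names are assumptions; only the population size, generation count, and crossover and mutation probabilities are taken from the text.

```python
import random

def mu_plus_lambda(fitness, init, mutate, crossover, mu=25, lam=25,
                   generations=50, cx_prob=0.8, mut_prob=0.1, seed=1):
    """Elitist (mu + lambda) search; returns the fittest individual found."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < lam:
            r = rng.random()
            if r < cx_prob:                        # crossover of two parents
                a, b = rng.sample(population, 2)
                child = crossover(a, b, rng)
            elif r < cx_prob + mut_prob:           # mutation of one parent
                child = mutate(rng.choice(population), rng)
            else:                                  # plain reproduction
                child = list(rng.choice(population))
            offspring.append(child)
        # Parents compete with offspring, so the best fitness never decreases
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
    return population[0]

# Toy problem (OneMax): maximise the number of ones in a 20-bit string
def init(rng):
    return [rng.randint(0, 1) for _ in range(20)]

def mutate(individual, rng):
    child = list(individual)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

best = mu_plus_lambda(sum, init, mutate, crossover)
print(sum(best), best)
```

In the experiments, the fitness function would instead be the cross-validated accuracy of the model or pipeline encoded by each individual.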

Results
This section presents the results of all planned experiments. First, we explore the usefulness of the features extracted from the signals in conjunction with related observations. Following this, hyperparameter optimisation of models is explored through the selected methods of linear searching, neuroevolution, and genetic programming. The source code for the main experiments is given in Appendices A and B.

Data preprocessing
Following the preprocessing and subsampling (for class balance) strategies described in Sect. 3.1, the dataset comprised 2204 samples. The measurements of relative entropy by 10-fold cross-validation are presented in Table 3. It can be observed that three features, in particular, carry more information relative to the rest; those were the absolute mean on the eighth wavelet scale, the variance of the third wavelet scale, and the variance of the fourth wavelet scale. An example as to why class balancing is used can be observed in Table 4. When the dataset is unbalanced, there is a higher frequency of EEG signals linked to activities related to not falling, and thus they are much easier to classify on average. Due to this, misleading results can be achieved; for example, although the class-balanced approach has a lower classification accuracy (83.3% vs. 92.21%), the ability to recognise the falling behaviour is improved from 885 correct instances to 980. The baseline (random guess or application of the most common label) for the balanced dataset is 50% while it is 82.03% for the unbalanced dataset. Therefore, balancing in this preliminary example provides a 33.3% advantage over the baseline, whereas leaving the dataset unbalanced provides only a 10.18% advantage. When normalising, each of the values of attributes then shares a common scale, without distortion of ranges or information loss. A preliminary experiment is performed on the normalised and non-normalised data in Table 5. It is observed that the classification metrics increase slightly after normalisation is used as a preprocessing technique. Due to these examples and discussion, the normalised and equally balanced dataset is chosen for the remainder of the experiments presented in this work.
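The baseline comparison above is simple arithmetic, reproduced here as a short check. The class counts and accuracies are those quoted in the text; the function name is illustrative.

```python
def majority_baseline(class_counts):
    """Accuracy (%) obtained by always predicting the most common class."""
    return 100.0 * max(class_counts) / sum(class_counts)

unbalanced = majority_baseline([5032, 1102])
balanced = majority_baseline([1102, 1102])

# Advantage over the baseline = model accuracy minus baseline accuracy
print(round(unbalanced, 2), round(92.21 - unbalanced, 2))  # 82.03 10.18
print(round(balanced, 2), round(83.3 - balanced, 2))       # 50.0 33.3
```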

Hyperparameter tuning
A linear hyperparameter search was conducted over the number of estimators in the Random Forest. The overall best approach of this search was a random forest of 80 trees, which had a mean accuracy of 84.94%. The model also had a precision of 0.81, a recall of 0.915, and an F-Score of 0.856. These were the highest observed metrics within the linear search except for mean precision, where a Random Forest of 50 trees scored 0.81. Following the same approach, a linear search of the hyperparameter k in K-Nearest Neighbours was performed. The strongest approach revealed during the search was k = 40, which had a mean accuracy of 73.37%, a precision of 0.793, a recall of 0.634, and an F-Score of 0.704.

Adaptive boosting
Models that had the ability to predict probabilities, and thus are compatible with the adaptive boosting algorithm [60], were adaptively boosted. The boosting results can be observed in Table 6 with a comparison between models then presented in Fig. 5. Adaptive boost leads to lower results on more than one occasion. Random Forests and Naive Bayes models lead to a lower mean classification accuracy. On the other hand, Logistic Regression and Stochastic Gradient Descent models can be improved with boosting. It must be noted that boosting is computationally expensive compared to many of the approaches explored in this work.

Neuroevolution of network topology
Following five evolutionary topology searches, solutions were presented with varying sizes of neural networks with different hyperparameters. Figure 6 shows how the mean accuracy of best solutions evolved over generations. Since many of the solutions presented close results, the search often stagnated relatively early, especially in Iteration 2. The best solution found was that of Iteration 3, which scored a mean 73.41% accuracy. The hyperparameters selected for this neural network were three hidden layers of 29, 10, and 9 neurons with a hyperbolic tangent activation function. The entire source code for the neural network hyperparameters can be found in Appendix B. Note that extremely fine values of parameters such as alpha and beta values, learning rates, and momentum were selected, to an extent that would not be tested manually or by batch search. The final results of all evolutionary neural network searches can be observed in Table 7, and the trade-off between accuracy versus training and inference times can be found in Table 8. It is worth noting here that although the models take much longer to train compared to the other algorithms explored in this work, there is no payoff in terms of gaining accuracy.
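To make the mapping from an evolved solution to a network concrete, the sketch below decodes a hypothetical genome into a scikit-learn `MLPClassifier`. The genome layout, the rule that a width of zero removes a layer, and the synthetic data are assumptions. Only the layer sizes, the activation function, the 300 epochs, and the batch size of 200 come from the text; the finely tuned alpha, beta, learning rate, and momentum values are in Appendix B and are left at library defaults here.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def build_network(genome):
    """Decode a genome (hypothetical layout) into an untrained MLP."""
    # Assumption: a layer width of 0 means that the layer is absent
    layers = tuple(n for n in genome["layers"] if n > 0)
    return MLPClassifier(hidden_layer_sizes=layers,
                         activation=genome["activation"],
                         max_iter=300, batch_size=200, random_state=1)

best_genome = {"layers": (29, 10, 9), "activation": "tanh"}
model = build_network(best_genome)

# Placeholder data standing in for the windowed EEG features
X, y = make_classification(n_samples=400, n_features=20, random_state=1)
model.fit(X, y)
print(model.hidden_layer_sizes, round(model.score(X, y), 3))
```

Under this decoding, two different genomes such as (29, 0, 9) and (29, 9, 0) describe the same two-layer network, which is one way duplicate individuals can arise in the population.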

Genetic programming
As previously described, the genetic programming approach explored 50 generations with 30 solutions as a population size. The learning process for five iterations of the GP algorithm can be observed in Fig. 7, and the best final solutions are further detailed in Table 9. Although starting at the highest fitness, Iteration 1 had the lowest final score of 88.79%, with Iteration 2 (which started at the lowest fitness) scoring slightly more by the end of the simulation at 88.79%. The best solution found was that by Iteration 3, which scored 89.34%. Due to their complexity, the solutions are presented by their iteration ID in this work; the source code for all three machine learning pipelines can be found in Appendix A. Although wavelet features are extracted manually, it can be observed that there was further feature engineering through Principal Component Analysis (PCA) and polynomial combinations, which are often also suggested in the literature [61][62][63]. Table 10 shows the trade-off between model complexity, measured as training and inference times, and the average ability of the model. It can be noted that, although some algorithms were more complex and required considerably more resources, there was a diminishing return on ability. In fact, these models were outperformed by algorithms that could train in under one second. GP1-5 denote the best solutions after five individual Genetic Programming searches. Although the solution found by GP1 had the highest accuracy at a mean value of 90.52%, it took considerably longer to train than the other solutions at around 4.4 s. The second-best solution, GP2, took the least training time of around 0.87 s and achieved a mean accuracy of 89.79%. The second-best solution also had the smallest inference time, at only 0.0003 s per prediction. The best model extracted polynomial features and performed PCA on them, before the prediction probabilities of an extra trees classifier were presented as input to a KNN with k = 72.
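The structure of the best pipeline can be approximated in scikit-learn as shown below. This is a sketch of the described stages only, not the exported pipeline in Appendix A: the `ProbabilityFeatures` helper is a hypothetical stand-in for the stacking step, every hyperparameter other than k = 72 is left at an illustrative or default value, and the data are synthetic.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

class ProbabilityFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replaces the inputs with a classifier's class probabilities."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        return self.estimator_.predict_proba(X)

pipeline = make_pipeline(
    PolynomialFeatures(degree=2),   # polynomial feature combinations
    PCA(),                          # principal components of those features
    ProbabilityFeatures(ExtraTreesClassifier(n_estimators=100, random_state=1)),
    KNeighborsClassifier(n_neighbors=72),  # classifies the probability vectors
)

# Placeholder data standing in for the windowed EEG features
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=10)
print(round(scores.mean(), 4))
```

Because the final KNN stage operates on a two-dimensional probability vector rather than the full feature space, the stacking stage dominates the inference cost of such a pipeline.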

Comparison of all models
A final comparison of all models is provided in Table 11. For readability purposes of the table, a key to the abbreviations is as follows: RF-Random Forest, AB-Adaptive Boosting, KNN-K-Nearest Neighbours, LDA-Linear Discriminant Analysis, LR-Logistic Regression, L SVM-Linear Support Vector Machine, RBF SVM-Radial Basis Function Support Vector Machine, SGD-Stochastic Gradient Descent, QDA-Quadratic Discriminant Analysis, NB-Gaussian Naive Bayes. As can be observed, the best models were those that were explored through genetic programming. Interestingly, the adaptive boost of the Naive Bayes model was worse than random guessing, and this was the only instance of such an occurrence. The Receiver Operating Characteristic (ROC) and Precision-Recall curves are useful

Computational complexity versus accuracy tradeoff
The trade-off between model ability and complexity can be observed in Table 12. Additionally, Figs. 10 and 11 show visual representations of accuracy versus training time, and accuracy versus inference time, respectively. The best single model bar genetic programming, the Gaussian Process, scored a mean accuracy of 86.48% but required around 359 s to train, which was by far the highest computational requirement of all models. Although the training time was high, inference took only 0.33 ms per data object. As can be observed, the genetic programming solutions have training times similar to most other models while attaining the highest classification accuracy scores on average. Inference time is an important aspect when it comes to the real-world application of the approach. The time taken to infer a data object is the same as the time taken to detect when someone has fallen. Thus, it is important to consider the accuracy/inference trade-off when choosing a model given that it will affect the response time of a fall detection system. Therefore, even though the second solution is slightly worse at an average accuracy decrease of 0.73%, it is likely a more appropriate choice for use due to the inference time of 0.33 ms, as opposed to the 18.92 ms inference of the best performing model.
Fig. 11 Tradeoff between accuracy and inference time for all algorithms
Finally, the best model, GP1, is validated by training on three subjects and testing on the remaining subject, with the results presented in Table 13. Through leave-one-subject-out cross-validation, we observe a mean accuracy of 75.04%. The highest ability was found when Subject 1 was the test subject, scoring 82.95% accuracy. It was relatively more difficult to generalise to Subject 4, where the model reached only 67.02%.
These results show that the dataset is not diverse enough for generalisation between individuals, and experiments to benchmark this are limited by the fact that only four subjects are present. This suggests that future experiments should concern a larger group of subjects, with the aim for further generalisation.
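The leave-one-subject-out protocol can be sketched as a simple index generator. The function name and the toy subject labels are illustrative; scikit-learn's `LeaveOneGroupOut` splitter provides the same behaviour when the subject ID is passed as the group.

```python
import numpy as np

def leave_one_subject_out(subjects):
    """Yield (held-out subject, train indices, test indices) for every subject."""
    subjects = np.asarray(subjects)
    for held_out in np.unique(subjects):
        test = np.flatnonzero(subjects == held_out)
        train = np.flatnonzero(subjects != held_out)
        yield held_out, train, test

# Toy example: four subjects with unequal numbers of data objects
subjects = np.array([1] * 5 + [2] * 4 + [3] * 6 + [4] * 3)
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Each fold trains on three subjects and tests on the fourth, so the reported mean is an average over four such folds.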

Conclusion and future work
To conclude, this work has explored how techniques such as genetic programming and neuroevolution may provide solutions to learning from low-cost and low-quality EEG electrodes for consumer use. Fall detection from this electrode was observed to be a difficult task, but hyperheuristic solutions found machine learning pipelines and hyperparameter sets to enable this possibility. Although the problem was difficult, due in part to activities such as lying down being present in the category of not falling, genetic programming developed a machine learning pipeline that could detect falls with an average accuracy of 90.52% in only 0.019 s per data object. The second best solution, also a genetic programming pipeline, achieved a slightly lower 89.79% mean accuracy but could classify data objects in only 0.3 ms.
The results presented in this work provide a good basis for further experiments, given that some approaches performed considerably worse than others. The most interesting avenue for future work concerns the leave-one-subject-out validation, which found that some subjects could be generalised to with high accuracy, whereas others fell short of the overall metrics observed through 10-fold cross-validation. With a larger dataset, future work could examine the extent to which the approach generalises when a greater number of subjects is observed. Given a larger number of subjects, generalisation could be made an explicit goal of the genetic search by deriving fitness scores from leave-one-subject-out cross-validation. Due to the nature of the algorithm, there is an experimental limitation within the search for neural network hyperparameters: since layers two and three were both a problem space of 1-128, the solution space was thus x, n in some cases provided that layer 2 was 0 and layer 3 was > 0. Therefore, a small number of solutions were identical but treated as unique individuals. In addition to running algorithms for a longer period of time, future work could also concern testing other hyperparameters for the evolutionary strategies, although this would add a further layer of complexity to the approach and would require considerably more computational resources. These include the population size, crossover and mutation rates, as well as the overall mutation