1 Introduction

During 2019, around 968 million people suffered from some form of mental disorder, that is 1 out of 8 people around the world. One year later, because of the COVID-19 pandemic, this percentage increased significantly, rising to 26% for Anxiety disorder and 28% for Major Depressive Disorder (MDD) [1]. One of the most common disorders suffered by the population is MDD. Indeed, suicide is the third cause of death among 15-29-year-olds, influenced by MDD. Regarding young people, around 1 out of 5 children and adolescents suffer from some mental health issue. In addition, people with severe mental disorders have more chances to die prematurely in comparison with neurotypicals, specifically 10 to 20 years earlier in high-income countries and up to 30 years earlier in low-income countries [1]. Therefore, it is paramount to make a reliable diagnosis as early as possible. To the best of our knowledge, there is still a gap between people needing to be diagnosed, and access to effective and low-cost healthcare. With this work, we aim to help create an accurate, reliable, and accessible diagnosis, through the collection of Feature Engineering (FE) techniques, as well as Artificial Intelligence (AI) models that other researchers have used mainly to diagnose mental disorders.

The Diagnostic and Statistical Manual of Mental Disorders Fifth Edition (DSM-5) acts as a standard reference for psychiatry and it includes more than 450 different definitions of mental disorders [2]. This fact highlights the impact that mental disorders have on individuals and society in general. According to [3], over a third of the total European population suffers from mental disorders nowadays, and only a third of all cases receive some kind of treatment, concluding that the burden of mental disorders has been considerably underestimated. Therefore, having efficient methods for an early diagnosis of mental disorders would be extremely helpful.

There are several procedures to diagnose mental disorders such as neurological exams [4], neuropsychological assessments [5] and neuroimaging modalities [6]. Physicians specialized in mental disorders usually turn to neuroimaging approaches for help and to improve the efficacy of the treatments. Neuroimages can be divided into two categories, depending on what type of data they collect: functional and structural. Functional neuroimages show information about the activity of the brain, while structural neuroimages capture the interior structures of the brain. Some of the most used functional neuroimages are Magnetoencephalography [7], functional Magnetic Resonance Imaging (fMRI) [8], Electroencephalogram (EEG) [9], and Positron Emission Tomography (PET) [10]. On the other side, part of the most usual structural neuroimaging modalities is structural MRI (sMRI) [11], Diffusion Tensor Imaging (DTI) [12] and Computed Tomography (CT) [13]. This study assesses papers that deal with functional data, specifically with EEG modalities, for the reasons set out below.

Electroencephalography consists in recording brain activity by measuring voltage fluctuations of brain regions via the placement of small electrodes around the scalp. This record is called EEG and it is widely used in the study and diagnosis of brain disorders such as Epilepsy [14], Dementia [15], Schizophrenia [16], and Alzheimer’s Disease (AD) [17]. The most remarkable advantages of EEG are the following. First, EEG devices are relatively portable, easy to set up and non-invasive. Second, these devices are characterized by their high temporal resolution, being capable of recording brain signals with up to 1 millisecond of resolution, even though their spatial resolution is worse compared with other methods such as MRI. Finally, EEG devices are relatively inexpensive, compared to other technology devices used to collect brain data, such as CT scanners or fMRI and PET devices. As a result, the use of EEG is a good candidate for an efficient and affordable diagnosis of mental disorders; especially since it is easy to use in underdeveloped countries, where quality healthcare is not fully accessible.

Traditionally, EEG was visually interpreted by highly specialized experts, and it was characterized as being a difficult and time-consuming task, as the volume of information that EEG data provides is considerably large. Because of this, the use of AI techniques has been proposed to automate the process and to aid in the diagnosis and study of mental disorders. Such techniques fall into two subsets of AI itself, defined as Machine Learning (ML), and a subset of ML, Deep Learning (DL). One of the most common tasks in the field of EEG and mental disorders is classification, i.e., an ML model takes several features derived from EEG data as input and outputs a prediction, e.g., whether a patient has a mental disorder or not. The input features are extracted from the raw EEG by applying FE. Extracting and choosing the right set of features for a given problem is one of the most relevant factors, as it can make the difference between an unusable ML model and a simple, effective model. In other words, it can be said that the FE is crucial, especially when using data such as EEG.

Indeed, properly applying FE on EEG data to train AI models related to brain disorders is still a challenging task, as there is no general FE pipeline that performs well on every task. For example, the authors of [18] showed that the beta band power was a relevant feature for detecting individuals with Insomnia, as they had significant and robust increases in that feature, whereas [19] showed that features such as the Variance, Energy, Nonlinear Energy and Shannon Entropy of the raw EEG signals were relevant for the task of epileptic seizure detection. In other words, the set of relevant features depends on the task and/or dataset, and properly applying FE remains a challenging task.

Given the importance of FE in the diagnosis of mental disorders by means of ML, it is clear that a secondary study that compiles the works in this area would foster the development of new techniques and lead to improvements in the diagnosis. Therefore, in this paper, we present a Systematic Mapping Study (SMS) with the purpose of clearly showing which FE techniques and ML models have been applied to each mental disorder in order to provide a way to easily find new research opportunities within the field. There are some secondary studies (reviews, surveys, and similar studies) on the topic of EEG and ML models applied to brain disorders, such as [20, 21]. Nevertheless, our work contains more significant contributions, as we can see in Table 1. Moreover, we will also share some insights and issues that we found after carefully analyzing the results of the SMS, as well as providing recommendations related to research directions. In order to help researchers to introduce new research opportunities discovered via this SMS, it is also included a brief description of other secondary studies that we found when collecting papers which can act as a starting point for future investigation gaps. It is worth noting that we do not report the efficiency achieved in each paper. That is mainly because it would not be correct to compare the accuracy obtained with different databases, since almost every paper selected uses a different one. In addition, as we present an SMS, we have only analyzed the abstract of each paper due to the number of works selected and we were not able to gather the databases used by reading only the abstract.

Table 1 Main contributions of our work in comparison with the closest related works [20, 21]

The rest of this paper is structured as follows: Sect. 2 presents the relevant concepts and background used in the diagnosis through the EEG field. Works related to the present study will be presented in Sect. 3. The methodology used in this work is thoroughly described in Sect. 4, including the statement of research questions, how the search was conducted, the definition of inclusion and exclusion criteria, screening of papers, selection of keywords and the results of the whole process. In addition to the methodology, we also provide a description of reviews focused on this topic that could act as a starting point for new researchers in the field. Finally, a discussion of the results is presented in Sect. 5, followed by the conclusions of the study in Sect. 6.

2 Brain disorders, EEG, FE and ML

In this Section, we introduce the necessary concepts required to follow this paper. First, we briefly describe the brain disorders selected in Sect. 2.1. A description of EEG will be presented in Sect. 2.2, followed by an overview of FE techniques in Sect. 2.3 and a brief explanation of ML in Sect. 2.4.

2.1 Brain disorders

One of the objectives of this paper is to bring together all the FE techniques that have been used to classify, by means of AI, different brain disorders, as they affect society to a great extent. Along this subsection, we present a mental disorders overview and a brief description of the brain disorders selected.

According to [2], a mental disorder is described as a syndrome characterized by a clinically significant alteration in an individual’s cognitive state, emotional regulation, or behavior that reflects dysfunction of the psychological, biological, or developmental processes underlying his or her mental function. Mental disorders can affect important areas of life, such as school or work performance, relationships with family and friends, and the ability to participate in the community. Fortunately, many mental disorders can be treated effectively at low cost, but there is still a gap between those who need care and those who have access to it. Greater investment is needed on all fronts, including increased awareness of mental health, access to effective treatments, and research to identify new treatments and diagnoses. The role of mental health in achieving global development goals is gaining recognition, proof of which is that mental health is included in the Sustainable Development Goals [22].

We conducted this study by choosing papers that study brain-related disorders. They are mostly mental, although we have also taken others that are not classified as such. We decided to include neurological disorders in addition to mental ones, as they are among the most studied cases and we were able to extract a high number of FE techniques, which is the main objective of this study.

In the following, we will present all the disorders we have taken for our study, indicating which family of brain disorders they belong to. We have based this classification on [2, 23]. All the disorders we mention are found in our study because they come from papers in which work is done with FE techniques together with one or more brain disorders. As explained in more detail in section 4.3, we have taken all disorders appearing in the abstracts of the selected papers and then grouped them according to the disorder family to which they belong. It should be noted that if the same disorder appears in a considerable number of papers, then we use it as a category on its own, without adding it to the family it belonged to.

  • Anxiety Disorders. We have found papers that work diagnosing or classifying Anxiety Disorder such as [24, 25].

  • Bipolar and Related Disorders. We have located papers studying Bipolar disorder such as [26, 27].

  • Depressive Disorders. Within this category, we have picked studies working on MDD such as [28,29,30].

  • Neurodevelopmental Disorders. We have found papers focused on Attention-Deficit/Hyperactivity Disorder (ADHD) [31, 32], and Autism Spectrum Disorder (ASD) such as [33, 34]. We also include papers related to concentration such as [35] and mental tasks like [36] in the ADHD category.

  • Neurological Disorders. Within this group, there are brain disorders that are not considered mental disorders. We have identified papers that work with AD [37], Mild Cognitive Impairment (MCI) [38] and dementia in general [39], which are neurodegenerative disorders. We call this group dementia. We have also found papers studying Parkinson’s Disease (PD) such as [40], which is another neurodegenerative disorder. On the other hand, we have located studies on Migraine such as [41]. Within this category, we have a brain disorder that appears in most of the papers we have selected for this study, namely Epilepsy. We have selected papers classifying, predicting, or diagnosing epilepsy or its seizures such as [42,43,44].

  • Obsessive-Compulsive and Related Disorders. In this family of disorders, we have located papers studying Obsessive-Compulsive Disorder (OCD) such as [45].

  • Schizophrenia Spectrum and Other Psychotic Disorders. Several papers were found under the Schizophrenia category such as [46,47,48].

  • Sleep-Wake Disorders. We have identified papers that work with FE techniques and (i) Sleep Apnea [49], which is a Breathing-Related Sleep Disorder, (ii) Insomnia Disorder [50], and (iii) Non-Rapid Eye Movement Sleep Arousal Disorder [51], which is included in the Parasomnias family. We joined these three disorders in a general group that we call Sleep Disorders.

  • Substance-Related and Addictive Disorders. Within these disorders, we have located papers dealing with FE techniques and Alcohol Use Disorder (AUD) such as [52], Drug-Related Disorder [53], and Non-Substance-Related Disorder such as Gaming addiction [54]. As we have done with the Sleep Disorders group, we have grouped these three disorders in a new group that we will call Addictions. It is worth noting that in the DSM-5, the word addiction is omitted from the official classification due to its potentially negative connotation. Despite this, it is a term commonly used in many countries to describe severe problems related to compulsive and habitual use of substances or behaviors.

  • Trauma- and Stressor-Related Disorders. We have identified studies focused on Post-Traumatic Stress Disorder (PTSD) such as [55, 56], and Stress-Related tasks such as [57, 58], that we just call Stress.

2.2 EEG

The human brain consists of millions of neurons, each of which acts as an electrical dipole that varies its polarity depending on whether the goal of the neuron is to make an excitatory synapse or to make an inhibitory synapse. To register this bioelectric activity, a neurophysiological and non-invasive technique, called EEG, is used. To obtain an EEG it is only required to place small electrodes on the patient’s scalp, with the help of a helmet and/or a conductive gel. This helmet can be made from 1 electrode to hundreds of them. To use an international system that facilitates reproducibility, it is recommended to place the electrodes following the guidelines of the American Clinical Neurophysiology [59] or the International Federation of Clinical Neurophysiology [60]. An EEG contains the sum of all electric changes or potentials produced among the closest neurons. Therefore, the brain activity captured by the EEG is a combination of the information movements that occur in our brain in a certain period of time. It would be similar to the noise that we receive in a big city with a lot of traffic and annoying sounds from time to time.

One of the targets of analyzing the EEG signal is to find the most relevant characteristics of each signal. All EEG signals have two essential measurable characteristics: amplitude and frequency. The amplitude is directly proportional to the number of neurons that emitted their charge at the same time, and frequency counts the number of oscillations that the signal has per second [61].

Fig. 1
figure 1

Brain waves in EEG. Image extracted and modified from [65]

Fig. 2
figure 2

The FE process can be thought as a pipeline: transformation of raw data, feature extraction and finally feature selection. Created by authors with Google Drawings

The signals received by the EEG come in a mitigated manner, so the electrodes located on the scalp have to be very powerful. Therefore, the EEG contains a lot of unwanted noises. That is why we have to do a cleaning process to select only the signal frequencies that we need. In the literature we can find, in a general manner, a classification of signal frequencies into bands. Below, we describe each band and relate them to different behaviors and mental states of the brain, based on [62,63,64]. Also, we provide these band waves in a graphical manner in Fig. 1.

  • Delta wave. Every wave that has between 0.1 and 4 Hz is classified as a Delta wave. These types of waves are characterized by having the highest amplitude and by being the slowest waves. These waves are related to the grey matter of the brain and appear during the sleep stages. Delta waves are abnormal in adults when they are awake. When these waves are present, the growth hormone and the production of melatonin are stimulated. We could find them on the frontal side of the brain in adults, and on the posterior side in children.

  • Theta wave. Theta waves vary between 4 and 8 Hz. We can see, through EEG, these types of waves in meditation and deep relaxation. These brainwaves are normal in children under their teens and abnormal for adults. These waves are in the thalamic region, a part of the brain located in the central area of the brain base, between the two hemispheres, involved in regulating the activity of the senses.

  • Alpha wave. Waves between 8 and 13 Hz are called Alpha waves. These waves are related to the white matter of the brain. Alpha waves are present in all age groups, especially in relaxed and close-eyed adults. These waves are slightly higher in the non-dominant hemisphere. We can better gather Alpha waves from the occipital and parietal regions of the brain.

  • Beta wave. Brainwaves between 13 and 30 Hz are called Beta waves. They are associated with behaviors and actions and these waves are related to our five senses. They appear when we talk, make decisions, solve problems, judge, and be on alert or focused. Beta waves are seen in the frontal and parietal lobes.

  • Gamma wave. Waves that have more than 30 Hz are classified as Gamma waves. This type of brainwaves is characterized by having the smallest amplitude and by being the fastest waves. They are associated with perception and consciousness and appear during hyper alertness. Gamma waves induce the production of serotonin and endorphins. They are seen in the somatosensory cortex, located in the anterior part of the parietal lobe, and it is responsible for receiving and processing sensory information.

Unlike other techniques such as MRI or PET, acquiring an EEG device is much cheaper and it is possible to use it in other places less prepared than a hospital. In addition, EEG is a technique that gathers signals with very low frequencies without being an invasive technique. The characteristics of the EEG technique make this method capable of recording brain activity for hours or days, with the advantage of collecting information not only on brain activity but also the time related-information. As a result, the EEG gathers large amounts of data because of its precision and sophistication. Unfortunately, unwanted data is also collected, so a preprocess is needed to acquire only the necessary data. The EEG is a good choice if we want high temporal resolution. Although it does not have a spatial resolution as good as other techniques such as MRI or PET, there are some triangulation techniques that could improve the spatial resolution of EEG signals.

Fig. 3
figure 3

Year-wise distribution of works related to the topic of signal processing of EEG for ML. Citation Report graphic is derived from Clarivate Web of Science [69], Copyright Clarivate 2023. All rights reserved

2.3 Feature Engineering

FE is the act of extracting features from raw data and transforming them into formats that are suitable for ML models [66]. Choosing the right set of features can make the difference between an unusable model and a simple, effective one. In other words, FE is a crucial step in the ML pipeline, especially when working with noisy, non-stationary data such as EEG.

FE comprises a vast set of techniques and methods, which can be divided into three subgroups: transformations, feature extraction and feature selection. The appropriate set of techniques depends largely on the problem, the type of data and the AI model to be used. Generally, the process of FE can be viewed as a pipeline composed of the aforementioned subgroups: (i) the raw data is transformed into a better format, then (ii) several features are extracted from said format, and finally (iii) the right set of features are extracted, ready to input to the ML or DL model (Fig. 2). A brief introduction to each part of the pipeline is presented below:

Transformations. It comprises any transformation that is applied directly to the raw data. In the case of EEG data, it is common to transform the signal from time domain to other domains, such as the frequency or time-frequency domains via Fourier Transform (FT) [67] or Wavelet Transform (WT) [68], respectively.

Feature extraction. It encloses any technique used to obtain hidden features from the transformed data. In the case of EEG data, features such as Entropy, Energy, or Fractal Dimension (FD) are frequently extracted.

Feature selection. Any method or metaheuristic that is used to select the most significant features for the current problem. This is specifically important in the case of EEG data, as the number of extracted features is typically large.

As can be seen in Fig. 3, there has been an increasing number of works focusing on the topic of EEG signal processing for ML. This upward trend is expected to continue as signal processing techniques play a key role in mental disorder detection by means of EEG. As previously stated, properly performing FE is crucial for the performance of an ML model. This becomes even more important when working with highly non-linear and non-stationary signals such as EEG. Given that the proper set of FE techniques depends on the task, the data and the model used, it remains a challenge to select the right set of FE techniques for mental disorder detection.

2.4 Machine Learning

According to [70], “A ML algorithm is an algorithm that is able to learn from data”. ML has been successfully applied to a wide set of different problems, such as cancer detection [71] or credit risk assessment [72]. This ability to learn from data makes ML algorithms an excellent candidate for EEG data, as the volume of information it provides is large and difficult to be interpreted by humans. Next, we will provide a brief introduction of the concepts needed to follow this work.

Fig. 4
figure 4

ML algorithms can be grouped depending on its learning method. Created by authors with Google Drawings

Fig. 5
figure 5

Year-wise distribution of works related to a) the use of ML and b) the use of DL on EEG. Citation Report graphic is derived from Clarivate Web of Science [69], Copyright Clarivate 2023. All rights reserved

As shown in Fig. 4, ML algorithms can be grouped by their learning method:

Supervised Learning. This group comprises any ML algorithm that learns from labeled data, i.e. the learned ML model tries to predict a given label. The model can be further subdivided by the type of label, if the label is categorical, it is a classification algorithm and if the label is numerical, then it is regression.

Unsupervised Learning. Opposite to the previous group, unsupervised learning includes any algorithms that learn from unlabelled data. This group can be divided into several subgroups depending on the main task of the algorithm. The best known within this type are (i) clustering, which refers to the task of grouping similar data into high-level clusters, and (ii) dimensionality reduction, which consists in reducing the number of features while minimizing information loss.

Reinforcement Learning. It refers to algorithms based on intelligent agents that learn by interacting with their environment.

It is important to remark that there is no clear line that separates these groups, and some ML algorithms may be difficult to classify into such groups. For example, there are ML algorithms that learn in a semi-supervised manner, that is, they learn on data that is partially labeled. However, it is still a valid classification for most of the problems.

DL is a subset of ML that emerged as a solution to some of the problems that classical ML algorithms struggled to solve. Specifically, ML algorithms may have difficulties in problems where the data is so complex that the manual engineering of features becomes unfeasible. As an example, the task of classifying a digit from an image is considerably difficult with ML. One could manually build detectors of multiple shapes, i.e., classify the image as 1 when a vertical line is detected. However, it would be costly and ineffective, since rotating or changing the typography of the digit would make the ML algorithm fail. On the other hand, DL algorithms are able to automatically learn representations from the data. Then, instead of having to manually craft the needed representations for classifying the image, the DL algorithm is able to automatically learn them, making it a preferable choice for this kind of problem. Therefore, DL frameworks could revolutionize the clinical applications for EEG-based diagnosis.

Figure 5 shows a year-wise comparison of the number of works related to the application of ML or DL on EEG data. It can be seen that ML is more prevalent than DL. However, since approximately 2012, the popularity of DL has been exponentially increasing and is expected to become prevalent if the current trends continue.

Recent developments in DL have led to promising results in the area of medical diagnostics, especially in the diagnosis of mental and neurological disorders. Regarding Depression, [73] has proposed an EEG-based DL framework that automatically discriminates depressed and healthy controls and provides the diagnosis, achieving \(98.32\%\) accuracy, using a model based on a Convolutional Neural Network (CNN). This high accuracy makes it possible to use this DL model as an automatic diagnostic model for Depression. Going further, [74] has developed a DL model based on the attention technique that classifies Twitter data and predicts the depressed and non-depressed users, reaching \(99.86\%\) accuracy. This shows that early detection of depression is possible simply by analyzing social media posts, which could improve or save people’s lives. A similar case occurs with Dementia. [75] has used a CNN-based model to diagnose AD from neuroimages, achieving \(95.73\%\) accuracy, which could serve as a computer-aided system for physicians who need to make an early diagnosis. [76] has used a DL approach to predict brain age using MRI of brain grey matter, showing that the difference between the predicted and the chronological brain age serves as a biomarker for early-stage neurodegeneration. Another successful case is Epilepsy, where the early detection of seizures through non-invasive and wearable devices could improve the management of Epilepsy. [77] has developed a Long-Short Term Memory (LSTM)-based DL model to detect and classify epileptic seizures in an ambulatory and in-hospital environment using wearable devices. The proposed model has yielded a mean Area Under the Curve (AUC) of 0.97 and 0.98 for ambulatory and in-hospital patients respectively. Thus, it demonstrates that the detection of motor epileptic seizures is possible through the use of wearable devices.

To summarize, DL models are excellent candidates for the detection of brain disorders, and especially mental disorders, due to their automatic representation learning capabilities. On the other hand, traditional ML models do not have that capability and require their input features to be hand-crafted, but they are simpler to train and require less amount of data.

Despite its success, DL also has a disadvantage over ML, as DL algorithms are considerably less explainable than traditional ML algorithms. In ML, it is easier to explain or analyze why the model has made a certain prediction (e.g., which features have contributed the most to the prediction and how), whereas this becomes a complicated task in DL. The lack of explainability considerably hinders the application of DL to fields such as medicine [78, 79], where an erroneous decision can have critical consequences. In order to use a DL system to support decisions, the domain expert has to be able to understand the predictions of the model to avoid errors. However, great progress is being made in the field of eXplainable Artificial Intelligence (XAI) applied to the medical field [80, 81]. Finally, it is important to remark that, as mentioned above, DL models are able to automatically learn the adequate representations for each problem, thus reducing the need for FE. However, DL architectures can still greatly benefit from said techniques, as they can increase their performance and/or enable us to use a simpler architecture.

Since the aim of this work is to perform an SMS focused on FE, ML and EEG applied to mental disorders, the main task that will be encountered is to diagnose a mental disorder, i.e., to predict whether an individual has a mental disorder or not by analyzing their EEG data. Thus, it will be a supervised problem, since the main objective is to predict a label, and a classification problem, because the label is categorical (the individual has a mental disorder or not). Therefore, most of the algorithms found during the present work are expected to be classification algorithms. A brief description of some of the most common ML classification algorithms is presented below. Since presenting too extensive explanations would be out of the scope of this paper, we refer to [82, 83] as excellent resources to obtain in-depth descriptions of ML models.

  • Logistic Regression. Logistic Regression is one of the most popular models due to its simplicity and interpretability. Logistic Regression models the log odds of the event as a linear function:

    $$\begin{aligned} \log \left( \dfrac{p}{1-p} \right) = \beta _0 + \beta _1x_1 + ... + \beta _px_p \end{aligned}$$
    (1)

    where p is the probability of an event or a specific class (e.g., a patient has ADHD), \(\beta _i\) are the coefficients learned by the model and \(x_i\) are the values of the input features. The coefficients of the model are typically learned by minimizing the negative log-likelihood. Once the parameters are learned, the probability p for a given sample can be predicted as

    $$\begin{aligned} p= & {} \dfrac{1}{1 + \exp {-(\beta _0 + \beta _1x_1 + ... + \beta _px_p)}}\nonumber \\= & {} \sigma (\beta _0 + \beta _1x_1 + ... + \beta _px_p) \end{aligned}$$
    (2)

    where \(\sigma \) is the sigmoid function. Despite the simplicity of the model, Logistic Regression models fail to detect more complex, nonlinear patterns, so they have limited predictive power.

  • Decision Tree (DT). The aim of DTs is to partition the data into smaller, more homogeneous groups. In other words, the data space is partitioned via a series of if-then statements, where each partition has an associated predicted label. DTs are mathematically simple models, which makes them easily interpretable. These models are also capable of handling different types of data, are robust to outliers and perform automatic feature selection. However, these models tend to be unstable, i.e. a slight difference in the data can drastically change the structure of the tree. Thus, interpreting the model becomes more difficult as the tree grows bigger. Moreover, the predictive performance may not be optimal because the trees partition the data into rectangular regions, which can hinder the detection of some patterns. Ensemble methods such as Random Forest, which combines many DTs into one, have been designed to combat these disadvantages.

  • Support Vector Machine (SVM). SVMs were introduced in the 90s for binary classification [84] and have been extended to regression and multilabel classification. SVMs were designed in the context of robust regression, i.e. building a model that is robust to outliers. Moreover, SVMs are powerful models which are able to extract nonlinear patterns from the data.

  • Naive Bayes. The Naive Bayes is a probabilistic classifier that is based on applying the Bayes’ rule. This rule answers the question “based on the predictors that we have observed, what is the probability that the outcome is class \(C_l\)?”, which is mathematically described as [83]:

    $$\begin{aligned} Pr\left[ Y = C_l \mid X \right] = \dfrac{Pr\left[ Y\right] Pr\left[ X \mid Y = C_l\right] }{Pr\left[ X\right] } \end{aligned}$$
    (3)

    where X represents the input features and Y the class variable.

  • k-Nearest Neighbors (k-NN). The aim of this model is to classify a new sample based on the labels of the k-closest points in the training set given a distance metric. In the case of classification, we assign the most common class among the neighbors as the predicted class.

DL can also be used for classification, using architectures such as Multilayer Perceptron (MLP), CNN, and Recurrent Neural Networks (RNN). Refer to [70] for an excellent in-depth description of DL models. It is also expected to find algorithms beyond classification, such as clustering since it can be used to group individuals based on their similarities and extract insights.

3 Related works

To the best of our knowledge, there are only three works related to our study. The first one is an SMS focused on the diagnosis and prognosis of mental disorders using EEG and DL techniques [21]. This study is led by four research questions. The first one focuses on which mental disorder is diagnosed and prognosed by means of EEG and DL. The second research question identifies which DL techniques are applied. The third research question collects other biometric data used to diagnose and prognose mental disorders along with EEG signals. The last research question focuses on the source of the datasets used to carry out each reviewed paper. Afterwards, the authors elaborate several graphs with the distribution of the studies according to the answers to the research questions. It is important to say that, in this SMS, 46 out of 373 works were selected to do the mapping.

The second study that we found in the literature is a systematic review of ML algorithms used to analyze data from wearable devices and sensors [85]. 67 studies were selected out of the 1530 pre-selected works. To carry out that review, the authors propose four research questions that cover the whole processing of healthcare data. The first research question deals with the types of sensors or wearable devices used for gathering data. The second question consists in the use of one feature type for different kinds of data. The third research question refers to the use of ML algorithms to analyze healthcare data. The fourth question addresses how to combine, process, and analyze heterogeneous types of healthcare data. Subsequently, several graphs are displayed, showing the distributions of sensors used for monitoring symptoms, types of features extracted, ML and neural network algorithms used for the chosen analysis and evaluation criteria. In addition, they show the distribution of algorithms with the best performance.

Finally, we have found [20], which is the study most related to ours. Their SMS focuses on highlighting which neurological disorders, feature extraction techniques, feature selection methods and classifier algorithms have been studied the most in the past using EEG signals. In this research, 144 studies have been evaluated to compose the paper. Regarding the research questions, two are proposed: one is related to the feature extraction technique used and the other one is focused on the feature selection method applied. In addition, they add a section in which some quality criteria are proposed to give a quality grade to each reviewed paper. Graphically, they show the distribution of papers and mental disorders studied according to the year of publication and a bubble chart that displays the joint occurrence of feature extraction and selection techniques in the reviewed studies.

The first SMS, [21], is closely related to our work, but it focuses on DL techniques, whereas our study is focused on FE. On the other hand, although the second paper [85] includes EEG signals as e-health data, it focused on ML techniques and considered papers that study disorders like heart disease, diabetes, blood diseases or hypertension, which are not considered mental disorders. In addition, it selects studies that use wearable devices. Therefore, it is focused on ML algorithms that could provide meaningful results in a reasonable time period and with reasonable complexity to be used together with these wearable devices. Finally, the third SMS [20], although it is the most related to ours, is written in Portuguese, so it is not very accessible to the scientific community. Moreover, it excludes studies that are published before 2013 and we have no restriction in this regard. Whereas the other papers leave out a significant portion of the advances achieved, unlike ours which is more recent and covers a broader time span.

Therefore, after having presented the most related works to ours, we could say that our study is the first English-written SMS focused on FE of EEG signals used to identify mental disorders. Furthermore, the number of studies that we have gathered -6133- and the number of papers reviewed -905- are far superior to the other SMS that they could be compared with.

4 Methodology

SMS is a secondary study which aims to provide an overview of a research area by identifying the quantity, type of research, and results available within [86,87,88]. The difference between a primary and a secondary study is that the former presents direct advances in the research area, whereas the latter gathers data from such primary studies to extract insights. Other types of secondary studies, such as a Systematic Literature Review (SLR) [89], can also be used to provide such an overview. The main difference between them is that the SLR is focused on a detailed reading of a small number of papers, whereas the SMS is focused on a less detailed reading of a large number of papers. In this case, we decided that the SMS would provide better results since the research area of interest is quite broad and a large number of primary studies are expected.

This work will follow the methodology presented in [87]:

  1. 1.

    Definition of Research Questions. These questions will guide the whole process in order to achieve the desired goals.

  2. 2.

    Conduct Search. Searching for primary studies that could be related to the research questions by using string queries on several scientific databases.

  3. 3.

    Screening of papers. Several exclusion and inclusion criteria are defined in order to discard papers that are not related to the defined research questions.

  4. 4.

    Keywording using Abstracts. Reading the abstracts and looking for keywords that could characterize each paper. Then, the keywords are used to form higher-level groups.

  5. 5.

    Data Extraction and Mapping Process. The previous groups are used to answer the research questions via analysis and different visualizations. A frequency analysis will enable us to identify which topics have been exploited in the past, gaps in the literature, and future research directions.

As an additional step, we will briefly describe the secondary studies collected during the latter process, as this can act as a starting point for researchers that decide to take a certain research opportunity after analyzing the results of the SMS.

4.1 Definition of research questions

Since our study is focused on mental disorders and FE of EEG signals, we proposed the following research questions, that we will abbreviate as RQs:

  1. RQ1:

    Which mental disorders have been studied using ML, DL and FE of EEG signals?

  2. RQ2:

    Which FE techniques have been used before feeding the data into ML algorithms?

    1. RQ2.1:

      Which transformation techniques have been used?

    2. RQ2.2:

      Which feature extraction techniques have been used?

    3. RQ2.3:

      Which feature selection techniques have been used?

  3. RQ3:

    Which ML and DL techniques are used after applying FE techniques?

  1. RQ4:

    Which secondary studies have been done in the research field?

The first, second and third RQs have been selected according to the goal of this work, which is to gather all FE techniques used to mainly classify mental disorders, using AI models and EEG data. With RQ1 we want to extract the mental or neurological disorder each work focuses on. It can be only one disorder or more than one in each paper collected. RQ2 is the most important question of this work. We divided it into three different ones because several studies [90,91,92] consider that there is this number of steps (3) before feeding the AI model with data. RQ2.1 deals with the data transformation techniques, in which the data change their domain, from raw EEG signal to 2D images, graphs networks, Wavelets or another domain. With RQ2.2 we want to gather every feature extraction used to feed the AI models. It is essential to show all characteristics considered necessary to identify differences between control and brain disorder patients. RQ2.3 collects all the algorithms used to select which features best capture the necessary information to classify the chosen brain disorder. In RQ3 we want to gather all ML and DL techniques used as the last step to classify patients with control and brain disorders. It should be noted that we do not extract the accuracy achieved, nor the contributions and drawbacks of each study. This is mainly due to the fact that we have only read the abstract of each paper and we were seldom able to clearly find the contributions and drawbacks. There is also a reason not to collect the accuracy of each paper and it is because we consider it unfair to compare the performance of studies carried out with different databases. We have not collected the country or year of publication either, since the vast majority of the articles gathered did not provide us with this information. Lastly, RQ4 is introduced as an addition to the whole methodology, and it will be answered by carefully selecting the secondary works retrieved by the search conducted in the section below. We thought that this addition would be helpful for researchers that decide to take a certain research opportunity after reading the results of this SMS. We decided not to include these studies in the related works section because these papers are not sufficiently related to our work.

Table 2 Queries used for each database
Table 3 Inclusion and exclusion criteria defined to filter the obtained works

4.2 Conducted search

In order to find relevant works related to the defined RQs, the following terms were considered:

  • EEG

  • FE

  • Feature extraction

  • Feature selection

  • Statistical parameters

These terms were composed into the query: EEG AND (feature engineering OR feature extraction OR feature selection OR statistical parameters). The search was conducted in five popular public databases: Scopus, IEEE Xplore, Web Of Science, ACM Digital Library and ScienceDirect. The specific queries written on the syntax of each search engine are shown in Table 2.

It is important to remark that the EEG term was searched on the title, abstract and keywords, but the rest of the terms were only searched on the title and keywords. This is due to the fact that in some papers, DL is used directly on raw EEG signals, so they mention that there is no need for FE in the abstract. As no FE techniques are used, this will be an exclusion criterion, which will be defined in the next subsection.

4.3 Screening of papers

Once the search is conducted and the duplicated papers are removed, a set of inclusion and exclusion criteria is defined in order to refine the search results and only take the works that are relevant to the research questions of our study. The criteria defined to guide the filtering process are shown in Table 3.

We choose four inclusion criteria to gather the necessary papers with greater precision. With the aim of collecting as many FE techniques as possible, we have included conference papers (i2) as well as academic journal papers (i3). In addition, only primary studies were chosen (i1) due to the goal of this SMS. As we need quality papers, we added an English-written inclusion criterion (i4). It is worth noting that we have not added any criteria related to the date of publication, since we consider that there may be FE techniques, which were previously used without promising results, that now, with the ML and DL models, could be useful. We do not consider it necessary to discuss the three exclusion criteria (e1, e2, e3) because they are trivial.

As shown in Fig. 6, after removing duplicates, the inclusion criteria are applied in order to keep the papers that are relevant to our study and discard works such as book chapters, reviews, etc. This was done both automatically by using the search engines and manually by reading the titles and abstracts, to ensure that none of the selected papers bypassed the criteria. Then, the filtering process continues manually applying the different exclusion criteria by reading the title and abstract of the selected works. Once all the criteria have been applied, a total of 905 papers are left.

It is important to remark that, as previously stated, a total of 20 secondary studies were carefully retrieved in order to present them in a compilation after the whole SMS process. This will act as a starting point for researchers that decide to take a certain research direction. These 20 studies were retrieved out of the 72 studies that did not meet the inclusion criteria i1 (Fig. 6).

Fig. 6
figure 6

Diagram of the filtering pipeline via the exclusion and inclusion criteria. Created by authors with TikZ package from LaTeX

4.4 Keywording of full text

Once the screening of papers is done and the final batch of papers is collected, it is time to classify each paper via the process denominated Keywording. As presented in [87], keywording is a systematic process that ensures all papers are taken into account, while reducing the time needed in developing the classification scheme. Keywording is divided into two steps:

  1. 1.

    The authors read the abstracts and define keywords that characterize the work of each paper related to the different RQs. This will result in a highly granular set of keywords, but it will help the authors to fully identify the context of the research. To illustrate this process, let us use an example. Suppose that we find a paper that has in its abstract the following: Our goal is to distinguish ADHD from ASD subjects. To carry out this classification we use WT and extract nonlinear features to feed three AI models: SVM, k-NN and CNN-based model. In this case, we would extract seven keywords. ADHD and ASD inside RQ1, WT for RQ2.1, nonlinear features in RQ2.2 and SVM, k-NN and CNN-based models in RQ3, leaving RQ2.3 without keywords. If more details are needed, all the tags for each reviewed work, together with its title and authors are in the supplementary information.

  2. 2.

    The keywords obtained are combined to form higher-level categories that will be used to answer the RQs in the following subsection, except RQ4 which will be answered separately in the last subsection. In this step, we have gathered all keywords and grouped them into more general groups. These groups were made following our criteria and observing the number of labels of each category. It should be noted that we have taken some labels out of the general categories if they were significant, i.e., if they appeared a considerable number of times. As an example, we have made a group called Nonlinear features in which we add all papers that do not specify which nonlinear features have been used, but we made other groups in which there are nonlinear features as well, such as Chaotic features group or Complexity Measures. Therefore, we know there are labels that can be in more than one group. If more details are needed, the Appendix 1 contains all the categories we have created along with the labels that make up each of them.

After performing the first step, a total of 634 keywords were obtained from all RQs. Once the resulting keywords had been carefully analyzed, the opensource application OpenRefine [93] was used to combine them into general groups, resulting in 15, 14, 15, 8 and 14 categories associated to RQ1, RQ2.1, RQ2.2, RQ2.3 and RQ3 respectively. The categories defined for each RQ are briefly described below:

RQ1. ADHD is a neuropsychiatric disorder characterized by hyperactive-impulsive and/or inattentive behavior which has a high prevalence among young people, but can be carried onto adulthood [94]; Addictions are mental disorders characterized by the recurrent failure to control a compulsive behavior in order to obtain reward stimuli despite the negative consequences [95]; Anxiety is a feeling of worry and fear in a diffuse threat, which can be out of proportion and interfere with the daily lives of the affected [96]; ASD is a neurodevelopmental disorder characterized by deficits in social communication and the presence of restricted interests and repetitive behaviors [97]; Dementia is characterized by the deterioration of mental functioning in its cognitive, emotional and conative aspects [98]; Depressive Disorders are characterized by having a lowered mood, and the loss of interest and enjoyment during periods, among other symptoms [99]; Dyslexia occurs when an individual has significant difficulties with speed and accuracy of word decoding [100]; Epilepsy is a neurological disorder characterized by an enduring predisposition to generate epileptic seizures due to abnormal excessive neuronal activity in the brain [101]; Migraine is characterized by severe headache attacks, autonomic nervous system dysfunction, and in some patients, an aura involving neurological symptoms [102]; OCD is characterized by intrusive unwanted thoughts and/or images (obsessions) and ritualized repetitive behaviors (compulsions) [103]; PD is a neurodegenerative disorder that mainly affects motor function [104]; PTSD is a disorder that can be developed after a traumatic experience [105]; Schizophrenia is a severe psychiatric disorder mainly characterized by having delusions, hallucinations, psychotic episodes, marked alterations in cognition, and impaired functioning with high rates of disability [106]; Sleep Disorders include disorders related to sleep, namely Sleep Apnea [107], Insomnia [108] and Sleep Arousals [109]; Stress is a condition in which an individual is aroused and made anxious by an uncontrollable aversive challenge [110].

RQ2.1. Common Spatial Patterns (CSP) is a procedure used to decompose signals of two different groups into modes that are common to both groups and maximally suited to distinguish them [111]; Complex Networks refers to the methods used to transform the raw EEG signals into a graph, which can then be used to extract topological features [112]; Discrete Cosine Transform (DCT) is a frequency domain transformation which decomposes a signal into cosine waves with different frequencies [113]; FT represents a set of widely used techniques, which decompose a function depending on time into a set of functions that depend on frequency in order to obtain the so-called spectrum of frequencies [67]; Frequency Transformation comprises any transformation in the frequency domain that is not further specified; Hilbert-Huang Transform (HHT) enables us to obtain instantaneous frequency data, making this technique specially suitable for nonstationary and nonlinear data [114]; Linear Transformation comprises any linear transformation that is not further specified; Mel-Frequency Cepstrum is a transformation that can be described as a kind of “spectrum of a spectrum” [115]; Mode Decomposition is composed of all methods that decompose the signal without leaving the time domain, mainly Empirical Mode Decomposition (EMD) and variations [116]; Nonlinear transformations includes any nonlinear transformations that are not further specified; Short Time Fourier Transform (STFT) is a variant of the FT that decomposes the signal into time-frequency components, similar to the WT [117]; Time-Frequency Transformation includes transformations in the time-frequency domain that are not specified; similarly, Time Transformation comprises transformations in the time domain that are not specified; WT is similar to the FT so that it performs a decomposition into functions depending on frequency, but also on time, i.e. it gives local frequency information [118].

RQ2.2. Autoencoder (AE) [119] is a type of neural network used to extract features that efficiently represent the data. The AE is trained to map the input data to a smaller vector space, and then to reconstruct the original vector, i.e., it is trained to map the data into a smaller feature space while minimizing the loss of information. Therefore, it can be thought of as an automatic feature extraction technique; Autoregressive Model (AR) is used to describe signals with a set of parameters, and then those parameters can be used as features [120]; Chaotic Features refers to any variables that measure whether a system is chaotic or not, such as the Lyapunov exponents [121]; Complexity Measures can be used to estimate brain dynamics, which can be useful to study mental disorders by means of EEG signals. This group is composed of the Kolmogorov Complexity (KC), Lempel-Ziv Complexity (LZC), epsilon-Compexity [122] and fractal-related features, namely FD, Correlation Dimension (CD) and Line Length [123]; Correlation Measures can be used to measure the relation between two signals, in this case, two different channels. The main feature of this group is the coherence [124]; The Energy of a signal is defined as the area under its square magnitude. Nevertheless, energy is typically obtained via the summation of all frequency components of its spectra, thanks to the Parseval theorem [125]; Entropy can be used to measure the degree of randomness or unpredictability of the system, as well as being related to chaos. There exist many ways of computing entropy, such as Sample Entropy (SampEn), Approximate Entropy (ApEn), Fuzzy Entropy (FuzzyEn), etc. [126]; Frequency Domain Features comprises any feature computed in the frequency domain that is not further specified; Geometric Features are extracted from 2D representations of the signal, for example by applying WT. Most features are focused on texture, such as Local Binary Patterns [127] and their variations and Haralick Features [128]; Graph Features describe the topology of the graphs obtained from the raw EEGs such as the Degree of Centrality [129] or the Direct Transfer Function [130]; Nonlinear Features is composed by any nonlinear feature that is not included in any of the other groups, and/or it is not specified; Statistical Parameters is comprised by any statistical measure of the signal that is not included in any of the other groups, such as the variance, Hurst exponent [131], Hjorth parameters [132], etc; Tensor Decomposition refers to any feature extraction technique that uses data in tensor (or matrix) form such as singular value decomposition [133], Hermite decomposition [134] or Wishart distribution [135]; Finally, the groups Time Domain Features and Time-Frequency Domain Features enclose any features that belong to the aforementioned domains and are not included in any of the previous groups.

Fig. 7
figure 7

Frequency of the defined brain disorder categories. Created by authors with Matplotlib package from Python

RQ2.3. Clustering refers to the task of grouping similar data into high-level clusters. Once the clusters are formed, they can be used for feature selection by analyzing which features are most relevant to differentiate them [136]; Distance Based Feature Selection encloses all the methods that use the distance between data points to perform feature selection, for example, the Relief method [137]; Genetic Algorithm is referred to the group of feature selection metaheuristics based on the theory of evolution [138]; L1 Regularization is a regularization technique that can be used to perform feature selection, as it shrinks the coefficients of irrelevant features to zero [139]; Metaheuristics encloses all the feature selection metaheuristics that cannot be included in any of the presented groups. A metaheuristic is defined as a high-level procedure designed to solve complex optimization problems, in this case, a feature selection problem [140]; Principal Component Analysis (PCA) is a technique that consists of building successive uncorrelated variables called principal components that maximize variance[141]. It can be used either as a dimensionality reduction tool, by choosing the first k principal components, or as a feature selection tool. As the principal components will be linear combinations of the initial set of features, the coefficients (also called loadings) can be used to rank the importance of each feature; A Statistical Test is a procedure used to check whether there is enough evidence to reject a conjecture (called the null hypothesis) or not. This kind of tests can be used for feature selection by defining the null hypothesis as \(H0: \text {The feature} X \text { is relevant for the current model}\). If the null hypothesis is rejected by the test, then the feature is discarded; Swarm Intelligence refers to the set of optimization metaheuristics inspired by the behavior found in collective and decentralized biological systems [142].

Fig. 8
figure 8

Frequency of the defined groups for RQ2.1. Created by authors with Matplotlib package from Python

RQ3. DT arranges a set of basic decisions into a tree structure in order to make the final prediction. The advantage of this model is that it can be easily explained [143]; Ensemble refers to any model that combines a set of weak classifiers to get a better global solution, such as Random Forest [144] or XGBoost [145]; Clustering refers to the task of grouping similar data into high-level clusters. This group is composed of any use of clustering that is not feature selection. If clustering is used for feature selection, it is then included in RQ2.3; Naive Bayes is a family of classification algorithms based on applying Bayes’ theorem [146]; Fuzzy Classifier is referred to any classifier that is based on fuzzy logic. In fuzzy logic, as opposed to Boolean logic, values can be any real number between 0 and 1 [147]; Hidden Markov Model models the system as a Markov process with hidden states. A Markov Process is a stochastic model where the next state depends solely on the current state [148]; Logistic Regression is a popular classification model that provides the probability of belonging to each class, while also being easily explainable [149]; Linear Discriminant Analysis (LDA) tries to find the hyperplane that minimizes interclass variance while maximizing the distance between classes. This can be used as a classifier, or as a dimensionality reduction tool [150]; SVM is a robust classifier characterized by being highly effective on high dimensional data [84]; k-NN predicts the class of a data point by looking at the classes of the neighboring points [151]; MLP is the most basic form of neural network, composed of an input layer, an output layer, and at least one hidden layer between these two. Each layer is composed of neurons, and the weights of the connections between the neurons of consecutive layers are trained via backpropagation [152]; CNN is able to automatically extract spatial features via convolution operations, making them useful on signal or image data [153]; RNN encloses any neural network that has feedback loop connections, making them suitable to learn from sequential data [154]; Extreme Learning Machine (ELM) is a feedforward neural network whose nodes are randomly chosen and fixed, and its output weights are obtained analytically in a single step, resulting in a faster training process than in a conventional backpropagation neural network [155].

4.5 Data extraction and mapping of studies

Finally, the last step of the methodology is to extract insights from the obtained classification scheme in order to answer the RQs proposed in Sect. 4.1. First, a frequency analysis will be performed for the categories of each RQ. It is important to notice that the sum of the frequencies in the following analysis might not be equal to the totality of the papers, as there are papers labeled with zero or more than one label of the same RQ. Additionally, the code used to create Figures 7-10, 12-20 and the supplementary information are available on GtLab, so that the reader can access, verify, and extract new insights from our work.

RQ1. A total of 15 brain disorder categories are present in the collected works (Fig. 7). Epilepsy is by far the most predominant one, representing \(68.40\%\) (619 papers) of the total, followed by Sleep Disorders with \(8.73\%\) (79 papers). There are less studied categories, with 20-40 papers each, namely Depressive Disorders (\(3.87\%\), 35 papers), Addictions (\(3.31\%\), 30 papers), Dementia (\(3.31\%\), 30 papers), Schizophrenia (\(2.54\%\), 23 papers), ADHD (\(2.10\%\), 22 papers), ASD (\(2.10\%\), 19 papers). Finally, the least studied categories are Stress (\(1.88\%\), 13 papers), Dyslexia (5 papers), PD (5 papers), Migraine (4 papers), Anxiety (4 papers), PTSD (2 papers) and OCD (2 papers).

Fig. 9
figure 9

Frequency of the defined groups for RQ2.2. Created by authors with Matplotlib package from Python

RQ2.1. The frequency of the transformations can be seen in Fig. 8. The WT is the most used transformation by far, being used in roughly \(31.82\%\) of the papers (288 papers), followed by Mode Decomposition techniques (\(11.71\%\), 106 papers), FT (\(4.75\%\), 43 papers), Complex Networks (\(3.76\%\), 34 papers), HHT (\(2.21\%\), 20 papers), Nonlinear Transformations (\(1.44\%\), 13 papers), Linear Transformations (\(1.22\%\), 11 papers), CSP (\(1.10\%\), 10 papers), Frequency Transformation (\(0.88\%\), 8 papers), STFT (\(0.88\%\), 8 papers), Mel-frequency Cepstrum (\(0.55\%\), 5 papers), Time-frequency Transformation (\(0.55\%\), 5 papers), DCT (\(0.44\%\), 4 papers) and Time Transformation (\(0.11\%\), 1 paper). There are 417 studies (\(46.1\%\) of the works) where no transformations are studied. It is also interesting to highlight that, in some papers, more than one transformation is applied. Table 4 shows the four most frequent pairs of transformations applied: 22 papers study the use of Mode Decomposition together with WT, 13 study FT together with WT, 6 study the use of HHT and WT, and 4 papers study the FT and Mode Decomposition.

Table 4 Most frequent pairs of transformations used in a single paper

RQ2.2. By analyzing the frequency chart for RQ2.2 (Fig. 9), it can be seen that the most used features are Statistical Parameters (\(22.54\%\), 204 papers), Frequency Domain Features (\(18.01\%\), 163 papers) and Entropy (\(18.01\%\), 163 papers), followed by Time Domain Features (\(7.85\%\), 71 papers), Complexity Measures (\(7.07\%\), 64 papers), Energy (\(7.07\%\), 64 papers), Nonlinear Features (\(6.41\%\), 58 papers), Time-Frequency Domain Features (\(5.64\%\), 51 papers), Correlation Measures (\(4.86\%\), 44 papers), AR (\(3.31\%\), 30 papers), Chaotic Features (\(3.09\%\), 28 papers), Geometric Features (\(2.87\%\), 26 papers), AE Features (\(2.10\%\), 19 papers), Tensor Decomposition Features (\(2.10\%\), 19 papers) and Graph Features (\(1.66\%\), 15 papers). Out of the compiled works, 261 of them (\(28.8\%\)) do not study any feature extraction-related techniques.

RQ2.3. As for the frequency of feature selection techniques, it can be seen in Fig. 10 that PCA is the most studied one, being present in roughly 5% of the works (49 papers), followed by Genetic Algorithms (38 papers, \(4.20\%\)), Statistical Tests (29 papers, \(3.20\%\)), Swarm Intelligence (23 papers, \(2.54\%\)), Distance Based Feature Selection (15 papers, \(16.57\%\)), Clustering (12 papers, \(13.26\%\)), Metaheuristics (12 papers, \(13.26\%\)) and L1 Regularization (6 papers, \(6.63\%\)). A total of 735 studies (\(81.2\%\) of the works) do not take into account any feature selection techniques.

Fig. 10
figure 10

Frequency of the defined groups for RQ2.3. Created by authors with Matplotlib package from Python

It is also worth noting that a single study can use techniques corresponding to different sub-questions of the RQ2, i.e., it could use the WT, which is a transformation (RQ2.1), and a Genetic Algorithm, which is a feature selection technique (RQ2.3). It is worth mentioning that the same study may have different labels of the same RQ, e.g. if the study uses various feature extraction techniques such as Entropy, Energy and Nonlinear Features, then it will have these three labels associated to the RQ2.2. A Venn diagram (Fig. 11) enables us to easily visualize which papers study each sub-question of the RQ2: there are 321 papers which focus only on using and studying feature extraction-related techniques, 242 papers which study a combination of transformations and feature extraction, 172 which focus only on transformations, etc. It can also be seen that the total number of papers adds up to 905 works, as expected.

Fig. 11
figure 11

Venn diagram representing the number of papers that study each subquestion of RQ2. Created by authors with Matplotlib package from Python

RQ3. Regarding the techniques used in the works to classify brain disorders (Fig. 12), the SVM is the most applied technique, used in \(33.04\%\) of the works (299 papers), followed by the MLP (\(18.56\%\), 168 papers), k-NN (\(14.70\%\), 133 papers), Ensemble models (\(8.40\%\), 76 papers), CNN (5.52%, 50 papers), LDA (\(5.08\%\), 46 papers), Naive Bayes (\(3.65\%\), 33 papers), DT (\(3.31\%\), 30 papers), Logistic Regression (\(2.76\%\), 25 papers), RNN (\(2.65\%\), 24 papers), Fuzzy Classifier (\(2.21\%\), 20 papers), Clustering (\(1.99\%\), 18 papers), ELM (\(1.99\%\), 18 papers) and Hidden Markov Model (\(0.66\%\), 6 papers).

Fig. 12
figure 12

Frequency of the defined ML and DL categories. Created by authors with Matplotlib package from Python

Table 5 Three most common combinations in each bubble map (Figs. 13-20)

It is also interesting to extend the frequency analysis by relating different RQs via bubble maps, as proposed in [87]. This kind of visualization enables us to observe which categories have been already extensively studied, and where the possible research gaps for future directions are. It should be noted that there are two bubble plots missing: mapping between RQ2.1 and RQ2.3, and mapping between RQ2.2 and RQ2.3. This is due to the limited information they provide, as there are few articles in RQ2.3, in this manner, we reduce the length of this paper as well. In addition, the three most common combinations for each of the pairs of RQs studied are presented in Table 5.

Fig. 13
figure 13

Mapping between RQ1 and RQ2.1. Created by authors with Matplotlib package from Python

Fig. 14
figure 14

Mapping between RQ1 and RQ2.2. Created by authors with Matplotlib package from Python

Fig. 15
figure 15

Mapping between RQ1 and RQ2.3. Created by authors with Matplotlib package from Python

Fig. 16
figure 16

Mapping between RQ1 and RQ3. Created by authors with Matplotlib package from Python

Fig. 17
figure 17

Mapping between RQ2.1 and RQ2.2. Created by authors with Matplotlib package from Python

Fig. 18
figure 18

Mapping between RQ2.1 and RQ3. Created by authors with Matplotlib package from Python

Fig. 19
figure 19

Mapping between RQ2.2 and RQ3. Created by authors with Matplotlib package from Python

Fig. 20
figure 20

Mapping between RQ2.3 and RQ3. Created by authors with Matplotlib package from Python

In addition, we consider interesting to focus on brain disorders. We have also created a table (Table 6) showing the most used FE techniques for each disorder.

4.6 Research question 4

During the previous process, a considerable number of related secondary studies were collected. A brief summary of these works is presented in this section as an additional step of the methodology, as it can act as a starting point for researchers that decide to take a new research opportunity. This section will be structured as follows: first, we present secondary studies related to each phase of the RQ2 (transformation, feature extraction, feature selection), followed by a subdivision of brain disorders.

4.6.1 Signal transform related works

This paper [156] focuses on the applications of sparse representation in brain signal processing. In addition, it deals with Blind Source Separation (BSS), EEG inverse imaging, components extraction, feature selection, and classification. Then, all these techniques are applied to EEG and fMRI data.

[157] presents a comprehensive description of time-frequency and time-scale representations of non-stationary signals such as EEG. They conduct an in-depth review of the principles and design of time-frequency and time-scale methods. Then, time-frequency and time-scale features are presented before comparing the classification efficiency of each one.

[158] compares 19 studies focused on WT and EMD to transform the EEG signal for diagnosing Epilepsy. They gather the type of wavelet and EMD used, the performance obtained, the number of folds used for the cross-validation technique and classification techniques applied for each performed study.

4.6.2 Feature extraction related works

There are other studies which focus on feature extraction methodology. [159] presents FD features and composes a review with more than fifty papers that have applied FD and multi-fractal geometries to extract information from Electrocardiogram (ECG) and EEG signals, brain imaging, mammography and/or bone imaging.

Another paper that focuses on a single technique of feature extraction is [160]. They work with Pattern Recognition techniques to extract information from image features obtained from the transformed EEG signals. In addition, they make a deep description of the most used Pattern Recognition techniques.

4.6.3 Feature selection related works

The work from [161] is only focused on feature selection. They make a review of the most used techniques of feature selection, from 2015 to 2019, in the field of medicine. After all techniques are presented and described, they are applied to different types of data, like medical images (X-rays, CT scans, MRI, retinographies and ultrasound images), biometric signals (EEG, ECG and Electromyography (EMG)) and DNA microarray. Finally, they show an experimental study to compare those described feature selection techniques.

4.6.4 Classification techniques related works

[162] presents a review of the research on the automated diagnosis of 5 neurological disorders in the last 20 years using AI. Those disorders are Epilepsy, PD, AD, Multiple Sclerosis and Ischemic Brain Stroke. The reviewed papers work with physiological signals and images. In the review, they collect the methodology applied or features extracted, the classifier used, and the performance obtained. They make seven summaries of Computer Aided Diagnosis (CAD) systems for: Epilepsy using EEG (112 studies), PD using measurable indicators (9 studies), PD using brain images (8 studies), PD using physiological signals (38 studies), AD using MRI (37 studies), Ischemic Brain Stroke using MRI (23 studies) and Multiple Sclerosis using MRI (8 studies).

[163] elaborates a review of DL techniques used to detect epileptic seizures from intracranial electroencephalography (iEEG) or EEG data from humans or animals. At the beginning of this work, they summarize the most important characteristics of popular and available EEG databases for epileptic seizure detection. Then, they present and describe the most promising DL techniques and list several studies that apply those techniques, gathering the name of the DL-technique used, the number of layers, the type of final classifier, and the accuracy obtained. They reviewed 26 studies focused on 2D-CNN, 24 using 1D-CNN, 15 using RNN, 17 using AEs, 9 using convolutional recurrent neural network (CNN-RNN) and 5 using convolutional autoencoders (CNN-AEs). They also collect 8 studies that have used non-EEG-based data, like sMRI, fMRI and PET scans.

4.6.5 Brain disorders related works

In this subsection we show papers focused on the classification of patients with a mental disorder by means of EEG. Those papers review what other studies have done, gathering the signal transformations used, the features extracted and selected, and the classification techniques used. Below we present this subsection divided into the disorders that these works deal with:

Mild Cognitive Impairment. [164] (172 studies in the review) extract 234 studies from 172 reviewed works, which are related with AD and MCI. From each study, they collect the dataset used, the time of prediction, the data type (EEG, MEG or fMRI), the classification algorithm, the number of folds used in the cross-validation technique and some performance metrics like AUC, accuracy, sensitivity and specificity.

[165] composes a systematic review with 82 studies that are focused on Dementia. They present, in a visual manner, the distribution of papers according to the sampling frequency applied, number of study subjects, number of electrodes used, recording time, tools used to process the signal and classification techniques.

[166] uses brain imaging techniques applied to EEG, MEG, MRI and fMRI data to diagnose AD. They elaborate an interesting table with the advantages and disadvantages of some classification and artifacts removing algorithms.

Epilepsy. [167] makes an extended review of the most used signal transformations and feature extraction techniques to detect epileptic seizures and diagnose Epilepsy. In addition, two tables are elaborated with the most relevant data of the studies reviewed: name of features extracted, name of the classifier used and accuracy obtained. The first table contains 21 previous works for the automated detection of normal and epileptic classes. The second table has 17 previous works for the automated detection of normal, interictal and epileptic classes.

[168] describes briefly the most used features in the literature of EEG seizure detection dividing them into time domain, frequency domain and time-frequency domain. Previously, they had summarized 55 studies by writing down in a table the type of features extracted, the transformation signal method and the performance obtained according to the database used (CHB-MIT scalp EEG database and Bonn University database). Eventually, an experiment is performed using the best results of reviewed works.

[169] reviews 87 studies made between 2010 and 2020. This work gathers 58 studies using conventional feature extraction techniques together with ML classifiers. They also review 29 studies that use DL techniques, without handcrafted feature extraction.

[170] does a review divided in two, depending on the type of subjects taken to make the experiment. They take 36 studies that have worked with human subjects and 5 studies that have worked with animal subjects. This work collects information about the goal of the study, the database and the methodology used, the time of prediction and the performance obtained.

[171] offers a comprehensive review of signal processing techniques like preprocessing, feature extraction, feature selection and classification schemes for Epileptic Seizure Prediction (ESP). This work has a summary with recent ESP surveys, from 2016 to 2021 and it collects other works that use the feature selection methods exposed in the review. In addition, the manuscript includes some tables with detailed information about the architecture of artificial neural networks that other studies have used. Moreover, it offers interesting sections with trends and emerging classification techniques, as well as another section focused on the limitations and challenges of ESP.

Sleep disorders. Regarding sleep disorders, [172] focuses on sleep stages and presents several tables that contain information about many previous studies. Those tables collect the feature extraction and feature selection techniques, the classification method, the sleep stages and sleep disorders classified, the number of subjects, the database used, the channels chosen, and the performance obtained for each study. It also carries out a practical experiment by comparing several classifiers.

[173] collects from each study reviewed, the features extracted, the preictal time, the database used, the year the study was conducted, the number of patients, the recording type (iEEG or sclap), the sensitivity, the false prediction rate obtained and the statistical validation used.

4.6.6 Other related works

[174] makes a brief summary of 11 studies focused on diagnosing MDD, taking into account the applied methodology, the extracted features, the classification method used and the accuracy obtained from each one.

[175] does a brief review of studies that investigate how to classify Dyslexia. We could find the test group size and control group size, the age range of the subjects, the number, and the name of picked channels and the signal transformation used in each study reviewed.

5 Discussion

Epilepsy is by far the most studied neurological disorder by means of EEG (Fig. 7). In general, every FE technique taken into account in this work has been applied to the study of Epilepsy (Figs. 13, 14, 15) as well as every ML algorithm (Fig. 16). Compared to Epilepsy, the other brain disorders have been considerably understudied by means of EEG, so we strongly recommend considering them as potential research options. In our opinion, Epilepsy might be the most studied because (i) it affects a considerably large portion of the population, and quickly predicting a seizure could greatly prevent injuries (ii) a seizure is characterized by abnormal excessive activity in some regions of the brain [101], thus it might be easier to predict such periods of high activity than predicting other patterns related to other brain disorders such as ADHD, that could be more difficult to differentiate from the patterns of a neurotypical individual.

Regarding Epilepsy, even though it has been extensively studied, more work could be done applying less-studied FE techniques. For example, (i) by applying any other transformation besides the WT or Mode Decomposition, (ii) using less-studied features such as chaotic, complexity, tensor or graph-related features, and (iii) focusing more on feature selection techniques in general. However, we should be especially careful when choosing new research gaps in the bubble plots, as there may be combinations that are not studied simply because it does not make sense to combine them.

The WT is the most used because of its effectiveness in nonstationary signal analysis. The ability to perform time-frequency decomposition into Wavelet components facilitates the analysis of certain localized patterns that can occur in EEGs. This justifies its use over other transforms such as the FT, or the STFT. Despite its popularity, we can see in Fig. 13 that it has mostly been studied in relation to Epilepsy, so there are research gaps regarding other brain disorders, as it is not trivial to directly apply the same Wavelet analysis in every case. For example, the Daubechies-4 Wavelet is commonly used to detect epileptic spikes [176], but it may not be suitable to study another brain disorder. It is also common to apply several transformations together. For example, as presented in Table 4, there are 22 works where a Mode Decomposition technique is used to decompose the signal into different meaningful components, and then the WT is used to extract useful features in the time-frequency domain.

The most common features used in the models are Statistical Parameters, such as the mean or standard deviation of the amplitude or frequency. Features related to the Entropy, Energy, complexity or chaos are also frequently used because of the good results produced. In general, it seems to be a trend for trying to quantify the degree of disorder of certain regions of the brain. Moreover, features related to correlation and graph connectivity are used to assess how different parts of the brain are interacting.

Table 6 Most common techniques used with brain disorders. The number in brackets indicates how many papers this technique appears in

The SVM is a versatile classifier that is highly effective in high dimensional spaces, i.e., a high number of different features. Typically, the application of FE techniques such as WT or FT to EEG data yields a large number of features, justifying why it might be the main choice in almost every brain disorder (Figs. 12, 16). However, we found that there is an ascending trend of using DL in the latest years, which might be justified by the following advantages over ML:

  • Even though classical ML models can perform classification on temporal data via techniques such as moving window, using Hidden Markov Models or the use of FT and WT, they may not be able to fully process the sequential or temporal nature of EEG data, as most ML algorithms have no notion of order within their input features. On the other hand, there are DL models such as RNN, LSTM, or CNN that could exploit such sequential information [177].

  • DL models are able to automatically learn optimal representations of the input data [70], reducing the burden of manually engineering some of the input features. Despite this, FE is still important to DL, as carefully applying the correct set of FE techniques can considerably reduce the complexity of the model.

Thus, we suggest experimenting with different DL models while also taking into account FE techniques, in order to reduce its complexity. Specifically, we think that there is a lot to be done regarding feature selection techniques applied to DL. As we can see in Fig. 11, there are only 170 out of 905 studies that use any feature selection techniques. Moreover, Fig. 20 shows that only a small subset of the papers use feature selection techniques for DL models.

During the keywording process, we also tried to collect data related to the evaluation of the ML pipelines in order to present insights about the techniques and models that perform better for each brain disorder. However, it was difficult to make an objective comparison, mainly because of (i) differences in the evaluation process across the works and (ii) the lack of a public dataset of reference for some brain disorders. Regarding the evaluation process, we found that, in some works, inter-subject evaluation was performed. In other words, the data of the same subject was used for training and for evaluation of the algorithm, resulting in an artificially higher performance than the real one. Moreover, to be able to compare studies, it is essential to perform the evaluation in the same dataset. To the best of our knowledge, there are public datasets for only 6 out of the 15 brain disorders taken into consideration in this work (epilepsy, sleep disorder, depression, schizophrenia, ADHD and stress). Therefore, it is of crucial importance to create quality public datasets in order to correctly perform comparisons among works. Moreover, creating large enough datasets would greatly contribute to the use of DL architectures, as their performance typically increase with the amount of training data. In summary, we strongly suggest the creation of quality public datasets as a future research opportunity that will be extremely useful for the community.

It is also interesting to compare the frequency of studies of each brain disorder against its prevalence. [3] performed an estimation of the number of individuals in Europe suffering from any of the most 12-month prevalent brain disorders, which were the following: Anxiety Disorders (61.5m), Unipolar Depression (30.3m), Insomnia (29.1m), Somatoform Disorders (20.4m), Alcohol Dependence (14.6m), Sleep Apnea (12.5m), PTSD (7.7m), Dementia (6.3m), ADHD (3.3m) and Epilepsy (2.64m). By comparing the previous list with the frequency of the studies, it can clearly be seen that there is a lot of research to be done related to FE of EEG for the most prevalent brain disorders.

6 Conclusions and future works

In this section we present our findings and conclusions, then we discuss the limitations of this work and finally, we suggest future works based on our results.

6.1 Conclusions

The main goal of this SMS has been to provide a clear overview of the research area of FE of EEG applied to mental disorders. Four RQs were proposed in order to guide such an overview. A total of 6133 initial works were retrieved by searching in five well-known different databases (Scopus, Web of Science, IEEEXplore, ScienceDirect, ACM) through the synthesis of adequate queries. After applying the defined set of exclusion and inclusion criteria, the final set of relevant works resulted in 905 papers, with labels corresponding to each RQ. Finally, the labels were used to extract insights.

The results corresponding to each RQ are as follows:

  • A total of 15 brain disorder groups have been studied by using FE of EEG signals, being Epilepsy the most predominant by far, present in \(68.40\%\) of the studied papers.

  • 14 transformation groups, being WT the most applied in \(31.82\%\) of the papers.

  • 15 groups related to feature extraction, with Statistical Parameters being the most used in \(22.54\%\) of the works.

  • 8 feature selection techniques, PCA being the most used in \(5\%\) of the papers.

  • 14 ML and DL techniques, SVM being the most used, being present in \(33.04\%\) of the studies.

6.2 Limitations

As limitations of this work, we would like to remark that it is focused on FE techniques and it has ignored the preprocessing stage, i.e. eliminating noise, applying filters on the signal, removing artifacts such as blinking, heart rate, etc. Even though preprocessing is crucial, it is out of the scope of this paper as the main goal of the SMS is to provide an overview of the current FE techniques used on EEG data for the diagnosis of mental disorders, and how it is combined with different ML algorithms. Preprocessing should therefore be the goal of another SMS itself. It should be noted that we do not extract the accuracy achieved, nor the contributions and drawbacks of each study. This is mainly because we have only read the abstract of each paper, and we were seldom able to clearly find the contributions and drawbacks. There is also a reason not to collect the accuracy of each paper and it is because we consider it unfair to compare performances among studies carried out with different databases. In addition, we were not able to gather the databases used by reading only the abstract. Moreover, this SMS did not collect information about the year and country of publication, which could have been useful to visualize trends, because most of the papers lacked this data during its fetching and it was an unfeasible task to perform at hand.

6.3 Future work

The analysis was extended further by combining different RQs via bubble plots, which provided a clear visual overview of the research area, useful for researchers and practitioners in deciding on future research options. After carefully analyzing the previous results, insights and recommendations were proposed in the discussion section:

  • It is worth noting that there an are plenty of research gaps concerning an all brain disorders except Epilepsy.

  • As we can see in Fig. 11, feature an selection is by far the less studied technique and its an application is crucial to reduce the complexity of ML models as well as to provide explainability to DL models.

  • The SVM is the most used ML model in an almost every brain disorder since it is efficient when working with an high-dimensional spaces, but we recommend focussing on experimenting with models that can take advantage of the sequential nature of EEG data, such as RNN or CNN-based architectures.

We also discussed the difficulty of an objectively comparing ML pipelines, as the evaluation process differed among works. We suggest an standardizing the evaluation process an by selecting a standard set of metrics, using an the same dataset and ensuring that there is no data an of the same subject in both the training and test set, as this artificially an improves the metrics, thus giving a sense of fake an performance. This also shows the importance of having quality datasets available to the community, an and we strongly recommend the synthesis of quality datasets of EEG data for any mental disorder as a future research direction.

By analyzing the most prevalent brain disorders in Europe, it was discovered that little work, if any, was done in terms of FE of EEGs related to these disorders. Therefore, we would encourage others to keep on researching and advancing on these topics, namely Anxiety Disorders, Unipolar Depression, Insomnia, Somatoform Disorders, Alcohol Dependence, Sleep Apnea, PTSD, Dementia, and ADHD.

Given the current trends, we expect the use of DL architectures to predominate over traditional ML for the automated diagnosis of mental disorders in the near future, as they have more predictive capabilities and benefit from large datasets. As DL models require large amounts of data to be trained, collecting large-scale quality datasets of EEG recordings will be even more important. We also expect a rise in the use of techniques such as transfer learning [178], or domain adaptation [179] to leverage the performance of DL models. Additionally, we also expect future work on XAI techniques applied to EEG, as explainability is crucial for the application of DL into clinical diagnosis tools. FE will have a critical role in both aspects: FE will be required to fully take advantage of the future large-scale datasets, whereas FE could merge with XAI techniques to explain DL models in terms of features that are understandable to the domain expert.