1 Introduction

Globally, nearly 13% of children, 46% of adolescents, and 19% of adults struggle with mental illness each year [63]. Major depressive disorder (MDD) is a prevalent psychiatric illness worldwide, characterized by persistent low mood, anhedonia, thought inhibition, cognitive impairment, and even strong suicidal ideation [24]. The World Health Organization (WHO) estimates that approximately 322 million people worldwide live with depression [68]. Current clinical approaches to diagnosing depression have obvious drawbacks, including patient denial, poor sensitivity, subjective bias, and inaccuracy [62]. Studies have shown that the number of patients with MDD increased several-fold during the COVID-19 pandemic [9], highlighting the importance of detecting and managing depression and increasing the need for effective diagnostic tools.

Electroencephalography (EEG) is a powerful and well-recognized tool for recording brain activity [8]. In recent years, it has been widely used to study and diagnose various neurological conditions and applications, such as depression [30, 36], epilepsy [61], obsessive-compulsive disorder [43], seizure prediction [35, 48, 49, 66], Alzheimer’s disease [19], Creutzfeldt-Jakob disease [40], stroke [3], sleep analysis [54], Parkinson’s disease [58], schizophrenia [4], mood state analysis [65], and brain-computer interfaces (BCI) [39, 59]. Resting-state EEG signals largely avoid the interference generated when the brain responds to task instructions [18], which makes them of core research and application value for the automatic detection of depression.

Today, artificial intelligence techniques play a central role in almost all advanced systems [22, 26]. Traditional machine learning methods, such as Decision Trees [46], K-nearest Neighbors [50], Support Vector Machines [20], and Random Forests [23], have achieved good results in many fields. With improvements in computational performance and growth in data size, deep learning is widely used in fields such as image recognition [44, 57, 60], semantic segmentation [25], object detection [11], emotion recognition [15], and predictive analysis [34]. Convolutional neural networks (CNN) are at the core of the current best architectures for recognizing image and video data, mainly due to their ability to learn and extract feature representations that are robust to partial translation and deformation of the input [29]. Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM) have shown state-of-the-art performance in many applications involving time-series dynamics; [69] combined these two types of networks for video classification and achieved better results.

In this study, we propose a hybrid neural network for the automatic detection of depression from 128-channel resting-state EEG signals: a convolutional neural network performs temporal feature learning on windowed segments, and an LSTM architecture performs the sequence learning process. The general procedure proposed in this study for preprocessing and classifying 128-channel resting EEG data is shown in Fig. 1. The paper makes the following key contributions:

  • We developed a novel hybrid neural network model for automatic detection of depression with 128-channel resting EEG signals.

  • We show that the CNN-LSTM model has better depression detection results compared to the models trained by Decision Trees, K-nearest Neighbors, and Support Vector Machine on the same dataset.

  • We show that the classification of 128-channel EEG signals after simple data processing can reach up to 100% for a single participant, which has not only theoretical but also practical significance for automatic depression detection studies.

Fig. 1 General procedure of 128-channel resting EEG signals for depression prediction

The rest of the paper is structured as follows: Section 2 discusses related work; Section 3 describes the dataset used in the study and the data preprocessing process; Section 4 proposes a CNN-LSTM model for depression detection; Section 5 presents the experiments and compares and discusses the results against classical machine learning methods on the same dataset; Section 6 summarizes the conclusions and outlines future work.

2 Related work

During eye movements (saccades, blinks, etc.), the electrical field around the eye produces a signal called the electrooculogram (EOG). Facial muscle movements produce large-amplitude electrical signals called the electromyogram (EMG). EOG and EMG signals are often treated as noise or artifacts in the electroencephalogram (EEG). [13] reviews methods for dealing with ocular artifacts in EEG, focusing on the relative merits of various EOG correction procedures. [52] describes the basic concepts of wavelet analysis and related applications. [12] proposed a cascade of three adaptive filters based on the least mean square (LMS) algorithm to reduce common artifacts in EEG signals without removing the important information embedded in these recordings. The dataset used in this paper eliminates blink artifacts with an adaptive noise cancellation technique based on the LMS algorithm.
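As an illustration of LMS-based adaptive noise cancellation, the following sketch removes a reference artifact from a single EEG channel. This is a minimal single-filter version, not the three-filter cascade of [12]; the signal names and parameters are hypothetical.

```python
import numpy as np

def lms_cancel(primary, reference, order=4, mu=0.01):
    """Subtract the LMS-estimated artifact (correlated with `reference`,
    e.g. an EOG channel) from the `primary` EEG channel."""
    w = np.zeros(order)                  # adaptive filter weights
    cleaned = np.zeros_like(primary)
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]  # most recent reference samples
        y = w @ x                        # estimated artifact contribution
        e = primary[n] - y               # error signal = cleaned EEG sample
        w += 2 * mu * e * x              # LMS weight update
        cleaned[n] = e
    return cleaned
```

Because the update is driven only by the correlation between the error and the reference, the brain signal (uncorrelated with the reference) is preserved while the artifact is cancelled.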

Machine learning and deep learning methods have shown significant results in automated EEG-based depression detection. Table 1 summarizes relevant recent studies using EEG signals for depression screening. Traditional machine learning methods for EEG-based depression detection include Decision Tree [27], K-nearest Neighbors [10, 31], Bagged Tree [7], Logistic Regression [21], and Support Vector Machine [5, 33, 38, 42, 45]. Deep learning models that have shown excellent results in depression prediction include PNN [2, 17, 37], CNN [1, 32, 41, 53, 67], ANN [16, 47], and CNN-LSTM [6, 51, 56, 64]. [6] proposed a deep hybrid model combining convolutional neural network (CNN) and long short-term memory (LSTM) architectures, which reported 99.12% and 97.66% classification accuracy for the right and left hemispheres, respectively, on a 2-channel EEG signal. [64] combined CNN and LSTM for depression classification on 64-channel EEG signals; the accuracy for the left and right hemisphere signals was 99.07% and 98.84%, respectively. [51] applied a 1DCNN-LSTM for automatic depression detection on 19-channel EEG signals with an accuracy of 99.24%. [56] used a CNN for temporal learning with windowing and an LSTM architecture for the sequence learning process to screen for depression on 64-channel EEG signals from 21 depressed and 24 normal subjects, achieving 99.10% accuracy.

Table 1 Relevant studies presented in the literature using EEG signals for depression screening

In studies of EEG-based depression detection, most use 2-channel datasets [1, 5, 6, 17, 27, 47, 53] or 19-channel datasets [2, 16, 21, 37, 41, 42, 51, 67], while a few use 3-channel [10], 8-channel [33, 38], 16-channel [31], 64-channel [56, 64], or 128-channel datasets [32, 45]. [32] proposed a computer-aided detection (CAD) system using a convolutional neural network to study 128-channel EEG signals from 24 depressed patients and 24 healthy individuals and reported an accuracy of 85.62%. [45] collected and analyzed resting-state EEG data from 55 subjects, used a modified Kendall rank correlation coefficient and four classification algorithms, and found that a binary linear SVM classifier performed best, with a classification accuracy of 92.73% and an AUC of 0.98.

In this study, after simple processing of 128-channel resting EEG signals from 24 depressed patients and 24 normal participants, we performed 24-fold cross-validation experiments using SVM, K-nearest neighbors, decision tree, and 2DCNN-LSTM classifiers. On the same dataset, the prediction accuracy was 72.05% for SVM, 79.7% for KNN, 79.49% for the decision tree, and 95.1% for CNN-LSTM, showing that CNN-LSTM performs much better than the traditional machine learning methods.
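The leave-one-pair-out cross-validation protocol used for the classical baselines can be sketched with scikit-learn on synthetic stand-in data. The array shapes and names here are illustrative assumptions, and the file count per subject is reduced from 50 to keep the sketch fast.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy stand-in: 24 subject pairs (1 depressed + 1 normal), a few files each
n_pairs, files_per_subject, n_features = 24, 10, 16
X = rng.standard_normal((n_pairs * 2 * files_per_subject, n_features))
y = np.tile(np.repeat([1, 0], files_per_subject), n_pairs)       # 1 = depressed
groups = np.repeat(np.arange(n_pairs), 2 * files_per_subject)    # one group per pair

accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = SVC().fit(X[train_idx], y[train_idx])   # swap in KNN / DecisionTree likewise
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(len(accs))   # 24 folds, one held-out subject pair per fold
```

On random features the fold accuracies hover around chance; with real EEG features each fold scores the two held-out subjects' files.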

3 Dataset

The EEG signal used in this study was a multimodal open dataset (MODMA Dataset) provided by Lanzhou University for the analysis of mental disorders. Subject data were obtained with the approval of the local biomedical research ethics committee at the Second Hospital of Lanzhou University (Lanzhou, Gansu Province, China). Written informed consent was obtained from all subjects prior to the experiment. Participants included 24 depressed patients (13 males and 11 females; 16–56 years old) and 24 normal subjects (17 males and 7 females; 18–55 years old), and more information about the participants is listed in Table 2.

Table 2 Information about depressed patients and normal subjects

During the experiment, EEG data were collected from participants in the resting state (eyes open or closed) for 5 minutes using a 128-channel HCGSN (HydroCel Geodesic Sensor Net) EEG acquisition system, and the data were recorded with Net Station 4.5.4 software. The sampling frequency was 250 Hz throughout the acquisition, and electrode impedances were kept below 50 kΩ. The electrodes were placed according to the international 10–20 system [28], with Cz as the reference electrode; the placement of the 128 channels is shown in Fig. 2. A notch filter was used to suppress 50 Hz mains noise, and EEG signal samples before and after processing are shown in Fig. 3. Blink artifacts were removed using an adaptive noise cancellation technique based on the LMS algorithm.
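A 50 Hz notch filter of the kind described above might look as follows with SciPy; the quality factor and the test signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 250.0             # sampling frequency of the recordings (Hz)
f0, q = 50.0, 30.0     # mains frequency to suppress and notch quality factor
b, a = iirnotch(f0, q, fs)

t = np.arange(0, 2, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t)                 # 10 Hz "brain" rhythm
noisy = eeg + 0.5 * np.sin(2 * np.pi * 50 * t)   # plus 50 Hz mains interference
cleaned = filtfilt(b, a, noisy)                  # zero-phase notch filtering
```

`filtfilt` applies the filter forward and backward, so the 50 Hz component is suppressed without introducing phase distortion into the EEG.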

Fig. 2 Electrode placement for the international 10–20 system with 128 channels: (a) 2D view; (b) 3D view

Fig. 3 Power spectral density of raw and filtered EEG signals in normal and depressed subjects: (a) raw data, normal subjects; (b) filtered data, normal subjects; (c) raw data, depressed subjects; (d) filtered data, depressed subjects

After preprocessing the data of all subjects, 75,000 sampling points (300 s × 250 Hz) from each subject were selected as experimental data to ensure consistency across subjects. A 24-fold leave-one-out cross-validation was used: in each fold, the training data comprised 2200 files from 22 depressed and 22 normal subjects, each file containing 1500 sampling points. The validation data and test data each comprised 100 files from 1 depressed and 1 normal subject, each file containing 1500 sampling points. The experimental procedure of the 24-fold leave-one-out cross-validation is shown in Fig. 4.
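The per-subject segmentation into 50 files of 1500 sampling points can be sketched in NumPy; the channel-first array layout is an assumption.

```python
import numpy as np

FS = 250            # sampling frequency (Hz)
N_CHANNELS = 128
SECONDS = 300

# placeholder recording: in practice this is one subject's preprocessed EEG
recording = np.random.randn(N_CHANNELS, SECONDS * FS)   # shape (128, 75000)

# split the 75,000 samples into 50 non-overlapping files of 1500 samples (6 s each)
files = recording.reshape(N_CHANNELS, 50, 1500).transpose(1, 0, 2)
print(files.shape)   # (50, 128, 1500)
```

Each subject thus contributes 50 files; in one fold, 44 subjects' files form the training set and the two held-out subjects' 100 files form the validation and test sets.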

Fig. 4 Experimental procedure of the 24-fold leave-one-out cross-validation method

4 Methodology

We treat each data file as one subject sample (a 6-second resting-state EEG segment), divide it into 4 time sequences, and feed these into the model for training and analysis; the model finally outputs a prediction of “0” or “1” (“0” means the data belong to a non-depressed subject, “1” to a depressed subject). The EEG data are segmented with a window length of 1.5 seconds (375 sampling points); since the sampling rate is 250 samples per second, each EEG segment contains 128 channels and 375 sampling points. The 1.5-second window size was chosen by empirical evaluation of the proposed model, which showed that it provides the best results.
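Splitting one 6-second file into the four 1.5-second windows fed to the model might look like this (array layout assumed channel-first):

```python
import numpy as np

file_data = np.random.randn(128, 1500)   # one 6-second file: 128 channels × 1500 samples

# split into 4 consecutive 1.5-second windows of 375 samples each, giving the
# (time-step, channels, window-samples) sequence the model consumes
windows = file_data.reshape(128, 4, 375).transpose(1, 0, 2)
print(windows.shape)   # (4, 128, 375)
```

The CNN extracts features from each 375-sample window, and the LSTM then learns across the sequence of four windows.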

Figure 5 shows an overview of the 2DCNN-LSTM model we constructed. CNN models are not good at learning sequential information but are efficient at automatically extracting time-domain features, so we use a CNN to extract features from the input EEG signal. The earliest convolutional neural network is the LeNet model proposed by LeCun [14]. LeNet consists of five main parts: the input layer for input images, the convolutional and pooling layers for feature learning, the fully connected layer for integrating the features, and the final output layer of the model. A convolutional layer has three hyperparameters: the kernel size, the stride, and the padding. The kernel size determines the size of the feature-extraction window, the stride defines the distance the kernel moves across the input matrix, and padding offsets the size shrinkage caused by the convolution. Convolution is calculated as follows:

$$ {Y}_n=\sum \limits_{i=1}^M\left[\left({W}_n^i\ast {x}_i\right)+{b}_n\right] $$
(1)
Fig. 5 An overview of the proposed architecture

where xi denotes the input data on the ith channel, Wni denotes the nth convolution kernel on the ith channel, bn is the bias value, ∗ denotes the convolution operation, and Yn is the nth feature map obtained from the computation.
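A direct NumPy rendering of Eq. (1) for one feature map might look as follows; the bias is added once, as is conventional, and the kernel values are illustrative.

```python
import numpy as np

def conv_feature_map(x, W, b_n):
    """One feature map Y_n per Eq. (1): per-channel convolutions of the
    M-channel input x with kernels W_n^i, summed over channels, plus bias b_n."""
    M, L = x.shape            # number of channels, signal length
    k = W.shape[1]            # kernel length
    out = np.zeros(L - k + 1)
    for i in range(M):
        # 'valid' cross-correlation of channel i with its kernel
        out += np.correlate(x[i], W[i], mode="valid")
    return out + b_n

# tiny worked example: 2 channels of length 3, kernels of ones, bias 1
x = np.arange(6.0).reshape(2, 3)
print(conv_feature_map(x, np.ones((2, 2)), 1.0))   # [ 9. 13.]
```

Stacking N such feature maps (one set of kernels per n) yields the output of a convolutional layer.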

After each convolution, an activation function is applied to increase the nonlinearity of the CNN. In this paper, the Tanh function is used; as Eq. (2) shows, its output range is [−1, 1] and its output is zero-mean, which avoids the problem that the output mean of the Sigmoid function is not zero.

$$ \tanh (x)=\frac{2}{1+{e}^{-2x}}-1 $$
(2)
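A quick numerical check confirms that Eq. (2) coincides with the standard tanh and is zero-centred on [−1, 1]:

```python
import numpy as np

def tanh_eq2(x):
    # Eq. (2): tanh(x) = 2 / (1 + exp(-2x)) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.linspace(-5, 5, 101)
assert np.allclose(tanh_eq2(x), np.tanh(x))        # identical to the standard tanh
assert tanh_eq2(x).min() >= -1 and tanh_eq2(x).max() <= 1
print(tanh_eq2(0.0))   # 0.0 — zero-mean around the origin, unlike the sigmoid
```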

After expansion over time, a traditional recurrent neural network (RNN) is equivalent to a multi-layer feedforward network whose depth equals the length of the history; with too many layers, gradients vanish (or explode) and historical information is lost during training, so the history a traditional RNN can actually use is very limited. In 1997, Hochreiter and Schmidhuber proposed the long short-term memory (LSTM) network to solve the vanishing gradient problem of RNNs [55]. The LSTM, a variant of the RNN, is commonly used for sequence learning because it not only learns from training but also remembers what it has learned, using it to predict the next element in the sequence and feeding the output back to the network. The LSTM architecture can therefore learn temporal dependencies over the important features extracted by the CNN.

The LSTM network contains three kinds of gates for control - input gate, forget gate, and output gate. The basic unit of the LSTM model is shown in Fig. 6.

  (1) Forget gate: tanh is applied when producing the final state output, while the sigmoid function, whose output range is [0, 1], serves as the activation of the gate structure; it determines how much of the previous cell state St − 1 is retained. An ft value of 0 indicates complete forgetting and 1 indicates complete retention, calculated as follows:

Fig. 6 The basic unit of the LSTM model

$$ {f}_t=\sigma \left({W}_f\cdotp \left[{h}_{t-1},{x}_t\right]+{b}_f\right) $$
(3)
  (2) Input gate: the cell state St at the current moment is determined jointly by the previous cell state St − 1 and the candidate state gt at the current moment. ft and it act as the weight coefficients of St − 1 and gt, reflecting the forgetting and updating of information in the cell, calculated as follows:

$$ {i}_t=\sigma \left({W}_i\cdotp \left[{h}_{t-1},{x}_t\right]+{b}_i\right) $$
(4)
$$ {g}_t=\phi \left({W}_g\cdotp \left[{h}_{t-1},{x}_t\right]+{b}_g\right) $$
(5)
$$ {S}_t={f}_t\cdotp {S}_{t-1}+{i}_t\cdotp {g}_t $$
(6)
  (3) Output gate: the output gate determines how much of the current state St is passed to the current output value ht, calculated as follows:

$$ {o}_t=\sigma \left({W}_o\cdotp \left[{h}_{t-1},{x}_t\right]+{b}_o\right) $$
(7)
$$ {h}_t={o}_t\cdotp \tanh \left({S}_t\right) $$
(8)
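Eqs. (3)–(8) can be collected into a single LSTM step in NumPy; the weight shapes and the concatenation convention are the usual ones, assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, S_prev, params):
    """One LSTM time step implementing Eqs. (3)-(8); each weight matrix
    acts on the concatenation [h_{t-1}, x_t], and phi is tanh."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["Wf"] @ z + params["bf"])   # Eq. (3): forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])   # Eq. (4): input gate
    g_t = np.tanh(params["Wg"] @ z + params["bg"])   # Eq. (5): candidate state
    S_t = f_t * S_prev + i_t * g_t                   # Eq. (6): cell state update
    o_t = sigmoid(params["Wo"] @ z + params["bo"])   # Eq. (7): output gate
    h_t = o_t * np.tanh(S_t)                         # Eq. (8): hidden output
    return h_t, S_t

# tiny example: hidden size 3, input size 2, small random weights
rng = np.random.default_rng(0)
H, D = 3, 2
params = {k: 0.1 * rng.standard_normal((H, H + D)) for k in ("Wf", "Wi", "Wg", "Wo")}
params.update({k: np.zeros(H) for k in ("bf", "bi", "bg", "bo")})
h_t, S_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), params)
```

Because ft and it gate the old and new information additively in Eq. (6), gradients can flow through St across many steps, which is what avoids the vanishing-gradient problem of the plain RNN.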

Table 3 details each layer of the proposed CNN-LSTM model and the parameters associated with each layer. A batch size of 32 was used for training, with an early stopping criterion, and the best results were saved. Based on the experimental results, we set the convolution kernel size to 1 × 3 for the convolutional layers, with 32 filters in the first two convolutional layers and 16 in the last two.

$$ f(x)=\frac{1}{1+{e}^{-x}} $$
(9)
Table 3 Detailed information about the proposed CNN-LSTM deep model

The dropout layer randomly discards neuron activations during training, which significantly reduces overfitting. In this model, dropout is set to 0.5, the initial learning rate is 0.01, the Adam optimizer is used, binary cross-entropy is the loss function, and a sigmoid activation is used for binary classification. The sigmoid function maps its input to an output between 0 and 1: as Eq. (9) shows, the function approaches 0 as x approaches negative infinity and approaches 1 as x approaches positive infinity. The hyperparameters of the proposed CNN-LSTM model were chosen based on the best experimental results.
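The sigmoid output of Eq. (9) and the binary cross-entropy loss used for training can be sketched as follows:

```python
import numpy as np

def sigmoid(x):
    # Eq. (9): maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y_true, p):
    # training loss; p are the sigmoid outputs of the model
    eps = 1e-12                       # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# a sigmoid output above 0.5 is read as class "1" (depressed)
logit = 1.2
print(sigmoid(logit) > 0.5)   # True
```

The loss is minimal when the sigmoid output matches the 0/1 label, which is why this pairing is the standard choice for binary classification.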

5 Results and discussion

Figure 7 shows the training process of the proposed CNN-LSTM model during one cross-validation fold. As the iterations increase, the accuracy and AUC of the model on both the training and validation sets gradually increase while the loss gradually decreases, and the overall training time is very short.

Fig. 7 Performance for one fold using the EEG data

Figure 8 shows the performance of SVM, KNN, decision tree, and CNN-LSTM in the 24-fold leave-one-out cross-validation experiments, with details given in Table 4. The average accuracies of SVM, KNN, decision tree, and CNN-LSTM on the same dataset are 72.05%, 79.7%, 79.49%, and 95.1%, respectively, and the highest accuracies are 95.31%, 94.53%, 92.19%, and 100%, respectively; the classification performance of the proposed CNN-LSTM model is much higher than that of the traditional machine learning methods.

Fig. 8 Performance for each fold using the EEG data

The average AUC of the CNN-LSTM model is 0.9803; the lowest fold accuracy is 80.21% and the lowest AUC is 0.9044. Analyzing the results further: each fold classifies 2 subjects (50 × 2 = 100 test files), so the lowest accuracy of 80.21% means at most 20 incorrect predictions in any fold of the 24-fold cross-validation. Since each subject contributes 50 test files, if we aggregate a final result for each subject by voting over the 50 predictions, the proposed model classifies every subject correctly (100%).
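The per-subject majority vote described above can be sketched as follows (the threshold and 0/1 label encoding are as assumed above):

```python
import numpy as np

def subject_prediction(window_preds):
    """Majority vote over one subject's 50 per-file predictions (0 or 1)."""
    return int(np.sum(window_preds) > len(window_preds) / 2)

# even the worst observed fold (80.21% file-level accuracy) leaves a clear majority
preds = np.array([1] * 40 + [0] * 10)   # 40 of 50 files labelled "depressed"
print(subject_prediction(preds))        # 1
```

For the vote to flip a subject's label, more than 25 of that subject's 50 files would have to be misclassified, far more errors than the at most 20 observed per fold.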

On the same dataset, [32] proposed a computer-aided detection (CAD) system using a convolutional neural network (ConvNet) and obtained 85.62% accuracy with 24-fold cross-validation. [45] collected and analyzed 128-channel resting-state EEG data from 55 subjects and used SVM for prediction, obtaining 92.73% accuracy with an AUC of 0.98. The CNN-LSTM model proposed in this paper performs as follows: with each subject providing 6 seconds of EEG data, the average prediction accuracy is 95.1% with an AUC of 0.98; the highest fold accuracy is 100% with an AUC of 1.0, and the lowest is 80.21% with an AUC of 0.9044. When a subject provides 300 seconds of EEG data, the data are divided into 50 equal parts and input to the model to obtain 50 predictions; voting over these predictions yields a final prediction accuracy of 100% (Table 4).

Table 4 Performance for 24-fold using the EEG data

6 Conclusions

In this study, we propose a 2DCNN-LSTM classifier whose input requires no special processing by professionals, only simple filtering and removal of ocular artifacts from 128-channel resting EEG signals. The experiments showed that for depression detection from 6 seconds of EEG per participant, the average classification accuracy of CNN-LSTM was 95.1% with an AUC of 0.98, and for detection from 300 seconds of EEG per participant, the per-participant classification accuracy was 100%. These results outperform traditional machine learning methods on the same dataset and outperform the literature [32, 45] on similar datasets. Moreover, we believe that a model that performs depression detection without special data processing better meets practical needs, so the results of this paper have not only theoretical but also important practical significance for research on the automatic detection of depression.

In future work, we will try different network architectures on more datasets to gradually improve our 2DCNN-LSTM model, and we will investigate other features of EEG signals as well as other effective processing and classification methods for depression EEG signals on larger datasets.