1 Introduction

Electroencephalography (EEG) is a key tool in the diagnosis and treatment of brain disorders in the medical field, and the analysis of EEG signals is the most prevalent approach to extracting knowledge of brain dynamics. Visual inspection is not an appropriate method for accurate and reliable diagnosis and interpretation of such complex EEG data, as it is time-consuming, burdensome, dependent on expensive human resources, and prone to error and bias. Therefore, extensive research is needed to develop methods for automatically diagnosing brain disorders. Computer-aided diagnosis (CAD) systems have been used to perform automatic neurophysiological assessment for detecting abnormalities from EEG signal data. A CAD system comprises four primary components: data preprocessing, feature extraction, dimensionality reduction, and classification. Each of these components is illustrated in Fig. 1 and briefly described below.

Fig. 1

Workflow diagram of the comparative study

  • Data preprocessing: This component involves refining input data by eliminating noise and inconsistencies, enhancing data quality through techniques like denoising and normalization, and integrating diverse data sources to enable comprehensive analysis.

  • Feature extraction: This entails identifying and representing relevant characteristics within preprocessed EEG data. These characteristics can include frequency components, statistical measures, or other patterns that are essential for distinguishing different brain activities or conditions. Feature extraction plays a crucial role in preparing EEG data for effective classification algorithms and aiding in tasks like detecting seizures, sleep stages, or cognitive states.

  • Dimensionality reduction: This refers to techniques that reduce the number of EEG features while preserving essential information. This helps simplify analysis, improves model efficiency, and prevents overfitting, making it easier to classify brain activities or disorders accurately. It mainly includes two techniques (1) feature selection which focuses on choosing a subset of the most informative features from the original set while discarding less relevant ones. (2) feature reduction which focuses on transforming the original EEG features into a lower-dimensional representation, often using techniques like principal component analysis (PCA) or independent component analysis (ICA).

  • Classification: It refers to a machine learning task in which a model is trained to assign labels to data by learning from its inherent characteristics. Once trained, the model can leverage this acquired knowledge to label new, previously unlabelled data.

Of these components, feature extraction, selection, and classification are essential, while the others are optional.

For pre-processing, which is primarily responsible for suppressing noise in the recorded signals, several techniques have been used by researchers over the past few years. For a thorough analysis of methods employed for removing noise from EEG signals, readers may refer to (Jiang et al. 2019). The second and most crucial component of any CAD system is feature extraction, for which a wide range of methods (time-domain, frequency-domain, and time-frequency-domain methods) have been reported in the literature for classifying numerous brain disorders from EEG signals. Fourier transform (FT) based techniques were developed for frequency-domain analysis of EEG signals, but they do not provide time-domain information. Autoregressive (AR) methods are computationally efficient, but they have limitations that prevent their use in practical CAD systems. Thus, time-frequency signal-processing algorithms such as the discrete wavelet transform (Sharmila and Geethanjali 2016), tunable-Q wavelet transform (TQWT) (Anuragi and Sisodia 2017), empirical mode decomposition (Thilagaraj and Rajasekaran 2019), empirical wavelet transform (Anuragi and Sisodia 2020), and Fourier-Bessel series expansion-based empirical wavelet transform (FBSE-EWT) (Bhattacharyya et al. 2018) have been used to extract time- and frequency-domain features for analyzing EEG signals associated with several brain disorders. Some commonly used non-linear features extracted from decomposed EEG signals are Hjorth parameters (Mert and Akan 2018), approximate entropy (Krishnan et al. 2020), Renyi entropy (Sharma et al. 2014), line length entropy (Esteller et al. 2001), and norm entropy (Anuragi et al. 2020). The standard approach involves extracting hand-crafted features from a large number of sub-band signals obtained by advanced wavelet methods from multi-channel EEG recordings. This results in a high-dimensional feature vector.

1.1 Dimension reduction

Dealing with high-dimensional data can pose various challenges for AI-based machine learning (ML) models, affecting their ability to accurately classify, recognize patterns, and visualize information. Consequently, this paper emphasizes the importance of dimension reduction.

  • Significance of dimension reduction: Learning with high-dimensional features can become difficult due to high computational complexity. The curse of dimensionality refers to the fact that, if the amount of data used to train a model is fixed, increasing the dimensionality can lead to overfitting, which lowers classification success rates. In principle, this issue can be resolved by collecting exponentially more data for each additional dimension, but this is rarely possible; hence, feature dimensionality reduction has been adopted by many researchers.

  • Feature projection versus selection: Dimensionality reduction involves two approaches: feature selection and feature reduction. Feature selection refers to picking out the most relevant aspects of the original signal, while feature reduction transforms the EEG data into a lower-dimensional representation (Li et al. 2017).

This review study exclusively focuses on reduction techniques because of the prevalence of highly correlated features in EEG signals, stemming from the intricate nature of brain activity. Feature reduction methods play a pivotal role in this context as they not only mitigate multicollinearity caused by numerous electrodes but also excel at noise filtration, amplifying the importance of key EEG components. Ultimately, this emphasis on feature reduction is driven by its potential to significantly boost the accuracy of classification models. Numerous researchers have shifted their attention towards dimensionality reduction, as illustrated in the overview of the relevant literature.

In the studies (Razzak et al. 2019; Sadiq et al. 2019; Peng et al. 2021; Peng et al. 2020), several feature combinations were evaluated for classification enhancement by reducing the dimension of a large feature matrix. Furthermore, a number of dimension-reduction techniques (Van Der Maaten et al. 2009), including feature selection and feature transformation (projection), have been used to select the most effective features for classifying EEG signals. In the study (Zhang et al. 2019), the authors examined six dimensionality reduction algorithms, namely ICA, isometric feature mapping (ISOMAP), PCA, kernel PCA (K-PCA), locally linear embedding (LLE), and Laplacian eigenmaps (LE), to reduce the dimension of the features. These features were then evaluated with least squares support vector machine (LS-SVM) classifiers. The findings demonstrated that, compared to the other methods, ICA gave the highest classification accuracy for classifying epilepsy-seizure EEG signals. In (You et al. 2020), the authors presented work on motor imagery classification using the flexible analytic wavelet transform (FAWT) method, where time-frequency features were extracted, reduced using multidimensional scaling (MDS), PCA, K-PCA, LLE, and LE, and then classified using a linear discriminant analysis (LDA) classifier; the MDS reduction method achieved the highest classification accuracy of 94.29%. For EEG-based focal detection, Raghu and Sriraam (2018) extracted 23 sets of time, frequency, statistical, and time-frequency features, which were further reduced using neighborhood component analysis (NCA) and classified with a support vector machine (SVM), achieving the highest classification accuracy of 96.1%. To reduce the features derived from the bispectrum of focal and non-focal EEG signals, Sharma et al. (2019) utilized the locality-sensitive discriminant analysis (LSDA) data reduction technique. These features were then passed to SVM classifiers for performance evaluation and achieved 96.2% accuracy. In the study (Jiang et al. 2022), the authors employed a convolutional autoencoder (CAE) for deep feature extraction and dimensionality reduction for children's focal epilepsy EEG classification. Akbari et al. (2021) used the forward selection algorithm (FSA) to reduce the geometrical features derived from phase space dynamic (PSD) analysis for the classification of schizophrenia EEG signals; the k-NN classifier attained the highest classification accuracy of 94.80%. In another study, Prabhakar et al. (2020) extracted nine different non-linear features, which resulted in high-dimensional features. Four optimization algorithms, namely artificial flora (AF) optimization, glowworm search (GS) optimization, black hole (BH) optimization, and monkey search (MS) optimization, were used to determine an optimal number of features to improve SVM classifier performance, resulting in a maximum accuracy of 92.17%. Mumtaz et al. (2016) computed absolute power (AP) and relative power (RP) features from multi-channel alcoholic and non-alcoholic EEG signals using a fast Fourier transform (FFT), yielding 133 high-dimensional features. PCA was used to retain the most significant features, which were then classified using the logistic model trees classifier, achieving a maximum accuracy of 96%. Patidar et al. (2017) proposed a framework for alcoholic EEG signal classification in which centered correntropy features were extracted using the TQWT method, and the dimension was reduced using PCA before being fed to LS-SVM classifiers for performance evaluation, achieving the highest classification accuracy of 97.02%. A framework for EEG-based major depressive disorder (MDD) classification was developed by Saeedi et al. (2020), which uses a genetic algorithm (GA) to select significant features from feature vectors produced by sample entropy and approximate entropy on decomposed signals derived from wavelet packet coefficients. The proposed framework achieved 94.28% accuracy using enhanced k-NN. The authors in (Mahato and Paul 2019) investigated linear and non-linear features for classifying MDD EEG signals; after both feature types were combined, PCA was used to effectively reduce the feature dimension. The radial basis function network (RBFN) classifier was then applied to the reduced features and achieved the highest accuracy of 93.33%. Another study on MDD classification was carried out by Raghavendra et al. (2023), where high-dimensional features were extracted using a continuous wavelet transform (CWT). Thereafter, K-PCA and PCA techniques were employed to reduce the dimension, and the resulting features were evaluated using an SVM classifier. The highest classification accuracy was 99.33% with the K-PCA technique. The authors in (Ray et al. 2021) reviewed various feature reduction techniques for high-dimensional data analysis. Face recognition is another area where dimension reduction techniques have been examined (Gu et al. 2016).

1.2 Major contributions

The majority of the methods for classifying brain disorders from EEG signals discussed above focus on feature extraction using various wavelet transform techniques. All of these methods increase computational complexity, and because of the high dimensionality of the resulting features, the extracted features are rarely chosen carefully. As the dimension of the features increases, so does the classification complexity. More work on dimensionality reduction is needed to overcome these drawbacks and achieve better classification results. A few authors have attempted to address this issue, but most of them used traditional PCA or K-PCA (Mumtaz et al. 2016; Raghavendra et al. 2023; Zhang et al. 2019). Furthermore, most studies in the literature investigated reduction techniques on only one EEG dataset. Thus, in this study, we conducted an extensive empirical review of the effectiveness of 23 different projection techniques on five different EEG signal-based brain disorder datasets (shown in Fig. 1). The objective of this study is to help end users select the most appropriate feature projection technique for their application.

The key findings and contributions of this review study are summarized as follows:

  • To the best of our knowledge, this article presents the first empirical review of existing reduction techniques employed on diverse EEG datasets, thereby enhancing the versatility of this study.

  • A comprehensive review of 23 individual and combinational projection techniques for high-dimensionality features derived from EEG is conducted here.

  • These techniques are evaluated using three performance metrics: average classification accuracy, the number of reduced features, and the dimensionality reduction rate (DRR).

  • The key findings of the empirical review are discussed and summarized in the form of tables and plots.

  • The study recommends PCA+t-SNE, which outperforms the other studied techniques in mitigating the curse of dimensionality of classifiers across the EEG datasets considered.

The paper’s remaining sections are structured in the following manner: Sect. 1 provides an initial overview of the necessary dimensionality reduction. Section 2 offers a brief introduction to a variety of EEG datasets that have been considered. Detailed descriptions of the linear and non-linear projection techniques, the classifiers employed, and the performance evaluation metrics can be found in Sect. 3 and Sect. 4, respectively. Section 5 presents a comprehensive analysis of the results and discussions regarding the performance of each projection technique. Finally, Sect. 6 concludes the paper.

2 Experimental datasets

As depicted in Fig. 1, the first step of the study concerns the datasets on which the projection techniques are assessed. In this study, three traits are taken into consideration: the type of observations, the dimensionality, and the intrinsic dimension ratio. These traits follow previous surveys and papers on dimensionality reduction. All the traits are described below.

  • # Observation (N): The variable N represents the number of observations in the dataset. Three distinct categories are established based on N:

    • Small: When N is less than 1000 in the dataset.

    • Medium: When N is greater than 1000 and less than or equal to 4000 in the dataset.

    • Large: When N is greater than 4000 in the dataset.

    These ranges correspond to the dataset sizes that are commonly used in research papers to evaluate projection methods.

  • # Dimensionality (d): The variable d represents the dimensionality of the dataset, indicating the number of features. In this study, three distinct categories are established:

    • Low: When d is less than 100 in the dataset.

    • Medium: When d is greater than 100 and less than or equal to 500 in the dataset.

    • High: When d is greater than 500 in the dataset.

  • # Intrinsic dimension ratio (\(\sigma _{d}\)): It is the ratio of the number of principal components required by PCA to explain 95% of the data variance to the total number of features (d). This ratio ranges between 0 and 1. Higher values typically indicate that a projection will have difficulty mapping the data. Three ranges are defined here (a minimal sketch of computing this ratio follows the list):

    • Low: When \(\sigma _{d}\) is less than 0.1.

    • Medium: When \(\sigma _{d}\) is greater than 0.1 and less than or equal to 0.5.

    • High: When \(\sigma _{d}\) is greater than 0.5 and less than or equal to 1.
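The following minimal Python sketch (an illustration, not the study's code) shows how \(\sigma _{d}\) can be computed with scikit-learn's PCA on a placeholder \(N\times d\) feature matrix `D`; the 0.95 variance threshold follows the definition above, and the data and sizes are hypothetical.

```python
# Sketch: intrinsic dimension ratio sigma_d = (#PCs explaining 95% variance) / d
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
D = rng.standard_normal((1200, 300))      # placeholder: N=1200 observations, d=300 features

pca = PCA(n_components=0.95)              # keep components explaining 95% of the variance
pca.fit(D)
sigma_d = pca.n_components_ / D.shape[1]  # ratio in [0, 1]
print(f"sigma_d = {sigma_d:.2f}")         # e.g. "medium" if 0.1 < sigma_d <= 0.5
```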

Table 1 Description of the five considered datasets and their pre-processing approaches
Table 2 Statistics of the binary-class classification datasets considered for the study after pre-processing

In this study, we have chosen a set of five EEG datasets on which we have already performed pre-processing and feature extraction, obtaining high-dimensional feature vectors that suffer from the dimensionality problem (Anuragi 2023). A description of the five datasets is provided in Table 1. For analyzing EEG signals in the majority of the datasets, the FBSE-EWT (Bhattacharyya et al. 2018) method was used. For schizophrenia EEG signal classification, a multivariate FBSE-EWT (MFBSE-EWT) method was developed as an extension of the FBSE-EWT method to multiple channels. The TQWT method was used in depression detection to decompose EEG signals into sub-band signals. After decomposing the EEG signals, various features were computed to achieve the highest classifier performance. The features computed from each dataset are shown in Table 1. The dimensionality and number of observations of the obtained features, along with their types, are depicted in Table 2 for each dataset.

3 Projection techniques

The process of dimension reduction (DR) involves projecting high-dimensional data into low-dimensional data. Different projection techniques are now frequently used in numerous applications due to the increase in the dimensionality of the features. DR techniques preserve as much of the original information of the data as possible while projecting the original high-dimensional data into a new low-dimensional dataset.

The problem of the "curse of dimensionality" can be mitigated by the new low-dimensional representation of the original dataset. Table 3 lists the twenty-three projection techniques reviewed in this article, together with their linearity type, learning type, neighborhood, computational complexity, tuning-parameter requirements, and topology. Here, the linearity type is grouped into linear and non-linear. Linear techniques are simple to implement and understand, but they cannot capture sample distributions spread over complex manifolds in the original high-dimensional space, whereas non-linear projections perform better on such datasets but their parameters are more difficult to control. Similarly, the learning type is grouped into supervised and unsupervised: supervised techniques use the label information, while unsupervised techniques project the original data without labels. Projection techniques aim to preserve one of two types of neighborhood structure, local or global; the categorization of the 23 techniques based on neighborhood is given in Table 3. Local neighborhood methods attempt to retain the distances between points and their neighbors; this may help distinguish clusters, but the distances between clusters in the projected space become less meaningful (Van der Maaten and Hinton 2008). Global methods, by contrast, try to maintain the pairwise distances between all points, which may yield more accurate projections of the high-dimensional space but exhibit less effective cluster separation (Frey and Pimentel 1978). The computational complexity of each technique, in \(\mathcal {O}(\cdot )\) notation as a function of N and d, is also shown in Table 3. Low-complexity techniques work best for interactive visual exploration, but they may struggle to produce reliable results.

Here is a concise explanation of the diverse DR projection methods explored in this study. They are primarily categorized into two types: linear and non-linear. Each type is described in detail in the following section:

3.1 Linear types

3.1.1 PCA

Pearson (1901) introduced PCA, and Hotelling (1933) developed it further. This research investigates various forms of PCA, including randomized PCA (Feng et al. 2018), sparse PCA (Zou et al. 2006), incremental PCA (Ross and Lim 2008), and kernel PCA (Schölkopf et al. 2005), offering concise descriptions of each. They are all unsupervised learning techniques that use orthogonal transformations to obtain a new set of uncorrelated variables. The basic algorithmic steps involved in PCA are demonstrated in Algorithm 1.

  • Randomized PCA: It is an approximation of traditional PCA that employs random sampling techniques to select a subset of data, making it computationally more efficient for large datasets while still providing a close approximation of the principal components and their variances.

  • Sparse PCA: It is an extension of traditional PCA that enforces sparsity in the loading coefficients, encouraging a solution where most coefficients are zero. This promotes a more interpretable and concise representation of the data’s principal components.

  • Incremental PCA: This is employed for processing EEG data with large dimensionality by handling it in batches, facilitating the analysis of high-dimensional EEG datasets that might not fit into memory at once. This makes incremental PCA suitable for managing the complexities of EEG data in real-time or streaming applications.

Algorithm 1: PCA

Step 1:

Get the data D of \(N\times d\) matrix

Step 2:

Compute the covariance matrix

Step 3:

Compute the eigenvectors and eigenvalues of the covariance matrix

Step 4:

Choosing principal components and forming a feature vector

Step 5:

Deriving the new projected data \(D_{n}\)
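As a complement to Algorithm 1, the following minimal Python sketch (an illustration, not the study's code) applies scikit-learn's PCA and the variants named above to a placeholder \(N\times d\) feature matrix; the choice of 10 components is arbitrary.

```python
# Sketch: classical, randomized, sparse, and incremental PCA on placeholder data
import numpy as np
from sklearn.decomposition import PCA, SparsePCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 200))                  # placeholder N x d feature matrix
D = StandardScaler().fit_transform(D)                 # center/scale before the covariance step

Dn_pca = PCA(n_components=10).fit_transform(D)                              # Algorithm 1
Dn_rand = PCA(n_components=10, svd_solver="randomized",
              random_state=0).fit_transform(D)                              # randomized PCA
Dn_sparse = SparsePCA(n_components=10, random_state=0).fit_transform(D)     # sparse loadings
Dn_inc = IncrementalPCA(n_components=10, batch_size=200).fit_transform(D)   # batch-wise PCA
print(Dn_pca.shape, Dn_rand.shape, Dn_sparse.shape, Dn_inc.shape)           # each (1000, 10)
```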

3.1.2 LDA

In this study, LDA is the only supervised learning method examined. It captures global geometrical information while ignoring the geometrical variation of local data points within the same class (Balakrishnama and Ganapathiraju 1998). The basic goal of LDA is to determine a linear transformation matrix that maps high-dimensional data to low-dimensional data by following the fundamental steps in Algorithm 2.

Table 3 Conceptual comparison of feature projection algorithms

Algorithm 2: LDA

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Compute each class’s mean vector

Step 3:

Compute the total mean vector

Step 4:

Compute within-class scatter

Step 5:

Compute between-class scatter

Step 6:

Compute eigenvectors with corresponding eigenvalues sorted in non-increasing (descending) order.

Step 7:

Deriving the new projected data \(D_n\)
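A minimal Python sketch of the supervised LDA projection with scikit-learn is shown below; the placeholder data, labels, and the single output dimension (LDA yields at most C-1 components for C classes) are illustrative assumptions.

```python
# Sketch: supervised LDA projection for a binary-class placeholder dataset
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
D = rng.standard_normal((600, 120))        # placeholder N x d feature matrix
y = rng.integers(0, 2, size=600)           # placeholder binary labels (N x 1)

lda = LinearDiscriminantAnalysis(n_components=1)  # scatter matrices solved internally (Steps 2-6)
Dn = lda.fit_transform(D, y)                      # Step 7: projected data
print(Dn.shape)                                   # (600, 1)
```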

3.1.3 NMF

In 1994, Paatero and Tapper introduced NMF, an unsupervised linear reduction technique, which was later popularised by Lee and Seung (2000). NMF factorizes a non-negative data matrix into the product of two lower-rank matrices W and H such that their product approximates the original data. The values of both matrices are then updated iteratively, with t as the iteration index, so that their product gets closer to the original data. The technique preserves the data's structure and ensures that the weights and basis vectors are non-negative. NMF stops after a fixed number of iterations or when the approximation error converges. The basic algorithm of NMF is illustrated in Algorithm 3.

Algorithm 3: NMF

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Initialize two non-negative factor matrix

Step 3:

Update \(W\left( t\right) =update(D,H\left( t-1\right) ,W(t-1))\)

Step 4:

Update \({H(t)}^T=update(D^T,\ {W\left( t\right) }^T,{H(t-1)}^T)\)

Step 5:

Repeat Steps 3 and 4 until the stopping criterion is reached.
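A minimal Python sketch of NMF with scikit-learn follows; since NMF requires non-negative input, the placeholder matrix is made non-negative here with min-max scaling, and the factorization rank is an illustrative choice.

```python
# Sketch: non-negative matrix factorization D ≈ W @ H on placeholder data
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
D = MinMaxScaler().fit_transform(rng.standard_normal((500, 150)))  # non-negative N x d matrix

nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(D)       # N x r factor: the reduced representation
H = nmf.components_            # r x d factor, updated iteratively until convergence
print(W.shape, H.shape)        # (500, 10) (10, 150)
```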

3.1.4 F-ICA

F-ICA, also referred to as FastICA, is a linear unsupervised reduction technique introduced in (Hyvarinen 1999). Like the majority of ICA algorithms, F-ICA uses a fixed-point iteration scheme that orthogonally rotates the whitened data to maximize the non-Gaussianity of the rotated components. Empirical tests have demonstrated that the F-ICA fixed-point iteration scheme is 10–100 times faster than gradient-based ICA and does not require choosing a step size. Formally, the algorithmic process is described in Algorithm 4.

Algorithm 4: F-ICA

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Center and whiten the matrix D

Step 3:

Initializing weight vector randomly

Step 4:

Updating weight vector

Step 5:

Normalize the updated weight vector (repeat Steps 4 and 5 until the weight vector converges)
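A minimal Python sketch of FastICA with scikit-learn is given below; whitening and the fixed-point iterations of Algorithm 4 are handled internally, and the number of components is an illustrative choice applied to placeholder data.

```python
# Sketch: FastICA extracting independent components as reduced features
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
D = rng.standard_normal((800, 100))       # placeholder N x d feature matrix

ica = FastICA(n_components=10, max_iter=500, random_state=0)
Dn = ica.fit_transform(D)                 # estimated independent components
print(Dn.shape)                           # (800, 10)
```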

3.1.5 FA

FA (Lee and Seung 2000), an unsupervised linear reduction technique, aims to model the observed variables and their covariance matrix in terms of a smaller set of underlying factors. The fundamental steps involved in FA are illustrated in Algorithm 5.

Algorithm 5: FA

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Construct the initial matrix

Step 3:

Constructing correlation matrix

Step 4:

Compute Eigenvalues

Step 5:

Determine the number of factors

Step 6:

Compute the factor load matrix

Step 7:

Estimable factor analysis model
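A minimal Python sketch of factor analysis with scikit-learn follows; the number of factors (Step 5) is fixed here for illustration rather than chosen by an eigenvalue criterion, and the data are placeholders.

```python
# Sketch: factor analysis returning factor scores and the loading matrix
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
D = rng.standard_normal((700, 80))        # placeholder N x d feature matrix

fa = FactorAnalysis(n_components=10, random_state=0)
Dn = fa.fit_transform(D)                  # factor scores: the reduced representation
loadings = fa.components_                 # factor loading matrix (Step 6)
print(Dn.shape, loadings.shape)           # (700, 10) (10, 80)
```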

3.1.6 LPP

LPP is an unsupervised linear dimensionality reduction technique based on a linear approximation of the non-linear Laplacian eigenmap (He and Niyogi 2003). Utilizing the graph Laplacian, a transformation matrix is computed that maps the data points to a subspace. In a certain sense, this linear transformation optimally preserves local neighborhood information. Formally, the algorithmic process is described in Algorithm 6.

Algorithm 6: LPP

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Construct the adjacency graph

Step 3:

Select the weight

Step 4:

Compute the eigenvector and eigenvalues

Step 5:

Deriving the new projected data \(D_n\)
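LPP is not available in scikit-learn, so the following from-scratch numpy/scipy sketch follows Algorithm 6 under illustrative assumptions (k-NN adjacency graph, heat-kernel weights with width t, and a small ridge term for numerical stability); it is a sketch of the technique, not the implementation used in the reviewed studies.

```python
# Sketch: locality preserving projections via a generalized eigenproblem
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def lpp(D, n_components=5, k=10, t=1.0):
    # Step 2: k-nearest-neighbour adjacency graph with pairwise distances as entries
    A = kneighbors_graph(D, n_neighbors=k, mode="distance").toarray()
    A = np.maximum(A, A.T)                               # symmetrize the graph
    # Step 3: heat-kernel weights on connected pairs
    W = np.where(A > 0, np.exp(-A ** 2 / t), 0.0)
    Deg = np.diag(W.sum(axis=1))                         # degree matrix
    L = Deg - W                                          # graph Laplacian
    # Step 4: generalized eigenproblem  D^T L D a = lambda D^T Deg D a
    M1 = D.T @ L @ D
    M2 = D.T @ Deg @ D + 1e-6 * np.eye(D.shape[1])       # ridge for stability
    vals, vecs = eigh(M1, M2)                            # eigenvalues in ascending order
    # Step 5: project on the eigenvectors with the smallest eigenvalues
    return D @ vecs[:, :n_components]

rng = np.random.default_rng(0)
D = rng.standard_normal((300, 40))                       # placeholder N x d feature matrix
print(lpp(D).shape)                                      # (300, 5)
```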

3.2 Non-linear types

3.2.1 Kernel PCA

In kernel PCA, rather than computing the covariance matrix, the principal eigenvectors of the kernel matrix are calculated; this property makes it appropriate for non-linear data mapping (Schölkopf et al. 2005).
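A minimal Python sketch of kernel PCA with scikit-learn is shown below; the RBF kernel and gamma value are illustrative choices applied to placeholder data, not the settings of the original study.

```python
# Sketch: kernel PCA with an RBF kernel (eigenvectors of the kernel matrix)
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
D = rng.standard_normal((500, 100))       # placeholder N x d feature matrix

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.01)
Dn = kpca.fit_transform(D)                # non-linear projection
print(Dn.shape)                           # (500, 10)
```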

3.2.2 t-SNE

t-SNE (Hinton and Roweis 2002), an unsupervised non-linear reduction technique, was presented by Hinton and Roweis. By matching pairwise similarity distributions between the high- and low-dimensional spaces, t-SNE captures the majority of the local structure of high-dimensional data while also revealing global structure, and it can work with manifold learning. The fundamental steps involved in t-SNE are illustrated in Algorithm 7.

Algorithm 7: t -SNE

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Compute pairwise affinities under taken perplexity

Step 3:

Initialize new data \(D_n\) randomly

Step 4:

Compute low-dimensional affinities

Step 5:

Compute gradient by minimizing the cost function

Step 6:

Deriving the new projected data \(D_n\).
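A minimal Python sketch of t-SNE with scikit-learn follows; the perplexity and the 3-D target dimension are illustrative choices on placeholder data (the default Barnes-Hut implementation supports at most three output dimensions).

```python
# Sketch: t-SNE embedding via pairwise-affinity matching and KL-cost gradient descent
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
D = rng.standard_normal((800, 150))       # placeholder N x d feature matrix

tsne = TSNE(n_components=3, perplexity=30, init="pca", random_state=0)
Dn = tsne.fit_transform(D)                # Steps 2-6 of Algorithm 7
print(Dn.shape)                           # (800, 3)
```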

3.2.3 UMAP

UMAP produces a topological representation of high-dimensional data by patching together local manifold approximations and fuzzy simplicial set representations. An equivalent topological representation can be constructed from a low-dimensional representation of the data. UMAP then optimizes the layout of the data representation in low-dimensional space to minimize the cross-entropy between the two topological representations. The basic algorithm of UMAP is depicted in Algorithm 8, and for the detailed background of UMAP, readers may refer to (McInnes et al. 2018).

Algorithm 8: UMAP

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Initialize embedding

Step 3:

Compute Local Fuzzy simplicial set

Step 4:

Compute probabilities of the points being nearest neighbors

Step 5:

Optimize embedding
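A minimal Python sketch is given below, assuming the third-party umap-learn package is installed (it is not part of scikit-learn); n_neighbors and min_dist are illustrative hyperparameters applied to placeholder data.

```python
# Sketch: UMAP embedding via fuzzy simplicial sets and cross-entropy layout optimization
import numpy as np
import umap

rng = np.random.default_rng(0)
D = rng.standard_normal((800, 150))       # placeholder N x d feature matrix

reducer = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1, random_state=0)
Dn = reducer.fit_transform(D)             # Steps 2-5 of Algorithm 8
print(Dn.shape)                           # (800, 3)
```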

3.2.4 MDS

MDS, an unsupervised non-linear DR technique that retains a similarity measure between pairs of data points, was presented by Kruskal and Wish (Carroll and Arabie 1998). It has been utilized for exploratory analysis, multivariate analysis, and data visualization. To transform the data, MDS optimizes a stress function, defined as the sum of squared errors between the dissimilarities and the corresponding inter-vector distances in the embedding. The basic steps of MDS are shown in Algorithm 9. A superset of MDS, known as non-metric MDS (NMDS), also exists. Unlike metric MDS, non-metric MDS finds a non-parametric monotonic relationship between the dissimilarities in the item-item matrix, the Euclidean distances between items, and the location of each item in the low-dimensional space. Both metric and non-metric MDS are examined in this study using Manhattan and Euclidean distance matrices, denoted MDS-E, MDS-M, NMDS-E, and NMDS-M (a combined sketch of all four variants follows Algorithm 9).

  • Metric multidimensional scaling with Euclidean distance (MDS-E): MDS-E focuses on transforming data into a lower-dimensional space while preserving the pairwise Euclidean distances between data points. It is effective for capturing linear relationships within the data.

  • Metric multidimensional scaling with Manhattan distance (MDS-M): MDS-M, similar to MDS-E, aims to represent data in a lower-dimensional space, but it preserves pairwise Manhattan distances instead of Euclidean distances. This is particularly useful when dealing with data where Manhattan distances are more appropriate, such as in city-block distance metrics.

  • Non-metric multidimensional scaling with Euclidean distance (NMDS-E): NMDS-E is a non-metric variant of multidimensional scaling that focuses on preserving the rank order of pairwise Euclidean distances rather than the actual distances. It is robust to outliers and suitable for data where exact distances may not be meaningful.

  • Non-metric multidimensional scaling with Manhattan distance (NMDS-M): NMDS-M, similar to NMDS-E, is a non-metric approach but with Manhattan distances. It is suitable for situations where the rank order of Manhattan distances is more relevant than the exact distances, offering flexibility in capturing the underlying structure of the data without assuming a linear relationship.

Algorithm 9: MDS

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Compute centering matrix

Step 3:

Determining the largest eigenvalues and corresponding eigenvectors

Step 4:

Compute the square root of the dot product of the matrix of eigenvectors and the diagonal matrix of eigenvalues

Step 5:

Deriving the new projected data \(D_n\)
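The following minimal Python sketch illustrates the four MDS variants with scikit-learn; the Manhattan-distance versions pass a precomputed dissimilarity matrix, and all settings and data are illustrative assumptions rather than the study's configuration.

```python
# Sketch: metric/non-metric MDS with Euclidean and Manhattan dissimilarities
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
D = rng.standard_normal((300, 60))                     # placeholder N x d feature matrix
D_man = pairwise_distances(D, metric="manhattan")      # city-block dissimilarity matrix

mds_e  = MDS(n_components=3, metric=True,  random_state=0).fit_transform(D)      # MDS-E
nmds_e = MDS(n_components=3, metric=False, random_state=0).fit_transform(D)      # NMDS-E
mds_m  = MDS(n_components=3, metric=True,  dissimilarity="precomputed",
             random_state=0).fit_transform(D_man)                                 # MDS-M
nmds_m = MDS(n_components=3, metric=False, dissimilarity="precomputed",
             random_state=0).fit_transform(D_man)                                 # NMDS-M
print(mds_e.shape, nmds_e.shape, mds_m.shape, nmds_m.shape)
```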

3.2.5 LLE

LLE is a local, graph-based approach that primarily preserves the local structure of the data by reconstructing each point as a convex linear combination of its neighbors (Roweis and Saul 2000). Modified LLE (M-LLE), a variant of LLE that uses multiple linearly independent local weights, was introduced by Zhang and Wang (2006). Algorithm 10 provides the fundamental steps of the LLE technique.

  • Modified LLE (M-LLE): It differs from basic LLE by incorporating adjustments for improved stability. It preserves local relationships by identifying neighborhoods, computing weights for linear combinations of neighbors, and optimizing a lower-dimensional representation with modifications to enhance robustness.

Algorithm 10: LLE

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Compute the neighbors of each data point \(D\left( i\right)\).

Step 3:

Compute the weights W that best reconstruct each data point \(D\left( i\right)\) from its neighbors by minimizing the reconstruction error rate.

Step 4:

Compute the vector \(D_n(i)\) best reconstructed by the weights, minimizing the quadratic form by its bottom non-zero eigenvectors.
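A minimal Python sketch of LLE and modified LLE with scikit-learn follows; the neighborhood size is an illustrative assumption on placeholder data (modified LLE requires the number of neighbors to exceed the number of output components).

```python
# Sketch: standard and modified locally linear embedding
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
D = rng.standard_normal((500, 80))        # placeholder N x d feature matrix

lle  = LocallyLinearEmbedding(n_components=5, n_neighbors=12, method="standard")
mlle = LocallyLinearEmbedding(n_components=5, n_neighbors=12, method="modified")
Dn_lle, Dn_mlle = lle.fit_transform(D), mlle.fit_transform(D)   # Steps 2-4 of Algorithm 10
print(Dn_lle.shape, Dn_mlle.shape)                              # (500, 5) (500, 5)
```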

3.2.6 ISOMAP

ISOMAP is a well-known non-linear DR technique for determining the intrinsic structure of the data through manifold learning. ISOMAP (Tenenbaum et al. 2000) does not learn the embedding directly in the target space. Instead, it explicitly models non-linear relationships between nearby points in terms of geodesic distances, which are learned by linearly approximating the non-linear manifold. The essential steps of ISOMAP are demonstrated in Algorithm 11.

Algorithm 11: ISOMAP

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Compute an undirected k-neighborhood graph by connecting each point \(D(i)\) to the k points with the smallest dissimilarity, using this dissimilarity as the edge weight.

Step 3:

By calculating the shortest paths through the k-neighborhood graph, the geodesic distances matrix is determined.

Step 4:

Derive the new projected data \(D_n\) using the geodesic distance matrix and metric MDS, as shown in Algorithm 9.
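A minimal Python sketch of ISOMAP with scikit-learn is shown below; the neighborhood size k is an illustrative choice applied to placeholder data.

```python
# Sketch: ISOMAP (k-neighborhood graph -> geodesic distances -> MDS step)
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
D = rng.standard_normal((500, 80))        # placeholder N x d feature matrix

iso = Isomap(n_components=5, n_neighbors=10)
Dn = iso.fit_transform(D)                 # Steps 2-4 of Algorithm 11
print(Dn.shape)                           # (500, 5)
```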

3.2.7 SOM

Teuvo Kohonen presented SOM in (Kohonen 1990), and it has since become one of the most popular non-linear unsupervised neural network algorithms for tasks like clustering, dimensionality reduction, and feature detection. In SOM, the dissimilarity of two instances in a data set with mixed-type features can be assessed separately for the numerical and categorical features. For numerical features, the dissimilarity can be calculated using the squared Euclidean distance, while for categorical features, the number of mismatches is used. Normalization is typically performed prior to computing the distance matrix to ensure that each feature has an equal impact on the distance. The fundamental steps for applying SOM after this pre-processing are shown in Algorithm 12.

Algorithm 12: SOM

Step 1:

Get the data D of \(N\times d\) matrix and label of \(N\times 1\).

Step 2:

Initializing weight vector and initial winner neighborhood

Step 3:

Draw random input vector from D

Step 4:

Determine the winning neighborhood that has a weight vector closest to the D

Step 5:

Update the weight vector (repeat from step 3 until the feature map stops changing)
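SOM-based projection is not part of scikit-learn, so the compact numpy sketch below implements Algorithm 12 from scratch and maps each sample to the grid coordinates of its best-matching unit as a simple 2-D projection; the grid size, learning-rate schedule, and data are illustrative assumptions.

```python
# Sketch: a minimal rectangular SOM trained with Algorithm 12's steps
import numpy as np

def train_som(D, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a SOM on D (N x d) and return the weight grid."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    W = rng.random((grid[0], grid[1], d))                       # Step 2: random initial weights
    gx, gy = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]), indexing="ij")
    coords = np.stack([gx, gy], axis=-1).astype(float)          # neuron grid coordinates
    for t in range(n_iter):
        x = D[rng.integers(n)]                                  # Step 3: draw a random input vector
        bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)), grid)  # Step 4: winner
        lr = lr0 * np.exp(-t / n_iter)                          # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                    # shrinking neighborhood radius
        g = np.exp(-((coords - coords[bmu]) ** 2).sum(-1) / (2 * sigma ** 2))
        W += lr * g[..., None] * (x - W)                        # Step 5: pull weights towards the input
    return W

def som_project(D, W):
    """Map each sample to the 2-D grid coordinates of its best-matching unit."""
    flat = W.reshape(-1, W.shape[-1])
    idx = np.argmin(((D[:, None, :] - flat[None]) ** 2).sum(-1), axis=1)
    return np.column_stack(np.unravel_index(idx, W.shape[:2]))

rng = np.random.default_rng(0)
D = rng.random((300, 40))                                       # placeholder normalized features in [0, 1]
Dn = som_project(D, train_som(D))
print(Dn.shape)                                                 # (300, 2)
```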

In addition, hybridized dimensionality reduction techniques that use PCA, a linear reduction technique, as the pre-reduction method and the most robust non-linear reduction techniques as integration partners, such as PCA+t-SNE (Khagi et al. 2018), kernel PCA+t-SNE, and PCA+UMAP (Khagi et al. 2018), have also been explored and are described here (a combined sketch of these hybrid pipelines follows the list below).

  • PCA+t-SNE: In EEG signal classification, combining PCA and t-SNE optimally reduces features. PCA captures global patterns, and t-SNE highlights local relationships, enhancing the distinction of subtle patterns among different EEG signal classes, resulting in a more effective low-dimensional representation for improved classification.

  • Kernel PCA+t-SNE: The key distinction between PCA + t-SNE and kernel PCA + t-SNE lies in the initial dimensionality reduction step. PCA + t-SNE employs linear PCA, which may not effectively capture complex non-linear relationships. On the other hand, kernel PCA + t-SNE uses kernel PCA in the first step, enhancing its ability to handle non-linearities and making it more powerful in specific cases compared to either technique used in isolation.

  • PCA + UMAP: This combines the efficiency of linear PCA in capturing primary sources of variance with the ability of non-linear UMAP to refine the representation by considering complex relationships. This hybrid approach is particularly beneficial for datasets with a mix of linear and non-linear structures.
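The following minimal Python sketch illustrates the three hybrid pipelines; the 30-component pre-reduction step, the placeholder data, and the availability of the third-party umap-learn package are all assumptions for illustration, not the study's exact settings.

```python
# Sketch: hybrid pipelines PCA+t-SNE, kernel PCA+t-SNE, and PCA+UMAP
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
import umap

rng = np.random.default_rng(0)
D = rng.standard_normal((800, 300))                                   # placeholder N x d feature matrix

D_pca  = PCA(n_components=30, random_state=0).fit_transform(D)        # linear pre-reduction
D_kpca = KernelPCA(n_components=30, kernel="rbf").fit_transform(D)    # non-linear pre-reduction

Dn_pca_tsne  = TSNE(n_components=3, random_state=0).fit_transform(D_pca)       # PCA + t-SNE
Dn_kpca_tsne = TSNE(n_components=3, random_state=0).fit_transform(D_kpca)      # kernel PCA + t-SNE
Dn_pca_umap  = umap.UMAP(n_components=3, random_state=0).fit_transform(D_pca)  # PCA + UMAP
print(Dn_pca_tsne.shape, Dn_kpca_tsne.shape, Dn_pca_umap.shape)
```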

4 Classification and evaluation

Following dimensionality reduction, the new projected features are fed into SVM and k-NN classifiers. Both the classifiers are briefly discussed below:

SVM: SVM, a supervised machine learning classifier, was first introduced by Cortes and Vapnik (1995). The main purpose of the SVM is to find the hyperplane that best separates the data points into two classes so that the distance between the hyperplane and the closest data points is maximized. This distance is known as the margin, and the closest points are technically referred to as support vectors. The Classification Learner app from MATLAB is used in this study to implement SVM with a Gaussian kernel function.

k-NN: k-NN is a non-linear classifier that depends on two parameters: the number of nearest neighbors and the distance metric. Euclidean, Minkowski, and Mahalanobis distances are the metrics most frequently used by the k-NN algorithm to improve classifier performance. k-NN is appropriate for EEG data because it handles large and noisy data easily and depends only on the value of k and the chosen distance metric. The Classification Learner app from MATLAB is used in this study to implement k-NN with the 'Weighted KNN' preset (weighted k-NN).
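The study itself used MATLAB's Classification Learner app; the following minimal Python sketch is only an approximation of the same classification stage, pairing a Gaussian-kernel SVM and a distance-weighted k-NN with 10-fold cross-validation on placeholder projected features and labels.

```python
# Sketch: Gaussian-kernel SVM and weighted k-NN evaluated with 10-fold CV
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
Dn = rng.standard_normal((600, 3))               # placeholder reduced features from a projection
y = rng.integers(0, 2, size=600)                 # placeholder binary labels (e.g. normal vs. disorder)

svm = SVC(kernel="rbf")                                          # Gaussian-kernel SVM
knn = KNeighborsClassifier(n_neighbors=10, weights="distance")   # weighted k-NN
for name, clf in [("SVM", svm), ("k-NN", knn)]:
    acc = cross_val_score(clf, Dn, y, cv=10, scoring="accuracy")
    print(f"{name}: mean 10-fold accuracy = {acc.mean():.3f}")
```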

4.1 Performance evaluation metrics

The efficacy of the SVM and k-NN classifiers can be assessed visually by means of a confusion matrix, also known as a contingency table or error matrix. The confusion matrix can be used to evaluate accuracy, precision, recall, and F-measure, but in this study we only computed the accuracy of the classifier because it has been considered the most important metric in most studies. Fundamentally, a confusion matrix for a binary-class problem is a two-by-two table (as shown in Fig. 2) that indicates the number of true positives (\(T_p\)), true negatives (\(T_n\)), false positives (\(F_p\)), and false negatives (\(F_n\)).

Fig. 2

Confusion matrix

  1.

    Accuracy: The ratio of the classifier's correct predictions to the total number of its predictions is known as the classifier's accuracy. Mathematically, it is expressed as shown in Eq. (1).

    $$\begin{aligned} \text {Accuracy}=\frac{T_p+T_n}{T_p+T_n+F_p+F_n} \end{aligned}$$
    (1)
  2.

    Reduced feature dimension: In this study, we also report the number of features retained after reduction.

  3.

    Dimensionality reduction rate (DRR): The purpose of feature reduction methods is to identify the most pertinent and significant features; as a result, the feature dimension can be reduced. The mathematical formulation for computing DRR is shown in Eq. (2); a minimal sketch computing both evaluation metrics follows this list.

    $$\begin{aligned} \text {DRR}=\frac{{Full}_{feature}-{Reduced}_{feature}}{{Full}_{feature}} \end{aligned}$$
    (2)

    where \({Full}_{feature}\) denotes the number of features in the full feature vector without applying any reduction technique and \({Reduced}_{feature}\) denotes the number of features retained after applying the reduction technique. DRR values range from 0 to 1: a DRR close to 1 means the feature space is strongly reduced, whereas a DRR close to 0 indicates little reduction.
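The minimal Python sketch below computes both metrics exactly as defined in Eqs. (1) and (2); the confusion-matrix counts and feature numbers are illustrative placeholders, not results from the study.

```python
# Sketch: accuracy from confusion-matrix counts (Eq. 1) and DRR (Eq. 2)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def drr(full_features, reduced_features):
    # DRR close to 1 => the feature space is strongly reduced
    return (full_features - reduced_features) / full_features

print(accuracy(tp=45, tn=42, fp=8, fn=5))            # placeholder counts -> 0.87
print(drr(full_features=1000, reduced_features=3))   # placeholder sizes  -> 0.997
```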

5 Experiment results and discussions

This empirical review conducts an extensive comparative analysis of 23 projection techniques on five sets of high-dimensional features generated from EEG signals for the classification of schizophrenia, alcoholic, focal, focal-with-deep-features, and depression datasets. Both Python and MATLAB are used for the experiments. The pre-processing methods described in Table 1 were employed to extract features in this study. The dimensionality of the features obtained from each dataset is depicted in Table 2. Following that, the high-dimensional features were projected into a low-dimensional space by employing the 23 reduction techniques listed in Table 3. Several parameters needed to be set for each projection technique; these parameter values are given in Tables 4 and 5.

Table 4 Summary of all the parameters and their values of projection techniques used in this study
Table 5 Summary of all the parameters and their values of projection techniques used in this study (Cont...)

After setting the parameter values, a comparative analysis of their performance using the k-NN and SVM classifiers with a 10-fold cross-validation approach is conducted, and the achieved results are depicted in Tables 7 and 8, respectively. The tables show three performance evaluation metrics; for a better understanding, an illustration of each cell is shown in Table 6, where the left corner of each cell shows classification accuracy, the upper right corner shows the number of features, and the lower right corner shows the DRR.

Table 6 Representation of each cell in Tables 7 and 8
Table 7 Data-wise classification accuracy achieved by the k-NN classifier (columns) for different projection techniques (rows). (Note: accuracy values are in %). Reduced feature dimensions are mentioned in the upper right corner of each cell, and DRR in the range of 0 to 1 is depicted in the lower right corner of each cell
Table 8 Data-wise classification accuracy achieved by the SVM classifier (columns) for different projection techniques (rows). (Note: accuracy values are in %). Reduced feature dimensions are mentioned in the upper right corner of each cell, and DRR in the range of 0 to 1 is depicted in the lower right corner of each cell

In Table 7, the rows represent projection techniques, while the columns represent the different datasets. Each cell shows the performance metrics, and the color of each cell encodes the accuracy value via a sequential colormap: red indicates that the technique did not perform well on the corresponding dataset, while light yellow indicates the opposite. Scanning Table 7 along rows shows how the performance of a given projection technique varies across the EEG datasets examined. For instance, we observe that the PCA+t-SNE and F-ICA projections have quite similar (good) classification accuracy across all five datasets; that is, Table 7 shows a relatively compact block of light-yellow cells. In contrast, if we concentrate on the block spanned by the NMDS-E and NMDS-M projection rows, we see little variation along the rows and similarly poor colors, depicted in red. More specifically, the average classification accuracy of PCA+t-SNE is 93.36%, which is higher than the average classification accuracy of the full feature vector. Additionally, the impact of the technique can be seen in the number of selected features, which was reduced to as few as three. At the same time, scanning Table 7 along columns shows which projection technique is best for a particular dataset. For example, the previously mentioned PCA+t-SNE projection performs best on the schizophrenia, focal, and depression datasets. This is because these datasets have a low intrinsic dimensionality (see Table 2), and this projection technique handles such data very well. PCA+t-SNE, LLE, F-ICA, and ISOMAP perform well on average for most datasets with the k-NN classifier, whereas NMDS-E, NMDS-M, SOM, and LPP perform poorly.

Further, Table 8 shows similar observations from the performance analysis of the SVM classifier. Row-wise, the PCA+t-SNE and F-ICA projections again have quite similar (good) classification accuracy across all five datasets, forming a relatively compact block of light-yellow cells. In contrast, the block spanned by the NMDS-E and NMDS-M projection rows shows little variation and similarly poor colors, depicted in red. PCA+t-SNE, F-ICA, UMAP, and LDA perform well on average for most datasets with the SVM classifier, whereas NMDS-E, NMDS-M, LPP, and kernel PCA perform poorly.

We implemented 23 different linear and non-linear reduction techniques for five sets of EEG signals in this work. Based on the experimental results presented in Tables 7 and 8, it can be concluded that PCA+t-SNE performs well on all considered EEG datasets. Using scatter plots (shown in Fig. 3), we have also visualized the results of the top four reduction techniques along with the original features, particularly for the depression EEG dataset, as this dataset performs well across all techniques. Here, three reduced features, denoted RF-# in Fig. 3, are depicted for each technique. The reduced features of the depressed and normal EEG signals are represented in two different colors (blue: normal, red: depressed). According to the visualization plot, the PCA+t-SNE technique produces a more meaningful embedding than the others, clearly distinguishing the cluster formation of the two classes.

Fig. 3

3D visualization of the top four dimensionality reduction techniques along with the original features for the depression dataset: a original features, b UMAP, c PCA+UMAP, d PCA+t-SNE, and e ISOMAP (Note: RF denotes reduced features after the projection technique is applied)

6 Conclusion

This review article presents a comparative analysis of multidimensional projection techniques from the perspective of end users who want to know how specific algorithms and parameter settings perform on high-dimensional EEG datasets. We reviewed the effectiveness of 23 recently utilized feature projection techniques (including various combinations of reduction techniques) on high-dimensional features derived from five diverse sets of EEG datasets. Performance assessment of these techniques was carried out using SVM and k-NN classifiers, both before (with the full feature vector) and after applying the reduction techniques. Based on an extensive review and evaluation using three performance metrics, four techniques emerged as the most effective. When paired with the k-NN classifier, PCA+t-SNE, LLE, F-ICA, and ISOMAP showed superior performance. When paired with the SVM classifier, PCA+t-SNE, F-ICA, UMAP, and LDA demonstrated the highest efficacy. The empirical findings, along with the visualization analysis, strongly suggest that PCA+t-SNE is the most effective reduction technique for classifying high-dimensional EEG datasets. Notably, the average classification accuracy achieved with the PCA+t-SNE reduction technique using the k-NN and SVM classifiers is 93.36% and 91.98%, respectively, across all five datasets. These results provide valuable insights for researchers in selecting the most suitable reduction technique for high-dimensional EEG features. Moreover, the findings of this empirical review have broader applicability, as they can be extended to other high-dimensional datasets from diverse domains in future research endeavors.