1 Introduction

Dimension reduction is crucial in machine learning for simplifying complex datasets (Van Der Maaten et al. 2009), reducing computational complexity (Ray et al. 2021), and mitigating the curse of dimensionality (Talpur et al. 2023), ultimately improving model performance and interpretability. Dimension reduction encompasses two primary approaches: feature selection (Solorio-Fernández et al. 2022), which involves choosing a subset of the most informative features from the original dataset to reduce dimensionality while maintaining interpretability; and feature extraction (Li et al. 2022), a method where new, lower-dimensional features are derived from the original data to capture essential patterns and relationships.

Feature extraction comprises both linear and nonlinear techniques that transform the original data into a lower-dimensional representation. Linear feature extraction techniques such as Factor Analysis (FA) (Garson 2022), Linear Discriminant Analysis (LDA) (Balakrishnama and Ganapathiraju 1998), Principal Component Analysis (PCA) (Abdi and Williams 2010), and Non-negative Matrix Factorization (NMF) (Lee and Seung 2000) transform the input data into a new set of features using linear combinations of the original input features (Wang et al. 2023).

Linear methods are relatively straightforward and computationally efficient. They often provide interpretable results, making it easier to understand the importance of each feature, and are effective when the underlying relationships in the data are approximately linear. However, because they capture only global, linear correlations, they can lose information when the data contains non-linear relationships or interactions between features (Wang et al. 2023). They are also sensitive to outliers and can be computationally expensive, particularly when dealing with high-dimensional data. Their linear projections can be difficult to interpret, and they can be prone to overfitting when the number of input features is significantly greater than the number of observations available (Jha et al. 2023).

In contrast, nonlinear feature extraction utilizes nonlinear transformations of the input features to generate a new feature set that can more effectively capture the underlying patterns present in the data (Wang et al. 2022). By mapping the data into a higher-dimensional feature space, nonlinear methods can find patterns that are not apparent in the original feature space, even when the number of features significantly exceeds the number of samples.

Nonlinear methods can also capture complex relationships between the input features and output variables without the need for domain knowledge or prior assumptions about the data, which often leads to better predictive performance (Wang et al. 2022). Manifold-based feature extraction is a nonlinear technique that relies on the assumption that high-dimensional data can be embedded in a low-dimensional space without losing important information. This is achieved by finding a non-linear mapping that preserves the structure of the data (Li et al. 2022). Common manifold-based techniques include ISOMAP (Ding et al. 2022), Locally Linear Embedding (LLE) (Miao et al. 2022), and t-SNE (t-distributed Stochastic Neighbor Embedding) (Meyer et al. 2022). These techniques may not always capture the global structure of the data, and their performance is highly dependent on hyperparameter settings.

Another effective method to extract complex, hierarchical, and high-level features from nonlinear data is deep learning. Deep learning models can automatically learn abstract and high-level features, enabling better data representation from raw data and reducing the need for handcrafted feature engineering. They can be used for end-to-end feature extraction and task-specific modelling, including image classification, object detection, Natural Language Processing (NLP), and speech recognition. In this context, there are several deep learning-based nonlinear feature extraction techniques, some of which are: Convolutional Neural Networks (CNNs) (Molaei et al. 2022), Recurrent Neural Networks (RNNs) (Shi et al. 2022), and Autoencoders (AEs) (Bank et al. 2023). Deep learning models like CNNs and RNNs often require large amounts of labelled data for training, and their training can be computationally expensive, requiring powerful hardware. Figure 1 and Table 1 show various feature extraction methods and their loss functions.

Fig. 1 Categorization of feature extraction methods into linear and non-linear approaches

Table 1 Methods for dimensionality reduction

AEs are neural networks that use the backpropagation algorithm for feature learning. They are primarily used for unsupervised learning tasks, which means they do not require labelled data during training. In contrast, CNNs and RNNs are often used for supervised or semi-supervised tasks, which rely on labelled data. This makes AEs suitable for situations where labelled data is scarce or expensive to obtain (Bank et al. 2020). Furthermore, AEs automatically learn relevant features from the data without the need for manual feature engineering, which can save significant time and effort in pre-processing. This drives AEs to capture the crucial characteristics of the input data in their encoding, thereby learning a meaningful representation of the data in the latent code (Liu et al. 2023).

AEs also provide a multitude of benefits beyond dimensionality reduction across various machine learning and data analysis applications, particularly for complex high-dimensional data. They are equally valuable in the context of data compression, where they can efficiently encode information for storage or transmission, making them particularly beneficial for resource-constrained applications. Furthermore, they excel in anomaly detection by quantifying the reconstruction error; instances with elevated reconstruction errors are flagged as anomalies, aiding in the identification of outliers or irregularities within the data (Bank et al. 2020). Data denoising is another strength of AEs: they can be trained to eliminate noise or irrelevant information from input data, enhancing data quality. Beyond these applications, AEs foster a deeper understanding of data through feature learning and the creation of meaningful representations. They also find practical utility in semantic embedding for NLP and information retrieval tasks, and they effectively reduce file sizes without compromising quality in image and signal compression (Liu et al. 2023). Furthermore, AEs contribute to privacy preservation techniques, such as differential privacy, by protecting sensitive data while enabling analysis and insights. In addition to these applications, AEs are instrumental in reducing data storage requirements, enhancing interpretability by revealing essential data features, and demonstrating robustness by generalizing well to new data and effectively handling noisy or incomplete datasets (Liu et al. 2023). Overall, AEs stand as versatile and indispensable tools, offering an extensive array of applications across diverse domains and problem types in machine learning and data analysis.

Although AEs offer a powerful set of capabilities, they also come with certain drawbacks that should be considered. One of the main drawbacks of using AEs is that they are sensitive to the choice of hyperparameters, such as the number and size of layers, the learning rate, the loss function, and the regularization. These hyperparameters can affect the performance and the quality of the autoencoder, and may require trial and error or grid search to find the optimal values (Bank et al. 2020). Another common concern with AEs is their lack of robustness. They can be sensitive to noisy data, outliers, and variations in input, which can lead to suboptimal representations and reconstructions (Singh and Ogunfunmi 2022). AEs can be prone to overfitting, especially when trained on limited data. Additionally, they may not inherently preserve the spatial or temporal locality of data during training. This can be problematic for tasks where preserving the local structure is essential, such as image segmentation or sequence modeling (Liu et al. 2023). Furthermore, AEs tend to capture lower-order features and may struggle to represent complex, higher-order relationships in the data. This limitation can impact their performance on tasks that require understanding intricate dependencies (Miuccio et al. 2022).

In recent years, substantial research efforts have been dedicated to addressing these drawbacks through advancements in deep learning and AE techniques. Some of the architectures presented in this area include regularized AEs, robust AEs, generative AEs, convolutional AEs, recurrent AEs, semi-supervised AEs, graph AEs, and masked AEs. As Fig. 2 demonstrates, these improvements have driven growing interest in the use of autoencoder algorithms in machine learning over the years. The graph shows the trend of papers published in the field of "autoencoder" and "machine learning" since 2012, revealing that over 90% of all indexed papers were published between 2018 and 2023.

Fig. 2 All papers published in Google Scholar, Web of Science, and arXiv since 2012 with the keywords "Autoencoders" and "Machine Learning"

Despite the importance of this area of research, there is currently a lack of comprehensive studies exploring the applications of AE algorithms in machine learning on a wide scale. While existing review papers have examined specific themes, no comprehensive review has been conducted. In Table 2, we compare the contribution of this paper to the descriptions of existing review papers in the field.

Table 2 Comparison of our article with the previous review or survey articles

To address this knowledge gap, our review will focus on three key research questions:

  • What are the different types of AE algorithms that have been developed and utilized in machine learning applications?

  • What are the main methodological frameworks and the latest achievements in the application of AE algorithms?

  • What are the gaps and future directions in this field, and how can they be addressed to enhance the effectiveness of AE algorithms in machine learning applications?

This review paper systematically categorizes the diverse array of applications of AEs within the domain of machine learning. Furthermore, it not only elucidates the advantages and challenges associated with these applications but also unravels the existing frameworks that underpin this evolving field. In this exploration, we offer the following contributions:

  • New taxonomy. In this paper, we propose a comprehensive new taxonomy that categorizes the major and modern AE methods developed in recent years into distinct categories within the realm of machine learning.

  • Comprehensive overview. We not only provide an exhaustive review of the variations within each AE category but also offer detailed descriptions and unified schematic representations. Our in-depth exploration of each approach includes elucidating key equations and presenting pertinent performance comparisons.

  • Abundant resources. We curate and present a valuable collection of AE resources, encompassing open-source code repositories for select reviewed methods, widely recognized benchmark datasets, and performance assessments across datasets with varying label rates.

  • Future trends. We pinpoint unresolved challenges and explore potential directions for future research, drawing insights from recent seminal studies in this field.

This paper is organized as follows. Section 2 provides a concise overview of the structure and hyperparameters of AEs. Section 3 discusses various taxonomies of AEs that have been proposed in the literature. In Sect. 4, we review previous applications of AEs in the machine learning domain, categorizing them according to the task they were used for. In Sect. 5, we explore publicly available software and platforms that can be used to construct and develop AEs, and we compare the performance of various autoencoders. Section 6 is dedicated to discussing future directions in the field. Finally, in Sect. 7, we present our conclusions based on the insights gathered from our analysis.

2 Background of autoencoder

The AE is a fundamental building block that can be used hierarchically to create deep models. AEs organize, compress, and extract high-level features, allowing unsupervised learning and the extraction of non-linear features (Chen and Guo 2023). Autoencoders have advantages over Restricted Boltzmann Machines (RBMs) as they can learn more complex data representations. RBMs are widely used for generating various data types, including images (Hinton et al. 2006). RBMs are a type of Boltzmann Machine (BM) that learns a probability distribution from inputs (Chen and Guo 2023). The main difference between AEs, RBMs, and BMs lies in their architectures. AEs have an encoder and a decoder, while RBMs consist of visible and hidden layers. BMs are more general and fully connected, making them less tractable compared to RBMs. AEs are feed-forward neural networks, allowing information to flow in one direction. In contrast, RBMs and BMs are generative models capable of generating new samples from the learned distribution.

2.1 Vanilla autoencoder

The concept of the AE was initially introduced in a research paper by Rumelhart (1985). AEs are a type of neural network designed for learning and reconstructing input data. In unsupervised learning, the primary goal is to obtain an "informative" data representation. AEs encode input data into a compressed and semantically meaningful form and then decode it to faithfully reconstruct the original input data (Bank et al. 2023). The term "vanilla" is used to describe the simplest form of autoencoder, which has no additional complexities or architectural variations. A vanilla autoencoder typically consists of an input layer, one or more hidden layers, and an output layer (Zhang et al. 2016). The structure of a vanilla autoencoder is illustrated in Fig. 3.

Fig. 3 The structure of an autoencoder, where \(X\) represents the input data of the input layer, \(Z\) represents the data in the hidden layer, and \(X'\) represents the reconstructed output data in the output layer

During the encoding step, an AE maps an input vector \(X\) to a code vector \(Z\) using an encoding function \(f_{\theta }\). In the decoding step, it maps the code vector \(Z\) back to the output vector \(X'\), aiming to reconstruct the input data using a decoding function \(g_{\theta }\). AEs adjust the network’s weights (\(W\)) through fine-tuning, achieved by minimizing the reconstruction error \(L\) between \(X\) and the reconstructed data \(X'\). This reconstruction error acts as a loss function used to optimize the network’s parameters (Chai et al. 2019). The objective function of an AE can be written as:

$$\begin{aligned} {\min _{\theta } J_{AE}(\theta ) = \min _{\theta } \sum _{i=1}^n l(x_i, x'_i) = \min _{\theta } \sum _{i=1}^n l(x_i, g_{\theta }(f_{\theta }(x_i))) } \end{aligned}$$
(1)

where \(x_i\) represents the \(i\)th training sample, \(x'_i\) represents the \(i\)th reconstructed output, and \(n\) is the total number of training samples. The term \(l\) refers to the reconstruction error between the input and output, defined as:

$$\begin{aligned} {L(X, X') = \sum _{i=1}^n \Vert X_i - X'_i\Vert ^2} \end{aligned}$$
(2)

The encoder and decoder mapping functions are \(Z = f_{\theta }(X) = s(WX + b)\) and \(X' = g_{\theta }(Z) = s(W'Z + b')\), where "s" is a non-linear activation function like sigmoid or ReLU. \(W\) and \(W'\) are weight matrices, and \(b\) and \(b'\) are bias vectors. During training, the weights and biases of the autoencoder are adjusted to minimize the reconstruction error using an optimization algorithm like stochastic gradient descent. Once trained, the encoding function can create low-dimensional representations of new input data (\(Z\)), while the decoding function can reconstruct the original data from the low-dimensional representation (\(X'\)).
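
To make the mechanics concrete, the following is a minimal PyTorch sketch of a vanilla autoencoder trained with the reconstruction error of Eq. (2); the dimensions, activation choices, and optimizer settings are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder f_theta: Z = s(W X + b)
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        # Decoder g_theta: X' = s(W' Z + b')
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional code Z
        return self.decoder(z)     # reconstruction X'

model = VanillaAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()             # the reconstruction error of Eq. (2)

x = torch.rand(64, 784)            # a dummy mini-batch in [0, 1]
for _ in range(10):                # a few stochastic gradient descent steps
    optimizer.zero_grad()
    x_rec = model(x)
    loss = loss_fn(x_rec, x)
    loss.backward()
    optimizer.step()
```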

2.2 Stacked autoencoder

A traditional AE typically employs a single-layer encoder, making it challenging to extract deep features. To enhance feature extraction, one effective strategy is to deepen the neural network structure. By employing a layer-wise learning approach, multiple basic autoencoders can be stacked together to form a Stacked Autoencoder (SAE), allowing for the extraction of complex data features. Each individual autoencoder learns a condensed data representation, and the final output is obtained by combining the outputs of these individual autoencoders. Typically, training a Stacked Autoencoder follows a layer-wise approach (Hoang and Kang 2019; Hinton et al. 2006): after layer 1 is trained, its encoded output serves as the input for training layer 2, and the reconstruction loss of layer 2 is evaluated relative to the output of layer 1 rather than to the input layer. The encoding process can be mathematically represented as follows:

$$\begin{aligned} {a^k = f(W_e^k a^{k-1} + b_{e}^k), \quad k = 1 : n} \end{aligned}$$
(3)

in which k indexes the k-th autoencoder, \(a^k\) represents the encoding outcome of the k-th autoencoder, and when \(k = 1\), \(a^0 = x\) denotes the input data. The decoding process can be mathematically represented as follows:

$$\begin{aligned} {c^k = f(W^{n-(k-1)} c^{k-1} + b^{n-(k-1)}), \quad k = 1 : n} \end{aligned}$$
(4)

when \(k = 1\), \(c^0 = a^n\), and when \(k = n\), \(c^n = \hat{x}\) represents the reconstructed data of the input variable x (Hoang and Kang 2019).
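
A minimal sketch of this greedy layer-wise procedure in PyTorch is shown below; the layer sizes, optimizer, and number of steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

dims = [784, 256, 64]                  # a^0 = x, then two stacked encoders
data = torch.rand(256, dims[0])

encoders = []
a = data                               # a^0 = x
for k in range(1, len(dims)):
    enc = nn.Sequential(nn.Linear(dims[k - 1], dims[k]), nn.ReLU())
    dec = nn.Linear(dims[k], dims[k - 1])
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        # the reconstruction loss of layer k is measured against a^{k-1}, not x
        loss = nn.functional.mse_loss(dec(enc(a)), a)
        loss.backward()
        opt.step()
    encoders.append(enc)
    a = enc(a).detach()                # a^k = f(W_e^k a^{k-1} + b_e^k), Eq. (3)
# the stacked encoder is the composition of the modules collected in `encoders`
```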

2.3 Hyperparameters in autoencoder

Autoencoders come with various hyperparameters that must be defined prior to training, and their values can significantly influence the model’s performance. It is important to understand that certain hyperparameters are usually set before training and remain constant, while others can be dynamically tuned during training to optimize the model’s performance. Selecting and adjusting hyperparameters often involves experimentation and validation to achieve the best results for a particular task. The following outlines the most common hyperparameters in autoencoders:

  • Number of Hidden Layers: The quantity of hidden layers within the autoencoder defines its network depth and its capacity to capture intricate data patterns. This parameter is configured before training. While adding more hidden layers can enhance the model’s representational power, it may also introduce optimization challenges and elevate the risk of overfitting.

  • Number of Neurons in Each Layer: The number of neurons in each layer governs the network’s data representation capacity and is typically set before training. A higher count of neurons can amplify the network’s capacity but might also elevate the risk of overfitting and complicate the optimization process.

  • Size of Latent Space: Adjusting the size of the bottleneck layer permits fine-tuning the balance between model complexity and performance. This parameter is set prior to training.

  • Activation Function: The activation function utilized in the bottleneck layer plays a pivotal role in the autoencoder’s performance and should be chosen before training. These functions determine the network’s nonlinearity and its ability to learn intricate data patterns. Common activation functions employed in bottleneck layers include sigmoid, tanh, ReLU, and SELU. Further details, including their equations, outputs, and output curves, are outlined in Table 3.

  • Objective Function: The objective function, also known as the loss function, is a critical element of an autoencoder, serving to train the network by minimizing the distinction between input and output data. It gauges the dissimilarity between the input and output data, and the autoencoder is trained to diminish this dissimilarity. The selection of the objective function hinges on the data type and the specific application and is generally determined before training. Common objective functions used in autoencoders include:

    • Mean Squared Error (MSE): This is the predominant objective function in autoencoders, measuring the average squared difference between input and output data. MSE is defined as:

      $$\begin{aligned} L_{\text {AE}}(X, X') = \min \left( \Vert X - X'\Vert _F^2 \right) \end{aligned}$$
      (5)
    • Binary Cross-Entropy (BCE): BCE is employed when the input data is binary (0 or 1); this function measures the difference between predicted and actual output in terms of binary cross-entropy loss. Cross-entropy is defined as:

      $$\begin{aligned} L_{\text {AE}}(X, X') = -\sum _{i=1}^{n} \left( x_i \log (x'_{i}) + (1 - x_i) \log (1 - x'_{i}) \right) \end{aligned}$$
      (6)

    When choosing an autoencoder loss function, consider the problem’s unique needs. MSE suits regression-style reconstruction of real-valued data but is sensitive to data scaling and outliers; BCE suits binary inputs but can be numerically unstable for probabilities near 0 or 1. MSE is the most prevalent autoencoder loss function, quantifying the discrepancy between the input and its reconstruction.

  • Optimization Algorithm: Autoencoders utilize optimization algorithms to minimize the objective function during training. These algorithms adjust network weights and biases to train the autoencoder effectively. The choice of the optimization algorithm is made prior to training but may involve hyperparameter tuning during training. Several optimization techniques can be employed to train autoencoders, with the most notable ones being Stochastic Gradient Descent (SGD), Adam, and Adagrad. Further elaboration on each of these methods is provided below.

    • Stochastic Gradient Descent (SGD): A widely used algorithm that updates network parameters after processing small batches of data. While computationally efficient, it may converge slowly for complex models and datasets. Careful tuning of initial learning rates is often needed.

    • Adam: Combines features of SGD with adaptive learning rates and momentum to accelerate convergence and reduce the risk of getting stuck in local minima. Requires tuning of hyperparameters like beta1 and beta2 and is suitable for non-stationary and noisy objectives.

    • Adagrad: An adaptive algorithm adjusting learning rates based on parameter update frequency. Effective for sparse data, it can lead to quick convergence but may also converge prematurely and face challenges with non-convex optimization.

    The choice of optimization algorithm depends on dataset size, model complexity, loss function type, and computational resources. Each has its advantages and disadvantages, so selecting the right one is crucial for optimal performance.

  • Learning Rate: The learning rate, a hyperparameter, dictates the step size during optimization. It influences weight and bias updates and the convergence speed of the objective function. High values can cause overshooting, while low ones may lead to local minima trapping. The learning rate is preset but may be adjusted with schedules during training for better convergence.

  • Number of Epochs: Epochs are training iterations, representing full dataset passes. More epochs can enhance model accuracy but risk overfitting. The ideal count depends on dataset size and problem complexity. The number of epochs initially set may require modification if convergence is not reached, and early stopping can be used to curb overfitting.

  • Batch Size: Batch size, in each optimization iteration, affects gradient noise and optimization efficiency. Smaller sizes yield noisier gradients but faster, memory-efficient optimization. Larger sizes offer stable gradients but slower, memory-intensive optimization. Batch size is determined beforehand but can be adjusted during training for optimization and memory use.

    These interconnected hyperparameters necessitate careful selection for optimal performance; although such experimentation requires a time investment, it is crucial for building an effective autoencoder model. A configuration sketch follows this list.
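
As an illustration of how these choices come together, the following PyTorch sketch fixes a hypothetical configuration before training; every value shown is an assumption for demonstration, not a recommendation.

```python
import torch
import torch.nn as nn

config = {
    "hidden_layers": [512, 128],    # number of hidden layers and neurons per layer
    "latent_dim": 32,               # size of the latent space (bottleneck)
    "activation": nn.ReLU,          # activation function
    "loss": nn.MSELoss(),           # objective: MSE for real-valued inputs,
                                    # nn.BCELoss() for binary inputs
    "optimizer": torch.optim.Adam,  # SGD, Adam, or Adagrad
    "lr": 1e-3,                     # learning rate
    "epochs": 50,                   # number of full passes over the data
    "batch_size": 128,              # mini-batch size per optimization step
}

def build_encoder(input_dim, cfg):
    """Stack Linear + activation blocks down to the latent dimension."""
    sizes = [input_dim] + cfg["hidden_layers"] + [cfg["latent_dim"]]
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), cfg["activation"]()]
    return nn.Sequential(*layers)

encoder = build_encoder(784, config)
optimizer = config["optimizer"](encoder.parameters(), lr=config["lr"])
```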

Table 3 Activation functions

3 Autoencoder taxonomy

Autoencoders, frequently employed in unsupervised learning, excel in dimensionality reduction tasks. They adeptly capture intricate, non-linear data relationships, enabling a hierarchical transformation of high-dimensional input into a lower-dimensional latent space. Autoencoders exhibit remarkable flexibility, allowing for customization across diverse data types and tasks by adjusting their architecture or objective functions. Over the past decade, a myriad of autoencoder variants has emerged, as illustrated in Fig. 4.

The regularized autoencoder enhances feature discrimination through the incorporation of regularization techniques. The robust autoencoder aims to fortify the encoded data against noise or outliers, enhancing its ability to handle noisy or corrupted input data. The generative autoencoder specializes in learning a generative model using the extracted encoded representations, enabling the generation of new data samples closely resembling the distribution of the training data. Convolutional autoencoders replace fully connected layers with convolutional layers in both the encoder and decoder, making them particularly well-suited for image data by excelling at capturing spatial relationships within the data. Recurrent autoencoders leverage recurrent layers, such as LSTM or GRU, in both the encoder and decoder, proving invaluable for sequence data by capturing temporal dependencies within the information. Semi-supervised autoencoders harness the power of both labeled and unlabeled data to enhance model performance and generalization, demonstrating their value in scenarios with limited labeled data or resource constraints. The graph autoencoder leverages graph structures to learn data representations by processing graph-structured inputs and utilizing graph convolutional layers, allowing for the effective modeling of complex data dependencies. Masked autoencoders represent a straightforward autoencoding technique designed to reconstruct the original signal from its partially observed form.

The breadth of autoencoder models and their specialization options empowers fine-tuning for various applications. The adaptability of the autoencoder architecture and objective functions underscores their ability to be tailored to specific use cases, establishing them as indispensable tools for machine learning researchers and developers. In the following sections, we provide detailed explanations for each category.

Fig. 4 Taxonomy of autoencoder architectures categorized by network structure

3.1 Regularized autoencoder

Regularized Autoencoder (RAE) is a neural network architecture that extracts a compressed representation of input data while enforcing regularization constraints. These constraints encourage the formation of a discriminative, low-dimensional feature space. By incorporating different regularization techniques into the autoencoder, it becomes possible to create specialized models with desired properties, such as sparsity, manifold structure, or orthogonality.

3.1.1 Sparse autoencoder

Sparse Autoencoder (SAE) (Ng 2011) is characterized by having a limited number of simultaneously active neural nodes, as it aims to learn a sparse representation of input data by incorporating a sparsity constraint into the loss function. Its objective is to minimize the disparity between input data and reconstructed data while adhering to constraints on the sparsity of the latent representation. The loss function in a Sparse Autoencoder (SAE) comprises two components: the reconstruction loss and the sparsity loss, represented as follows:

$$\begin{aligned} {L_{\text {SAE}}(X, X') = \min \left( \Vert X - X'\Vert _F^2 + \lambda \text {KL}(p \parallel q) \right) } \end{aligned}$$
(7)

where \(\text {KL}(p \parallel q)\) calculates the Kullback–Leibler divergence between a target sparsity parameter (p) and the estimated average activation of each neuron (q) during training, defined as

$$\begin{aligned} {\sum p \log \left( \frac{p}{q} \right) + (1-p) \log \left( \frac{1-p}{1-q} \right) } \end{aligned}$$
(8)

This combined penalty term encourages the model to acquire a sparse representation, wherein only a limited number of neurons are active for each input.
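
A small sketch of this penalty in PyTorch is given below, assuming sigmoid activations so that the mean activation of each neuron lies in (0, 1); the target sparsity p and the numerical-stability epsilon are illustrative.

```python
import torch

def kl_sparsity(z, p=0.05, eps=1e-8):
    """KL(p || q) of Eq. (8), summed over hidden units; z holds sigmoid codes of a batch."""
    q = z.mean(dim=0).clamp(eps, 1 - eps)   # estimated mean activation per neuron
    return (p * torch.log(p / q)
            + (1 - p) * torch.log((1 - p) / (1 - q))).sum()

# usage inside a training step:
# loss = mse_loss(x_rec, x) + lam * kl_sparsity(z)
```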

3.1.2 Contractive autoencoder

Contractive Autoencoder (CAE) (Rifai et al. 2011) is an autoencoder that aims to produce similar representations for similar input data by adding a penalty term to the loss function. This penalty term, based on the Frobenius norm of the Jacobian matrix of the encoder with respect to the input data, encourages local stability in the learned representation. The primary objective of the CAE is to minimize the difference between the input data and the reconstructed data while taking the penalty term into account, promoting similarity in representations for similar input data. The overall loss function of CAE includes the reconstruction loss and a penalty term as follows:

$$\begin{aligned} {L_{\text {CAE}}(X, X') = \min \left( \Vert X - X'\Vert _F^2 + \lambda \Vert J_F(X)\Vert _F^2 \right) } \end{aligned}$$
(9)

where \(\Vert J_F(X)\Vert _F^2\) represents the squared Frobenius norm of the Jacobian matrix of the encoded representation with respect to the input data. This norm measures the sensitivity of the encoded representation to small variations in the input, calculated as:

$$\begin{aligned} {\Vert J_F(X)\Vert _F^2 = \sum _{i,j} \left( \frac{\partial h_j(X)}{\partial X_i}\right) ^2} \end{aligned}$$
(10)
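
For a one-layer sigmoid encoder \(h = s(WX + b)\), the Jacobian entries have the closed form \(\partial h_j/\partial X_i = h_j(1-h_j)W_{ji}\), so the penalty of Eq. (10) can be computed without explicit automatic differentiation; a hypothetical PyTorch sketch:

```python
import torch

def contractive_penalty(h, W):
    """h: (batch, hidden) sigmoid codes; W: (hidden, input) encoder weight matrix."""
    dh = h * (1 - h)                  # elementwise h_j (1 - h_j)
    w_row_sq = (W ** 2).sum(dim=1)    # sum_i W_ji^2 for each hidden unit j
    # ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2, averaged over the batch
    return ((dh ** 2) * w_row_sq).sum(dim=1).mean()

# usage: loss = mse_loss(x_rec, x) + lam * contractive_penalty(h, encoder_linear.weight)
```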

3.1.3 Laplacian autoencoder

The standard Autoencoder may not emphasize the relationships between nearby data points during its learning process, which can lead to extracted features lacking crucial information about the data’s internal structure. In contrast, the Laplacian Autoencoder prioritizes preserving the distances between neighboring data points, effectively capturing the significant internal structure within the data. Inspired by this concept, the Laplacian Autoencoder (LAE) (Jia et al. 2015) was introduced to facilitate the generation of lower-dimensional representations for Autoencoders. This approach ensures that the learned representations incorporate essential local structural information, enhancing their suitability for specific data analysis tasks. The loss function for the Laplacian Autoencoder is defined as follows:

$$\begin{aligned} {L_{LAE}(X,X') = \min \left( \Vert X-X'\Vert _F^2 + \lambda \, {\text {tr}}(Z^T LZ) \right) } \end{aligned}$$
(11)

where the matrix L, known as the graph Laplacian, is calculated from pairwise similarities between data points in the latent space. This calculation typically involves techniques such as k-nearest-neighbor graphs or Gaussian kernels.

3.1.4 Orthogonal autoencoder

Orthogonal Autoencoder (OAE) (Wang et al. 2019) is designed to enhance the orthogonality of learned embeddings, leading to more discriminative and diverse feature representations. Unlike the standard Autoencoder, OAE introduces a regularization term known as the orthogonal reconstruction error into the reconstruction loss function. This term promotes orthogonality among latent features, thereby improving class discriminability. The OAE loss function can be expressed as follows:

$$\begin{aligned} {L_{OAE}(X,X') = \min \left( \Vert X - X'\Vert _F^2 + \lambda \Vert Z^T Z - I\Vert _F^2 \right) } \end{aligned}$$
(12)

where \(I\) is the identity matrix, \(Z^T\) represents the transpose of the compressed representation \(Z\), and \(\lambda\) is a penalization parameter. Notably, setting \(\lambda\) to zero yields a conventional autoencoder.

3.2 Robust autoencoder

Robust Autoencoders (RAEs) are utilized to enhance the robustness of autoencoders when dealing with noisy or corrupted input data. They prove especially valuable in situations where the input data exhibits noise, outliers, or imperfections. These issues are commonplace in real-world datasets, including those found in healthcare, finance, and sensor networks, where RAEs can effectively handle the data’s imperfections while retaining its valuable information. Three primary variants of robust autoencoders are the Denoising Autoencoder, the Marginalized Denoising Autoencoder, and the \(L_{2,1}\) Autoencoder.

3.2.1 Denoising autoencoder

Denoising Autoencoder (DAE) (Vincent et al. 2010) is designed to reconstruct clean data from noisy input by introducing noise during training. The primary objective is to minimize the dissimilarity between the clean data and the reconstructed output. DAE training involves intentionally corrupting input data with various forms of noise and then minimizing the difference between the original clean input data and the reconstructed clean data. This process allows the DAE to discern valuable features within the input data while disregarding noise and irrelevant aspects. The DAE loss function is expressed as follows:

$$\begin{aligned} {L_{DAE}(X,X') = \min \left( \Vert X - X'\Vert _F^2 \right) } \end{aligned}$$
(13)

where X represents the clean input data and \(X' = g_{\theta }(f_{\theta }(\hat{X}))\) denotes the reconstruction produced from the corrupted input \(\hat{X}\).
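
A single DAE training step might look like the following PyTorch sketch, where `model` is any autoencoder (for instance, the vanilla sketch above) and additive Gaussian noise is one of several possible corruption choices:

```python
import torch

def dae_step(model, optimizer, x, noise_std=0.1):
    x_noisy = x + noise_std * torch.randn_like(x)    # corrupted input \hat{X}
    optimizer.zero_grad()
    x_rec = model(x_noisy)                           # X' reconstructed from \hat{X}
    loss = torch.nn.functional.mse_loss(x_rec, x)    # compared against the CLEAN x
    loss.backward()
    optimizer.step()
    return loss.item()
```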

3.2.2 Marginalized denoising autoencoder

Marginalized Denoising Autoencoder (M-DAE) (Chen et al. 2012) is a specialized version of the Denoising Autoencoder (DAE) designed to handle datasets with missing or incomplete features. Like the standard DAE, the M-DAE is a neural network crafted to reconstruct clean input data from noisy versions. It achieves this by restoring clean data from corrupted counterparts, where input data X is intentionally subjected to random corruption. Each feature has a probability p of being set to 0, creating these corrupted versions referred to as \(\hat{X}_i\). The primary goal of the M-DAE is to minimize a specific loss function represented as:

$$\begin{aligned} {L_{\text {M-DAE}}(X,X') = \min \left( \frac{1}{m} \sum _{i=1}^{m} \Vert X - \hat{X}_i W\Vert _F^2 \right) } \end{aligned}$$
(14)

where W signifies the learned transformation matrix, and m represents the total number of input examples.

The M-DAE seeks the best solution for W, which can be expressed mathematically as:

$$\begin{aligned} {W = E[Q^{-1}]E[P]} \end{aligned}$$
(15)

where \(E[Q^{-1}]\) and E[P] denote expectations over the random corruption; these expectations can be computed in closed form from the covariance matrix of the uncorrupted data X and the corruption probability p.

3.2.3 L2,1 robust autoencoder

\(L_{2,1}\) Robust Autoencoder (\(L_{2,1}\)-RAE) (Li et al. 2018) is a modified version of the Robust Autoencoder (RAE) designed to enhance the autoencoder’s resilience when dealing with noisy or corrupted input data. This enhancement is achieved through the use of a specific type of regularization known as L2,1 regularization. L2,1 regularization encourages the learned features to possess specific properties. Notably, it promotes feature sparsity, meaning that most features consist of zeros, and robustness, enabling them to handle scenarios with data outliers or noise. The mathematical expression of the L2,1-RAE loss function is given as follows:

$$\begin{aligned} L_{\text {2,1-RAE}}(X,X') = \min \left( \Vert X - X'\Vert _F^2 + \lambda \cdot \Vert Z \Vert _{2,1}\right) \end{aligned}$$
(16)

where \(\Vert Z \Vert _{2,1}\) represents the L2,1-norm of the latent representations, which emphasizes both sparsity and robustness in these learned features.

3.3 Generative autoencoder

Generative Autoencoder (GAE) differs from traditional autoencoders by focusing on learning the underlying probability distribution of data rather than just dimensionality reduction. This enables the GAE to generate new data samples that resemble the training data, making it valuable for tasks like image or text generation. Examples of GAEs include the Variational Autoencoder, the Adversarial Autoencoder, the Bayesian Autoencoder, and the Diffusion Autoencoder.

3.3.1 Variational autoencoder

Variational Autoencoder (VAE) (An and Cho 2015) is a type of autoencoder that learns to represent data in a lower-dimensional latent space and generate new data samples that resemble the input. Unlike traditional autoencoders, VAEs are generative models that can capture the underlying distribution of input data. In a VAE, the encoder maps input data to a posterior distribution \(q(Z\vert X)\) instead of a fixed latent representation Z. During reconstruction, Z is sampled from this distribution and passed through a decoder. The regularization loss in VAE encourages \(q(Z\vert X)\) to match a specific distribution, often a standard Gaussian. The VAE loss function is defined as:

$$\begin{aligned} L_{\text {VAE}} = -E_{q(Z\vert X)} \left[ \log p(X\vert Z) \right] + \text {KL}(q(Z\vert X)\vert \vert p(Z)) \end{aligned}$$
(17)

the first term measures the difference between the original input data and the data reconstructed by the decoder through the likelihood \(p(X\vert Z)\). The second term, a regularization component, quantifies the KL divergence between \(q(Z\vert X)\) and p(Z), typically a standard Gaussian distribution. This loss function guides VAE training to balance accurate data reconstruction with a structured latent space for generative purposes.
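
The following minimal PyTorch sketch implements Eq. (17) with a Gaussian encoder, the standard reparameterization trick, and the closed-form KL term against \(N(0, I)\); the architecture and dimensions are illustrative, and the inputs are assumed to lie in [0, 1] so that BCE applies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(input_dim, 128)
        self.mu = nn.Linear(128, latent_dim)        # mean of q(Z|X)
        self.logvar = nn.Linear(128, latent_dim)    # log-variance of q(Z|X)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = F.binary_cross_entropy(x_rec, x, reduction="sum")    # -E[log p(X|Z)]
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()  # KL(q(Z|X) || N(0, I))
    return rec + kl

model = VAE()
x = torch.rand(32, 784)                    # inputs assumed in [0, 1]
x_rec, mu, logvar = model(x)
loss = vae_loss(x, x_rec, mu, logvar)
```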

3.3.2 Adversarial autoencoder

Adversarial Autoencoder (AAE) (Makhzani et al. 2015) is a specialized type of autoencoder designed to align its learned latent representations with a desired prior distribution. It consists of three main parts: an encoder, a decoder, and a discriminator. The encoder and decoder work together to create data that can deceive the discriminator, which is trained to distinguish between real input data and fake data produced by the decoder. The adversarial loss in AAE assesses its ability to generate data that resembles the original input data distribution. The discriminator aims to maximize its accuracy in telling real and generated data apart, while the decoder aims to minimize the discriminator’s accuracy. The overall loss function for AAE is expressed as:

$$\begin{aligned} L_{AAE}(X,X') = \min \left( \Vert X - X'\Vert _F^2 + \log (D(X)) + \log (1 - D(G(Z)))\right) \end{aligned}$$
(18)

where \(G(Z)\) is the decoder function that converts the latent representation back to the original input data, and D(X) represents the discriminator’s output for the original input data. The term \(\log (1-D(G(Z)))\) reflects the discriminator’s output for data generated by the decoder.

3.3.3 Bayesian autoencoder

Bayesian Autoencoder (BAE) (Yong and Brintrup 2022) is a probabilistic AE that models all parameters, in contrast to the Variational Autoencoder (VAE) that mainly models the latent layer. BAE combines a Gaussian likelihood for data reconstruction with an isotropic Gaussian prior for parameter uncertainty. The loss function maximizes data likelihood and minimizes model complexity. The BAE loss function is defined as:

$$\begin{aligned} \log p(x\vert \theta ) = -\frac{1}{D} \sum _{i=1}^{D} \left( \frac{1}{2\sigma _i^2}(x_i - x'_i)^2 + \frac{1}{2} \log \sigma _i^2\right) \end{aligned}$$
(19)

where \(\sigma _i^2\) is the variance of the Gaussian distribution, and \(\log p(x\vert \theta )\) represents the log-likelihood of observing the original data x given the model parameters \(\theta\). It quantifies data reconstruction through squared errors and variances while promoting model simplicity. The training objective is to maximize this log-likelihood while minimizing regularization to find optimal parameters \(\theta\) for effective data pattern and uncertainty capture.
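
A small sketch of the per-sample Gaussian log-likelihood of Eq. (19), with \(\log \sigma _i^2\) treated as a learnable per-dimension tensor, might read:

```python
import torch

def gaussian_log_lik(x, x_rec, log_var):
    """Per-sample average of -(squared error / (2 sigma^2) + 0.5 log sigma^2)."""
    var = log_var.exp()
    return -((x - x_rec) ** 2 / (2 * var) + 0.5 * log_var).mean(dim=-1)

log_var = torch.zeros(10, requires_grad=True)   # learnable per-dimension log sigma^2
x, x_rec = torch.randn(8, 10), torch.randn(8, 10)
objective = gaussian_log_lik(x, x_rec, log_var).mean()  # maximized during training
```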

3.3.4 Diffusion autoencoder

Diffusion Autoencoder (DiffusionAE) (Preechakul et al. 2022) is a specialized type of autoencoder designed for generative modeling tasks. It draws inspiration from diffusion models and is engineered to capture intricate data distributions. In this framework, data is subjected to a progressive denoising process, allowing the model to grasp complex data patterns effectively. A fundamental element of DiffusionAE is its employment of a unique loss function known as the Diffusion Probabilistic Loss. This loss function guides the training by modeling how data evolves over time. Mathematically, the loss function is represented as:

$$\begin{aligned} L(X, X') = -\log P(X \vert X') \end{aligned}$$
(20)

in which \(P(X \vert X')\) signifies the conditional probability of observing the original data X when given the reconstructed data \(X'\). During training, the primary objective is to minimize this loss, driving the Diffusion Autoencoder to generate \(X'\) that closely resembles the original data X.

3.4 Convolutional autoencoder

Convolutional Autoencoder (CAE) (Seyfioğlu et al. 2018) employs convolutional layers instead of fully connected layers in both the encoder and decoder. The encoder uses these layers to create a compact representation from input images, while the decoder employs deconvolution layers for image reconstruction. CAEs are particularly effective for image data, as they excel at capturing spatial dependencies, which refer to the patterns and relationships among pixels or locations within individual images or data frames. They find wide-ranging applications in tasks such as image denoising, inpainting, segmentation, and super-resolution.

3.4.1 Convolutional variational autoencoder

Convolutional Variational Autoencoder (CVAE) (Semeniuta et al. 2017) is a significant variant of Convolutional Autoencoders (CAEs) that incorporates probabilistic modeling, allowing for the generation of new data samples. In a CVAE, input images undergo a reconstruction process where a latent variable, denoted as Z, is sampled from a Gaussian distribution and subsequently passed through a decoder. This decoder employs convolutional and upsampling layers to reconstruct the original image. The loss function in the CVAE, similar to that of the Variational Autoencoder (VAE), is defined as:

$$\begin{aligned} L_{\text {CVAE}} = -E_{q(Z\vert X)} \left[ \log p(X\vert Z) \right] + \text {KL}(q(Z\vert X)\vert \vert p(Z)) \end{aligned}$$
(21)

the first term measures the difference between the original image and its reconstruction by the decoder, while the second term encourages the latent representation \(q(Z\vert X)\) to follow a standard Gaussian distribution through KL divergence regularization, ensuring a structured latent space for effective generative capabilities.

3.4.2 Convolutional LSTM autoencoder

Convolutional LSTM (ConvLSTM) (Luo et al. 2017) is an advanced neural network architecture specialized in spatiotemporal data analysis. It combines convolutional structures with recurrent operations, allowing it to capture both spatial dependencies and temporal relationships in data. This design employs 3D tensors, with the last two dimensions representing spatial dimensions (e.g., rows and columns). ConvLSTM excels in tasks involving both spatial and temporal patterns, such as precipitation nowcasting and video analysis. It utilizes a unique loss function to make predictions based on neighboring cells, consistently outperforming traditional RNNs and contemporary algorithms in various spatiotemporal forecasting applications. The overall loss function for a ConvLSTM can be defined as follows:

$$\begin{aligned} {L_{\text {ConvLSTM}}(X,X') = \min \sum _{i=1}^N \sum _{j=1}^M \sum _{t=1}^T \left( \Vert X_{ijt} - X'_{ijt}\Vert _F^2 \right) } \end{aligned}$$
(22)

where N is the number of spatial rows in the data, M is the number of spatial columns in the data, T is the number of time steps in the sequence, \(X_{ijt}\) represents the ground truth value at spatial location (i, j) at time step t, and \(X'_{ijt}\) represents the predicted value at spatial location (i, j) at time step t.

3.4.3 Convolutional sparse autoencoder

Convolutional Sparse Autoencoder (CSAE) (Luo et al. 2017) is a neural network architecture that combines convolutional autoencoder principles with techniques to induce sparsity, such as max-pooling and feature channel competition. This integration simplifies the training process by eliminating the need for complex optimization procedures. CSAE includes a sparsifying module designed to create sparse feature maps. This module retains the highest value and its corresponding position within each local subregion before performing unpooling, primarily through max pooling. The loss function used in CSAE, which quantifies the disparities between the original input and the reconstructed output, relies on the Frobenius norm and is defined as follows:

$$\begin{aligned} {L_{\text {CSAE}}(X,X') = \min \sum _{l=1}^L \left( \Vert X^{(l)} - {X'}^{(l)}\Vert _F^2 \right) } \end{aligned}$$
(23)
$$\begin{aligned} {{X'}^{(l)} = \sum _{i=1}^d \left( \text {rot}(W_i, 180) *Z_i^{(l)} \right) + c_i} \end{aligned}$$
(24)
$$\begin{aligned} {Z_i^{(l)} = G_{p,s}\left( f(W_i \cdot X^{(l)} + b_i) \right) } \end{aligned}$$
(25)

where L is the number of layers and l indexes the layer, \(X^{(l)}\) represents the original input at layer l, \({X'}^{(l)}\) represents the reconstructed output at layer l, d is the number of feature channels, \(Z^{(l)}_i\) is the ith sparsified feature map, and \(G_{p,s}(\cdot )\) represents the sparsifying operator, involving max-pooling and unpooling operations to create sparse feature maps.

3.5 Recurrent autoencoder

RNNs (Medsker and Jain 2001) are designed for processing sequential data, like time series, where the current state (\(h^t\)) relies on the previous state (\(h^{t-1}\)). Vanilla RNNs have a limitation of short-term memory, leading to gradient problems in long sequences. To address this, LSTM networks, equipped with three gates (forget gate, input gate, and output gate), and GRU networks, which consist of two gates (update gate and reset gate), were introduced. These architectures incorporate self-loops to effectively manage gradients over extended sequences, addressing the vanishing or exploding gradient issue. The Recurrent Autoencoder is an autoencoder that incorporates recurrent layers, such as LSTM or GRU, within both the encoder and decoder components.

3.5.1 Long short term memory autoencoder

LSTM Autoencoder (LSTMAE) (Nguyen et al. 2021) is an advanced variation of the recurrent autoencoder, specifically designed to capture representations from sequential data. In this architecture, both the encoder and decoder components are built using LSTM units, a type of recurrent layer. The encoder LSTM takes in a sequence of vectors, which can represent images or features. In contrast, the decoder LSTM reconstructs the original input sequence, often in reverse order. The MSE loss function computes the average squared differences between the input and the reconstructed output at each time step. The formula for MSE loss is as follows:

$$\begin{aligned} {L_{\text {LSTMAE}}(X,X')= \min \left( \Vert X - X'\Vert _F^2 \right) } \end{aligned}$$
(26)

where X represents the clean input sequence and \(X'\) represents the reconstructed output sequence.
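
A minimal PyTorch sketch of an LSTM autoencoder is given below; here the encoder's final hidden state serves as the sequence code and is repeated at every decoding step, one common design among several (shapes and sizes are illustrative).

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                        # x: (batch, time, features)
        _, (h, _) = self.encoder(x)              # h: (1, batch, hidden), sequence code
        t = x.size(1)
        rep = h.transpose(0, 1).repeat(1, t, 1)  # feed the code at every time step
        dec, _ = self.decoder(rep)
        return self.out(dec)                     # reconstructed sequence X'

model = LSTMAutoencoder()
x = torch.randn(4, 20, 8)
loss = nn.functional.mse_loss(model(x), x)       # the MSE loss of Eq. (26)
```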

3.5.2 Gated recurrent unit autoencoder

GRU Autoencoder (GRUAE) (Dehghan et al. 2014) employs GRU units in both the encoder and decoder parts. Unlike LSTM, GRU has a simpler architecture with only two gates: the update and reset gates. This architectural simplicity can lead to easier training and faster processing while still capturing long-term dependencies in input sequences. The formulation of a GRU Autoencoder is similar to that of an LSTM Autoencoder, making it flexible and effective for modeling sequential data,

$$\begin{aligned} {L_{\text {GRUAE}}(X,X') = \min \left( \Vert X - X'\Vert _F^2 \right) } \end{aligned}$$
(27)

where X represents the clean input sequence and \(X'\) represents the reconstructed output sequence.

3.5.3 Bidirectional autoencoder

Bidirectional Autoencoder (BiRNNAE) (Marchi et al. 2015) is a neural network designed for unsupervised learning from sequential data. It utilizes bidirectional RNNs like LSTM or GRU in both the encoder and decoder parts. While traditional RNNs only consider information in one direction, bidirectional RNNs incorporate knowledge from both forward and backward directions, improving their grasp of temporal relationships. The BiRNN-AE aims to minimize the squared reconstruction error between the original input sequence and the generated sequence during training. To represent the input data efficiently, it combines the final hidden states from all encoder layers. This compact representation can be valuable for various downstream tasks involving sequential data. The loss function for BiRNN-AE is the Mean Squared Error (MSE) loss, which can be mathematically expressed as:

$$\begin{aligned} {L_{\text {BiRNNAE}}(X,X')= \min \frac{1}{T} \sum _{t=1}^T\left( \Vert X_{t} - X'_{t}\Vert _F^2 \right) } \end{aligned}$$
(28)

where T is the sequence length, \(X_t\) represents the input at time step t, and \(X'_t\) represents the reconstructed output at time step t.
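
The bidirectional encoding itself is compact in PyTorch; the sketch below, assuming LSTM units and illustrative sizes, concatenates the final forward and backward hidden states into the compact code described above:

```python
import torch
import torch.nn as nn

enc = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
x = torch.randn(4, 20, 8)                # (batch, time, features)
_, (h, _) = enc(x)                       # h: (2, batch, 16), forward and backward states
code = torch.cat([h[0], h[1]], dim=1)    # (batch, 32) compact sequence representation
```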

3.6 Semi-supervised autoencoder

The Semi-supervised Autoencoder (SSAE) is an autoencoder model that utilizes both labeled and unlabeled data to enhance feature learning, especially in scenarios with limited labeled data. The primary objective of the SSAE is to leverage the available labeled data to facilitate the extraction of crucial latent features, which can subsequently be applied to tasks such as clustering or classification (Yang et al. 2022). This approach proves highly advantageous when dealing with a scarcity of labeled data, as it enables the exploitation of abundant unlabeled data, a common occurrence in real-world applications. In the following sections, we explain three semi-supervised autoencoder methods.

3.6.1 Semi-supervised variational autoencoder

Semi-supervised Variational Autoencoder (SSVAE) (Xu et al. 2017) is a category of generative models employed in semi-supervised learning scenarios. In SSVAE, the encoder responsible for generating the latent variable, denoted as z, is defined as \(q_{\phi }(z\vert x, y)\). This implies that the latent variable z is parameterized by both input data x and label y. The decoder, on the other hand, generates samples from the distribution \(p_{\theta }(x\vert y, z)\). The label predictive distribution \(q_{\phi }(y\vert x)\) is determined by a classification network. Notably, the label y is also considered a latent variable and plays a role in generating a sample x in conjunction with z. The loss function for SSVAE is mathematically expressed as follows:

$$\begin{aligned} \begin{aligned} L_{\text {SSVAE}} =&-\mathbb {E}_{q_{\phi }(z\vert x,y)}[\log p_{\theta }(x\vert y, z)] \\&- \log p_{\theta }(y) + \text {KL}(q_{\phi }(z\vert x, y) || p(z)) \end{aligned} \end{aligned}$$
(29)

in which the first term represents the expectation of the conditional log-likelihood of the latent variable z, the second term denotes the log-likelihood associated with y, and the third term quantifies the Kullback–Leibler divergence between the prior distribution p(z) and the posterior distribution \(q_{\phi }(z\vert x, y)\).

3.6.2 Disentangled variational autoencoder

Disentangled Variational Autoencoder (DVAE) (Higgins et al. 2016) is a sophisticated generative model designed to untangle complex data representations. By incorporating specific graphical model structures and distinct encoding factors, it can effectively separate and capture meaningful information. This model leverages neural networks within a graphical framework to capture relationships among observed and unobserved variables. To optimize its performance, it employs a conditional probability factorization, \(q(y, z \vert x)\), which is different from traditional approaches. This change requires advanced variational inference methods. In essence, Disentangled VAEs are adept at modeling intricate data patterns, making them valuable for various machine learning tasks. Mathematically, they use a loss function expressed as:

$$\begin{aligned} \begin{aligned} \mathbb {E}_{q(y, z \vert x)} \left( \log p(x \vert y, z) + \log p(y) + \log p(z) - \log q(y \vert x, z) - \log q(z \vert x) \right) \end{aligned} \end{aligned}$$
(30)

In simpler terms, this loss function guides the model to generate data resembling real-world data while considering the relationships between observed and latent variables.

3.6.3 Label and sparse regularized autoencoder

Label and Sparse Regularized Autoencoder (LSRAE) (Chai et al. 2019) is a novel approach that combines label and sparse regularizations with autoencoders to create a semi-supervised learning method. This method effectively leverages the strengths of both unsupervised and supervised learning processes. On one hand, sparse regularization selectively activates a subset of neurons, enhancing the extraction of localized and informative features. This unsupervised learning process helps uncover underlying data concepts, improving generalization. On the other hand, label regularization enforces the extraction of features aligned with category rules, leading to improved categorization accuracy. The objective function of LSRAE is defined as follows:

$$\begin{aligned} L_{\text {LSRAE}}(X,X') = \min \left( \Vert X - X'\Vert _F^2 + {KL}(p \parallel q) + \sum _{i=1}^d \sum _{j=1}^l (W_{ij})^2 + \sum _{i=1}^n \left\| L - T \right\| \right) \end{aligned}$$
(31)

where the first term ensures precise data reconstruction and the second term promotes sparsity within the hidden layer, facilitating efficient feature extraction. The third term acts as a safeguard against overfitting by penalizing excessive weights. Lastly, the fourth term enhances classification accuracy by quantifying the label error. Here, L denotes the actual label, and T represents the desired label.

3.7 Graph autoencoder

Graph Autoencoder (GAE) (Pan et al. 2018) is a powerful method for reducing the dimensionality of graph data, enhancing efficiency in graph analytics. It takes a graph as input and outputs a condensed vector representation that captures its essential features. Within the GAE, the encoder converts the input graph into a lower-dimensional vector, which the decoder uses to recreate the original graph. The model aims to minimize the dissimilarity between the input and output graphs while capturing essential graph features. The loss function for the GAE is defined as:

$$\begin{aligned} {L_{\text {GAE}}(X,X') = \min \left( \Vert X - X'\Vert _F^2 \right) } \end{aligned}$$
(32)

where \(X'\) is computed from the inner product of the hidden representation \(Z\) and its transpose \(Z^T\) using the logistic sigmoid function, \(X' = \sigma (ZZ^T)\). The representation \(Z = GCN(F,X)\) is obtained by applying a Graph Convolutional Network (GCN) to the node feature matrix \(F\) and the input graph data \(X\).
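
A self-contained PyTorch sketch of this pipeline on random data is shown below, with a hand-rolled one-layer GCN for clarity (a practical implementation would typically use a library such as PyTorch Geometric; all sizes are illustrative):

```python
import torch
import torch.nn as nn

n_nodes, n_feats, latent = 10, 5, 4
A = (torch.rand(n_nodes, n_nodes) > 0.7).float()   # adjacency matrix (the X above)
A = ((A + A.t()) > 0).float()                      # make the graph undirected

F_feat = torch.rand(n_nodes, n_feats)              # node feature matrix F
W = nn.Parameter(torch.randn(n_feats, latent) * 0.1)

A_hat = A + torch.eye(n_nodes)                     # add self-loops
D_inv_sqrt = torch.diag(A_hat.sum(1).pow(-0.5))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt           # symmetric normalization

Z = torch.relu(A_norm @ F_feat @ W)                # Z = GCN(F, X)
A_rec = torch.sigmoid(Z @ Z.t())                   # X' = sigma(Z Z^T)
loss = ((A - A_rec) ** 2).sum()                    # reconstruction loss of Eq. (32)
loss.backward()                                    # gradients flow back to W
```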

3.7.1 Variational graph autoencoder

Variational Graph Autoencoder (VGAE) (Kipf and Welling 2016) is a framework for learning interpretable latent representations of graph-structured data. It employs a probabilistic approach to encode graph information effectively. VGAE consists of two essential components: an encoder and a decoder. The encoder utilizes a Graph Convolution Network (GCN) to transform graph nodes into a lower-dimensional latent space. It generates latent variables \(z_i\) for each node by sampling from Gaussian distributions. These latent variables capture crucial structural information of the graph. The decoder functions as a generative model, aiming to reconstruct the original graph structure using the latent variables \(z_i\). It estimates the likelihood of connections (edges) between nodes based on their corresponding latent vectors. The VGAE loss function combines a reconstruction term and a regularization term to guide the learning process effectively:

$$\begin{aligned} {L_{\text {VGAE}} = -E_{q(Z\vert F,X)} [\log p(X\vert Z)] + \text {KL}(q(Z\vert F,X)\vert \vert p(Z))} \end{aligned}$$
(33)

where \(q(Z\vert F,X)\) represents the encoding distribution, \(p(X\vert Z)\) models the likelihood of the adjacency matrix given the latent variables, and \(\text {KL}(q(Z\vert F,X)\vert \vert p(Z))\) quantifies the divergence between the encoding distribution and the prior distribution governing the latent variables Z.

3.7.2 Adversarial graph autoencoder

Adversarial Graph Autoencoder (AGAE) (Pan et al. 2018) leverages adversarial training to acquire a lower-dimensional representation of the input graph. It employs an encoder to map graph nodes to this lower-dimensional space and a decoder to reconstruct the original graph. AGAE integrates an adversarial component, akin to a discriminator, to ensure the learned embeddings preserve the graph structure. This unsupervised model combines autoencoder-based reconstruction with adversarial training to generate high-quality graph representations. The AGAE loss function is defined as follows:

$$\begin{aligned} {L_{\text {AGAE}} = \mathbb {E}_{Z \sim p_z} [\log D(Z)] + \mathbb {E}_X [\log (1 - D(G(F,X)))]} \end{aligned}$$
(33)

where \(G(\cdot )\) represents the generator and \(D(\cdot )\) the discriminator. The discriminator’s role is to distinguish between embeddings drawn from the prior distribution \(p_z\) and the embeddings \(G(F,X)\) produced by the generator from the input graph.
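
The adversarial term in Eq. (33) can be sketched as the following discriminator objective; the score names are illustrative assumptions.

```python
import torch

def agae_discriminator_objective(d_prior, d_encoded):
    """Sketch of Eq. (33): d_prior holds D(Z) for Z drawn from the prior p_z,
    d_encoded holds D(G(F, X)) for embeddings produced by the encoder."""
    eps = 1e-8  # numerical safety inside the logarithms
    return (torch.log(d_prior + eps).mean()
            + torch.log(1.0 - d_encoded + eps).mean())
```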

3.7.3 Graph attention autoencoder

Graph Attentional Autoencoder (GAAE) (Salehi and Davulcu 2019) is a variant of graph autoencoders that combines Graph Attention Network (GAT) with GAE. It employs attention mechanisms to weigh the importance of neighboring nodes and edges during the reconstruction process. In essence, GAAE aims to learn a low-dimensional representation of a graph while preserving its structural information using attention mechanisms. The GAAE loss function is defined as follows:

$$\begin{aligned} {L_{\text {GAAE}} = \min \left( \Vert X - \text {Sigmoid}(ZZ^T)\Vert _F^2 \right) } \end{aligned}$$
(34)

in which \(Z\) stacks the hidden representations of the graph nodes. The representation \(Z_i^{(l)}\) of node \(v_i\) at layer \(l\) is computed as:

$$\begin{aligned} {Z_i^{(l)} = \sigma \left( \sum _{j \in N_i} a_{ij} W^{(l-1)} Z_j^{(l-1)} \right) } \end{aligned}$$
(35)

where \(N_i\) denotes the set of neighbors of node \(v_i\), and \(W^{(l-1)}\) represents the learnable parameter matrix. The attention coefficient \(a_{ij}\) is computed using the following formula:

$$\begin{aligned} {a_{ij} = \frac{\exp \left( \delta \left( M_{ij}\, \textbf{a}^T [W\textbf{x}_{i} \Vert W\textbf{x}_{j}]\right) \right) }{\sum _{r \in N_i} \exp \left( \delta \left( M_{ir}\, \textbf{a}^T [W\textbf{x}_{i} \Vert W\textbf{x}_{r}]\right) \right) }} \end{aligned}$$
(36)

where M represents topological weights, and \(\delta\) is the LeakyReLU activation function.
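
A dense-graph sketch of Eq. (36) is given below; the shapes and the use of a topological-weight matrix M with zeros for non-neighbors are assumptions for illustration (since \(M_{ij} \ge 0\), scaling by \(M_{ij}\) before or after the LeakyReLU \(\delta\) is equivalent).

```python
import torch
import torch.nn.functional as F

def attention_coefficients(h, W, a, M):
    """Sketch of Eq. (36): h is (n, f) node features, W an (f, f') projection,
    a a (2f',) attention vector, M an (n, n) topological-weight matrix with
    M_ij = 0 where v_j is not a neighbor of v_i. Shapes are assumptions."""
    wh = h @ W                                              # W x_i for every node
    n = wh.size(0)
    pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),    # [W x_i || W x_j]
                       wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
    e = M * F.leaky_relu(pairs @ a)                         # delta(M_ij a^T [...])
    e = e.masked_fill(M == 0, float("-inf"))                # restrict to r in N_i
    return torch.softmax(e, dim=1)                          # a_ij; each row sums to 1
```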

3.8 Masked autoencoders

Masked AE (MAE) is a variant of the autoencoder used for sequence modeling, particularly in vision and NLP. It operates by taking a sequence of data and randomly masking or hiding some of the elements. The model’s task is to predict the masked or missing elements based on the context provided by the unmasked portions. This training approach enables MAEs to generate coherent and contextually appropriate text or images, making them valuable for tasks like text completion (Zhang et al. 2022), text generation (Zhang et al. 2023), language modeling, image captioning (Alzu’bi et al. 2021), and data augmentation (Xu et al. 2022).
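
The masking step at the heart of this family of models can be sketched as follows; the (batch, length, dim) token layout and the 75% masking ratio are illustrative assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Sketch of MAE-style random masking: keep a random subset of tokens
    and return a mask plus the permutation needed to restore order."""
    b, l, d = tokens.shape
    keep = int(l * (1 - mask_ratio))
    shuffle = torch.rand(b, l).argsort(dim=1)       # a random permutation per sample
    restore = shuffle.argsort(dim=1)                # its inverse, for un-shuffling later
    visible = torch.gather(tokens, 1,
                           shuffle[:, :keep, None].expand(-1, -1, d))
    mask = torch.ones(b, l)
    mask.scatter_(1, shuffle[:, :keep], 0.0)        # 0 = visible, 1 = masked
    return visible, mask, restore
```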

3.8.1 Graph masked autoencoder

Graph Masked Autoencoder (GMAE) (Hou et al. 2022) is a simplified and cost-effective approach for self-supervised graph representation learning. Unlike most GAEs, which focus on reconstructing graph structures, GMAE’s core emphasis is on feature reconstruction through masking. Additionally, GMAE departs from using MSE, opting for the cosine error, which is beneficial when feature magnitudes vary, as is common for graph node attributes. The primary objective of GMAE is to reconstruct the masked features of nodes \(V' \subset V\), given the partially observed node signals. Formally, the GMAE loss function, averaged over all masked nodes, is as follows:

$$\begin{aligned} L_{\text {GMAE}} = \min \frac{1}{|V'|}{\sum _{{v}_i \in V'} \left( 1 - \frac{{x}_i^T {z}_i}{\Vert {x}_i\Vert \cdot \Vert {z}_i\Vert }\right) ^\gamma }, \quad \gamma \ge 1 \end{aligned}$$
(37)
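
Eq. (37), the scaled cosine error, reduces to a few lines; the row-wise tensor layout is an assumption.

```python
import torch.nn.functional as F

def scaled_cosine_error(x, z, gamma=2.0):
    """Sketch of Eq. (37): x holds the original features of the masked nodes V',
    z their reconstructions; gamma >= 1 down-weights easy, well-reconstructed nodes."""
    cos = F.cosine_similarity(x, z, dim=1)      # x_i^T z_i / (||x_i|| ||z_i||)
    return ((1.0 - cos) ** gamma).mean()        # averaged over the masked nodes
```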

3.8.2 Contrastive masked autoencoder

Contrastive Masked Autoencoders (CMAE) (Huang et al. 2022) is a self-supervised pre-training method designed to enhance the learning of comprehensive and versatile vision representations. CMAE comprises two distinct branches: the online branch, characterized by an asymmetric encoder-decoder configuration, and the target branch, featuring a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images with positional embeddings added. The loss combines this reconstruction with a contrastive term that contrasts the cosine similarity \(\rho ^+\) of the positive pair (\({y}_s^p\), \({z}_t^p\)) against the similarities \(\rho _j^-\) of negative pairs. The final objective function is as follows:

$$\begin{aligned} L_{\text {CMAE}}= \min \left( \Vert {Y}_m - {Y}_m'\Vert _F^2 - \lambda \log \frac{\exp (\rho ^{+}/\tau )}{\exp (\rho ^{+}/\tau ) + \sum _{j=1}^{K} \exp (\rho _j^-/\tau )} \right) \end{aligned}$$
(38)
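
The contrastive term of Eq. (38) follows the InfoNCE pattern; the sketch below assumes batched positive pairs and K negatives per sample, with illustrative names.

```python
import torch
import torch.nn.functional as F

def contrastive_term(y_s, z_t, negatives, tau=0.07):
    """Sketch of the InfoNCE-style term in Eq. (38): y_s/z_t are the matching
    online/target projections (positive pair); negatives is (batch, K, dim)."""
    pos = F.cosine_similarity(y_s, z_t, dim=-1) / tau                     # rho^+ / tau
    neg = F.cosine_similarity(y_s.unsqueeze(1), negatives, dim=-1) / tau  # rho_j^- / tau
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    return -torch.log_softmax(logits, dim=1)[:, 0].mean()  # -log softmax of the positive
```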

3.8.3 Self-distillated masked autoencoder

Self-Distilled Masked AutoEncoder (SDMAE) (Chen et al. 2022) is composed of two branches: a student branch tasked with reconstructing missing information, and a teacher branch responsible for generating latent representations of masked tokens. In this approach, a student network \(f_\theta\) is trained through gradient descent using \(\hat{x}\) as input, while a teacher network \(f_\phi\) provides the targets. Building on the MAE method, a value normalization function is applied to the teacher outputs, denoted \(\overline{f_\phi (x_i)}\); it normalizes feature values within each patch using their mean and standard deviation. The optimization objective then aligns the normalized teacher features with the output features of the student decoder, using feature cosine similarity, as follows:

$$\begin{aligned} L_{\text {SDMAE}}= -\log q_{\psi }(\hat{x}\vert \tilde{x}) \approx -\frac{\sum _{i=1}^{n} m^i \, \overline{f_\phi (x_i)}\, f_\theta (\hat{x})}{\sqrt{\sum _{i=1}^{n} m^i (\overline{f_\phi (x_i)})^2} \sqrt{\sum _{i=1}^{n} m^i (f_\theta (\hat{x}))^2}} \end{aligned}$$
(39)

Table 4 presents a comprehensive summary of different autoencoder methods, offering insights into the specific improvement each method introduces and the loss function it employs for optimization.

Table 4 Various autoencoder methods including details on their respective improvements and utilized loss functions

4 Applications of autoencoders

AEs have been widely used in various domains, including computer vision, natural language processing, complex network analysis, recommender systems, anomaly detection, speech recognition, and more. Different types of autoencoder architectures have been proposed to address specific challenges and improve performance in these domains. For example, convolutional autoencoders are commonly used in image processing tasks, while recurrent autoencoders are well-suited for sequential data processing. In addition, variational autoencoders have been developed for generating new data samples and improving model generalization. Although each architecture has its own advantages and limitations, it is important to consider the specific requirements of the application domain when selecting an appropriate architecture. Figure 5 provides an overview of the applications of autoencoders in various domains, which can be used as a starting point for selecting an appropriate architecture. However, further research is needed to investigate which architectures are best suited to which application categories and which are most popular in specific domains.

Fig. 5 An overview of the applications of autoencoders in various domains

4.1 Machine vision

Machine vision utilizes computer algorithms and software to analyze and interpret images or video data, aiming to enable machines to understand and interact with the visual world (Jain et al. 1995). AEs play a vital role in various machine vision applications by learning to extract meaningful image features and reducing data dimensionality. These applications encompass tasks such as image classification (Vincent et al. 2010), image clustering (Guo et al. 2017), image segmentation (Myronenko 2019), image inpainting (Bertalmio et al. 2000), image generation (Vahdat and Kautz 2020), object detection (Liang et al. 2018), and 3D shape analysis (Todd 2004).

AEs are instrumental in image classification. Methods like the semi-supervised stacked distance autoencoder (Hou et al. 2020) enhance feature representation by incorporating semi-supervised learning, utilizing both labeled and unlabeled data to learn inter-data-point distances. Deep Convolutional Autoencoders (DCAE) aid in semi-supervised classification, as seen in Geng et al. (2015), where they pre-train on unlabeled Synthetic Aperture Radar (SAR) images and fine-tune using labeled data for high-resolution SAR image classification.

AEs are also valuable in image clustering, where they learn compressed image representations for grouping similar images in the latent space. This technique involves training a clustering algorithm like K-means on the latent space, as described in Song et al. (2013) and Yang et al. (2017). Additionally, AEs can be used for unsupervised image clustering, making them suitable for scenarios with limited labeled data.

AEs are instrumental in image segmentation, with a wide array of applications that enhance the precision and efficiency of this critical computer vision task. By learning meaningful feature representations from image data, AEs provide a valuable foundation for distinguishing objects and boundaries in images. Their capability for dimensionality reduction streamlines the processing of high-resolution images, making segmentation algorithms computationally more tractable (Zhang et al. 2019). AEs also excel in noise reduction, eliminating unwanted artifacts from images, which is pivotal for accurate segmentation (Tripathi 2021). They are integral in semantic segmentation (Ohgushi et al. 2020), where they classify each pixel in an image, and instance segmentation (Lin et al. 2020), distinguishing individual object instances. Furthermore, AEs contribute to medical image segmentation (Ma et al. 2022), aiding in the precise identification of structures and anomalies in healthcare images. Overall, AEs substantially elevate the accuracy and efficiency of image segmentation tasks, encompassing a range of applications that extend from object recognition to medical diagnosis.

AEs find significant applications in the domain of image inpainting, a process of reconstructing missing or corrupted parts of an image. They excel at capturing complex patterns and textures within images, making them invaluable for this task. AEs, particularly VAEs and GANs, offer high-quality inpainting results by learning to generate realistic and coherent content to fill in the gaps (Tian et al. 2023; Han and Wang 2021). They effectively model the underlying structures and features of images, ensuring that the inpainted regions seamlessly blend with the surrounding content.

AEs find versatile applications in image generation tasks, contributing to the creation of high-quality and diverse visual content. They serve as a foundational component in generative models, VAEs and GANs, enabling the synthesis of realistic and novel images (Huang and Jafari 2023). AEs are essential in encoding and decoding operations, effectively generating images with specific features, styles, and content (Xu et al. 2019). They also play a vital role in style transfer, where they transform images to adopt the artistic characteristics of other images or styles (Kim et al. 2021).

AEs play a role in object detection by extracting valuable features from images or video frames, improving detection accuracy. Convolutional AEs are used to learn compressed image representations that enhance the performance of object detection algorithms, such as Region-based Convolutional Neural Networks (R-CNN) (Ding et al. 2019). VAEs further enhance object detection accuracy, as seen in the integration of a VAE with You Only Look Once (YOLO) (Redmon et al. 2016).

In the domain of 3D shape analysis, AEs learn compressed representations for tasks like shape generation, completion, and retrieval. Achieving a disentangled latent representation that separates various factors of variation is a challenge. Recent research introduces methods like Split-AE (Saha et al. 2022) and 3D Shape Variational Autoencoder Latent Disentanglement (Foti et al. 2022), addressing this challenge. Other approaches employ deep learning features for 3D shape retrieval by projecting 3D shapes into 2D space and utilizing AEs for feature learning (Zhu et al. 2016). Additionally, architectures like point-cloud AEs combined with VAEs are explored to partition the latent space and enhance 3D shape analysis (Aumentado-Armstrong et al. 2019).

While AEs offer valuable capabilities in various machine vision applications, their effectiveness often depends on the specific task and dataset characteristics, and they may be complemented by specialized models in certain scenarios.

4.2 NLP

NLP is a field that explores how computers can understand and work with human language in speech or text form to perform useful tasks (Chowdhary and Chowdhary 2020). This area mainly concentrates on methods for handling text data, including tasks like categorizing text (text classification) (Kowsari et al. 2019), grouping similar texts together (text clustering) (Aggarwal and Zhai 2012), generating new text (text generation) (McKeown 1992), and assessing the sentiment expressed in text (sentiment analysis) (Medhat et al. 2014). To tackle the complexities of working with textual data, researchers have developed advanced models, often incorporating AEs. These models have proven effective in addressing the challenges associated with processing text data (Li et al. 2023).

AEs play a versatile role in text classification tasks, offering feature learning to capture crucial patterns in text data (Guo et al. 2023; Ye et al. 2022), dimensionality reduction for efficient processing of high-dimensional text features (Le et al. 2023; Che et al. 2020), noise reduction to clean and enhance noisy text (García-Mendoza et al. 2022; Che et al. 2020), and semi-supervised learning for improved classification using limited labeled data (Wu et al. 2019; Xu et al. 2017). They also excel in topic modeling by uncovering underlying themes within text documents (Paul et al. 2023; Smatana and Butka 2019), aid in anomaly detection to identify unusual patterns (Gorokhov et al. 2023; Bursic et al. 2019), and enable coherent text generation (Semeniuta et al. 2017; Zhao et al. 2021). Their adaptability and versatility make them indispensable tools in NLP and text analysis, enhancing various aspects of text classification.

Another application of AEs in NLP is text clustering, where they have been applied to organize text documents into meaningful groups. One approach utilizes stacked AEs, combining them with k-means clustering to effectively group text documents into meaningful clusters (Hosseini and Varzaneh 2022). In Deep Embedded Clustering (DEC), AEs play a pivotal role by initializing feature representations of data points and serving as the foundation for similarity computations during the clustering process. The embeddings learned by AEs are jointly optimized with cluster assignments, thereby enhancing the overall quality of clustering results (Xie et al. 2016; Daneshfar et al. 2023). AEs also provide a solution to the challenges of short text clustering. They address the sparsity problem in short text representations by employing low-dimensional continuous representations or embeddings like Smooth Inverse Frequency (SIF) embeddings. Here, the encoder maps the input short texts to a lower-dimensional continuous representation, and the decoder strives to reconstruct the input from this representation. AEs are used to encode and reconstruct these SIF embeddings, resulting in improved short text clustering quality (Hadifar et al. 2019).
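
The encode-then-cluster recipe described above can be sketched in a few lines; the encoder interface, input layout, and cluster count are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans

def cluster_documents(encoder, doc_vectors, n_clusters=10):
    """Sketch of AE-based text clustering: embed documents with a trained
    encoder, then run k-means in the latent space."""
    with torch.no_grad():
        z = encoder(doc_vectors).cpu().numpy()      # latent representations
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)
```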

4.3 Complex network

Autoencoders have emerged as valuable tools in complex network analysis, playing a pivotal role in transforming and enhancing network data for various tasks, including network embedding (Cui et al. 2018), deep clustering (Berahmand et al. 2023), and link prediction (Martínez et al. 2016). These applications harness the capability of autoencoders to capture complex, non-linear relationships within network data, enabling more effective and insightful analyses.

Network embedding involves learning compact representations of nodes and edges in a network. Autoencoders excel in this task by seeking optimal non-linear functions to preserve intricate graph structures. For instance, the Structural Deep Network Embedding (SDNE) method (Wang et al. 2016) employs a deep autoencoder approach to address challenges such as high non-linearity, structure preservation, and sparsity. It utilizes multiple non-linear layers to preserve neighbor structures of nodes, enhancing the depth of representation learning. Another method, DNGR (Cao et al. 2016), captures both the weighted graph structure and nodes’ non-linear characteristics by employing a random surfing model inspired by PageRank. This approach constructs node representations through a weighted transition probability matrix and employs stacked denoising autoencoders for latent representation learning. Additionally, the adversarial framework ARGA (Pan et al. 2018) aims to balance graph structure reconstruction and enforcing latent code adherence to a prior distribution, producing robust graph representations.

Deep clustering focuses on dividing a network into meaningful clusters of nodes with similar attributes or behaviors. The Marginalized Graph Autoencoder (MGAE) augments autoencoder-based representation learning with GCN to achieve deep node representations (Wang et al. 2017). Fan et al. (2020) introduce the One2Multi graph autoencoder, which learns node embeddings by reconstructing multiple graph views using one informative graph view and content data. This approach effectively captures shared feature representations and optimizes cluster label assignments and embeddings through self-training and autoencoder-based reconstruction. In contrast, the N2D method (McConville et al. 2021) simplifies deep clustering by replacing the clustering network with an alternative framework, reducing the complexity of typical deep clustering algorithms.

Link prediction aims to predict missing or future connections in a network based on observed data. In this context, the Heterogeneous Hypergraph Variational Autoencoder (HeteHG-VAE) transforms Heterogeneous Information Networks (HINs) into heterogeneous hypergraphs, capturing both high-order semantics and complex relationships while preserving pairwise topology (Fan et al. 2021). Bayesian deep generative frameworks are used to learn deep latent representations, improving link prediction in HINs. Another method (Salha et al. 2019) inspired by Newtonian gravity extends the graph autoencoder and VAE frameworks to address link prediction in directed graphs, effectively reconstructing directed graphs from node embeddings. Lastly, the Multi-Scale Variational Graph Autoencoder (MSVGAE) introduces a novel graph embedding framework that leverages graph attribute information through self-supervised learning (Guo et al. 2022).

In conclusion, autoencoders are versatile tools for intricate network analysis, contributing significantly to tasks such as network embedding, deep clustering, and link prediction by capturing complex patterns, enhancing representations, and enabling precise predictions.

4.4 Recommender system

Autoencoders find valuable applications in recommendation systems, which aim to suggest items to users based on their historical behavior or preferences. Recommender systems play a pivotal role in various domains, including e-commerce, social media, and online content platforms, offering personalized recommendations to users (Zhang et al. 2019). However, traditional recommender systems grapple with the challenges posed by the immense volume, complexity, and dynamic nature of information (Zhang et al. 2020).

The concept behind autoencoder-based recommender systems involves using AEs to acquire a lower-dimensional representation of both items and users. This representation can subsequently predict a user’s preferences for items they haven’t yet interacted with. Autoencoder-based recommender systems fall into two categories: pure autoencoder models and integrated autoencoder models, depending on the model architecture employed (Zhang et al. 2020).

In pure autoencoder models, the autoencoder serves as the sole architecture for recommendation. These models rely exclusively on user-item interaction data and/or item features to learn a compressed representation of the data, enabling personalized recommendations. Examples of pure autoencoder models include the Collaborative Denoising Autoencoder (CDAE) (Wu et al. 2016) and Deep Content-based Autoencoder (DCAE) (Van den Oord et al. 2013). CDAE is tailored for collaborative filtering data, where user-item interactions form a sparse matrix. It learns low-dimensional representations of users and items by reconstructing missing entries in the matrix. In contrast, DCAE handles content-based data, representing items as feature vectors. This model learns low-dimensional representations of items by reconstructing the original feature vectors (Wang et al. 2015). Additional examples include Collaborative Filtering Neural Network (CFN) (Strub et al. 2016, 2015), Hybrid Collaborative Recommendation via Semi-Autoencoder (HCRSAE) (Zhang et al. 2017), and Imputation-boosted Denoising Autoencoder (IDAE) (Lee and Lee 2017). Each model has its specific strengths and limitations, rendering them suitable for distinct recommendation scenarios.

In integrated autoencoder models, the autoencoder collaborates with other recommendation models, such as matrix factorization or neural network-based models, to enhance recommendation accuracy. These models use the autoencoder to learn a compressed representation of the data, which is then integrated with other models to generate recommendations (Strub et al. 2016). Examples of integrated autoencoder models include the Hybrid Collaborative Content-based Autoencoder (HCCAE) (Zhang et al. 2017), Variational Autoencoders for Collaborative Filtering (VAE-CFs) (Liang et al. 2018), and Neural Collaborative Autoencoder (NCAE) (He et al. 2017). HCCAE combines the learned representations with other recommendation models, while NCAE utilizes a neural network to generate recommendations directly from the learned representations. These models leverage additional information such as content features, social relationships, or visual data to enhance their recommendations. Each model possesses unique characteristics and objectives, making them suitable for addressing various challenges like cold start problems, sequential data, semantic information, or visual styles.

4.5 Anomaly detection

Because AEs can learn complex patterns in data and detect anomalies that are not easily identifiable, they have been widely used in the field of anomaly detection (Pang et al. 2021). An anomaly detection model can be used to detect fraudulent transactions or to handle highly imbalanced supervised tasks (Chandola et al. 2009). AEs can be used in supervised (Alsadhan 2023), unsupervised (Lopes et al. 2022), and semi-supervised (Akcay et al. 2018; Ruff et al. 2019) anomaly detection tasks.

In supervised anomaly detection, AEs are trained on both normal and anomalous data. The AE is first trained on normal data to learn the underlying patterns and features of normal data. Then, the AE is fine-tuned on the combined normal and anomalous data to capture the difference between normal and anomalous data. During training, the objective is to minimize the reconstruction error between the input and the output of the AE. After training, the reconstruction error of the test data is compared to a threshold. If the reconstruction error is above the threshold, the input data is classified as anomalous (Pang et al. 2021). This approach combines the feature learning capabilities of AEs with the discriminative power of supervised classifiers, enhancing the accuracy of anomaly detection in real-world applications, including fraud detection (Alsadhan 2023; Debener et al. 2023; Fanai and Abbasimehr 2023), network security (Ghorbani and Fakhrahmad 2022; Lopes et al. 2022), and fault detection (Ding et al. 2022; Ying et al. 2023) in industrial processes.

In unsupervised tasks, the idea is to train the AE only on samples from a single class (the majority class), so that the network learns to reconstruct such inputs with low reconstruction loss. When a sample from another class is passed through the network, it yields a comparatively larger reconstruction loss; a threshold on the reconstruction loss (the anomaly score) can then be chosen, and any sample exceeding it is flagged as an anomaly (Sakurada and Yairi 2014). This inherent ability to capture complex data representations without labeled anomalies makes AEs effective in detecting anomalies, whether in cyber-security for identifying network intrusions (Lopes et al. 2022; An et al. 2022; Lewandowski and Paffenroth 2022), in manufacturing for spotting defects (Papananias et al. 2023; Sudo et al. 2021), or in finance for fraud detection (Du et al. 2022; Jiang et al. 2023; Kennedy et al. 2023). The versatility of AEs and their capacity to adapt to diverse data types contribute to their widespread use in unsupervised anomaly detection scenarios, enhancing system security and reliability.
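
This reconstruction-error scheme can be sketched as follows; the model interface and the quantile-based threshold are illustrative assumptions.

```python
import torch

def anomaly_scores(model, x):
    """Per-sample reconstruction error used as the anomaly score;
    `model` is any trained autoencoder returning a reconstruction of x."""
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)

# One assumed way to pick the threshold: a high quantile of the scores
# on held-out normal data; samples scoring above it are flagged.
# threshold = torch.quantile(anomaly_scores(model, x_normal), 0.99)
# flags = anomaly_scores(model, x_test) > threshold
```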

AEs have been employed effectively in semi-supervised anomaly detection by capitalizing on their capacity to learn rich data representations (Zhou et al. 2023). In this context, a portion of the training data is labeled as normal, while the majority remains unlabeled. The AE is trained to reconstruct the normal data accurately, and during this process, it learns to capture the underlying structure and features of the normal class. When presented with new, unlabeled data, the AE endeavors to reconstruct it (Ruff et al. 2019). Anomalies, which deviate significantly from the learned normal patterns, result in high reconstruction errors. By setting a suitable threshold on the reconstruction error, anomalies can be effectively detected. This semi-supervised approach minimizes the need for extensive labeled anomaly data and has proven effective in various domains, including fraud detection (Charitou et al. 2020; DeLise 2023; Dzakiyullah et al. 2021), network security (Dong et al. 2022; Hara and Shiomoto 2020; Hoang and Kim 2022; Thai et al. 2022), and quality control (Cacciarelli et al. 2022; Sae-Ang et al. 2022), where labeled anomalies are often scarce.

4.6 Speech processing

Speech processing is focused on enabling machines to understand and interpret human speech with the ultimate objective of creating systems that facilitate natural and intuitive interaction between humans and machines (Hickok and Poeppel 2007). AEs have found numerous applications in speech processing, especially in speech denoising (Bhangale and Kothandaraman 2022; Tanveer et al. 2023), speech recognition (Kumar et al. 2022; Sayed et al. 2023), speech representation (Alex and Mary 2023; Seki et al. 2023), speech compression (Li et al. 2021; Srikotr 2022), feature representation (Shixin et al. 2022; Tian et al. 2022), and speech emotion recognition (Dutt and Gader 2023; Gao et al. 2023).

Speech denoising is a vital process aimed at eliminating unwanted noise from speech signals (Azarang and Kehtarnavaz 2020). AEs have emerged as a powerful tool for this task, where the objective is to enhance the quality of speech by removing noise (Hosseini et al. 2021). In the denoising AE framework, the model is trained using noisy speech samples, with the noisy speech serving as the input and the corresponding clean speech as the target. Through this training, the AE becomes adept at reconstructing noise-free speech from noisy inputs, enabling it to effectively denoise unseen speech signals. The encoder component of the AE extracts informative features from the noisy speech, while the decoder component reconstructs the clean speech based on these extracted features. Denoising AEs have demonstrated remarkable efficacy in mitigating various types of noise in speech signals, including background noise, reverberation, and distortion.
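
A single training step of such a denoising AE can be sketched as below, pairing noisy inputs with clean targets; the model and optimizer are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, optimizer, noisy, clean):
    """Sketch of one denoising-AE training step: noisy speech frames in,
    the corresponding clean frames as the reconstruction target."""
    optimizer.zero_grad()
    estimate = model(noisy)                 # reconstruct clean speech
    loss = F.mse_loss(estimate, clean)      # reconstruction error
    loss.backward()
    optimizer.step()
    return loss.item()
```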

Speech recognition is the process of converting spoken words into text or commands that a computer can understand and execute (Gaikwad et al. 2010). AEs can be used in speech recognition as a pre-processing step for feature extraction. The AE can learn to encode the raw audio signals into a more compact and meaningful representation of the speech signal, which can then be used as input to a speech recognition model. This can improve the accuracy and efficiency of speech recognition systems, especially in noisy or variable acoustic environments (Sayed et al. 2023; Wubet and Lian 2022). Additionally, AEs can be used for speaker identification, where the AE can learn to distinguish between different speakers based on their speech patterns (Liao et al. 2022; Rituerto-González and Peláez-Moreno 2021). A popular approach uses a CNN as the encoder to extract local features from the audio signal and an RNN as the decoder to capture the temporal dependencies in the speech signal, with the RNN decoder’s output used to transcribe the speech signal (Palaz and Collobert 2015; Rusnac and Grigore 2022).

4.7 Other

Autoencoders have diverse applications in fault diagnosis, intrusion detection, and hyperspectral imaging. They help detect faults in systems, identify network intrusions, and enhance the analysis of hyperspectral data for applications like remote sensing. Different autoencoder versions are tailored to meet specific challenges in these domains.

4.7.1 Fault diagnosis

Fault diagnosis is the process of identifying, isolating, and characterizing faults or anomalies in a system or machine. It involves analyzing the behavior of the system or machine and identifying any deviations from normal or expected behavior. Fault diagnosis is critical in various fields, including manufacturing, automotive, aerospace, and healthcare, as it can help prevent failures, reduce downtime, and improve safety and reliability (Gao et al. 2015). Autoencoders have demonstrated significant potential in fault diagnosis applications. By training an autoencoder on normal data, it can detect deviations from the norm, indicating the presence of a fault or anomaly. To use an autoencoder for fault diagnosis, the initial step is to collect a dataset of normal operating conditions for the system or equipment. This dataset is then employed to train the autoencoder to learn the normal data patterns. Subsequently, it can be applied to new data for fault diagnosis by identifying deviations from these learned patterns (Yang et al. 2022).

One crucial aspect of using autoencoders for fault diagnosis is selecting an appropriate anomaly detection threshold. Typically, this threshold is determined based on the distribution of the reconstruction error for normal data. Any data that produces a reconstruction error exceeding the threshold is flagged as an anomaly (Ma et al. 2018). Autoencoders are effective for fault diagnosis because they can autonomously learn intricate patterns and recognize deviations from those patterns, eliminating the need for explicit feature engineering. This capability makes them well-suited for detecting subtle anomalies that might be challenging to identify using traditional fault diagnosis methods (Lei et al. 2020).
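
A distribution-based threshold of the kind described above can be sketched as follows; the three-sigma rule is an illustrative convention, not a value from the cited work.

```python
import torch

def fault_threshold(errors_normal, k=3.0):
    """Sketch: flag a fault when the reconstruction error exceeds the mean
    plus k standard deviations estimated on normal operating data."""
    return errors_normal.mean() + k * errors_normal.std()
```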

4.7.2 Intrusion detection

The process of intrusion detection involves continuous monitoring of a system or network to identify and respond to instances of malicious activity or breaches of established policies. Its purpose is to detect anomalous behavior or indicators of potential attacks to prevent or mitigate any potential damage (Farahnakian and Heikkonen 2018). Al-Qatf et al. (2018) have proposed a deep autoencoder-based intrusion detection system that utilizes enhanced representative features to enhance intrusion detection accuracy. The autoencoder extracts representative features from network traffic data, which are subsequently employed to train a classification model for intrusion detection. Another technique to improve intrusion detection systems is the use of Stacked Sparse Autoencoders (SSAE). Yan and Han (2018) utilize SSAE, which is trained on a combination of normal and attack traffic to uncover underlying patterns in network traffic data. These extracted features serve as the basis for training a classifier to detect attacks.

Autoencoders can play a significant role in automatic feature extraction for intrusion detection systems. Kunang et al. (2018) propose a method in which an autoencoder is employed to extract relevant features from raw network traffic data. These extracted features are then used as input for a classifier, such as a Support Vector Machine (SVM), to distinguish between normal and malicious traffic. Compared to traditional rule-based or signature-based methods, autoencoders have the potential to enhance the accuracy and efficiency of intrusion detection systems (Ieracitano et al. 2020).
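
The feature-extraction pipeline described above can be sketched in a few lines; the encoder interface and the SVM configuration are illustrative assumptions.

```python
import torch
from sklearn.svm import SVC

def train_intrusion_classifier(encoder, x_train, y_train):
    """Sketch of AE-based feature extraction for intrusion detection:
    encode raw traffic features, then fit an SVM on the latent codes."""
    with torch.no_grad():
        z = encoder(x_train).cpu().numpy()      # extracted features
    return SVC(kernel="rbf").fit(z, y_train)    # normal vs. malicious classifier
```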

4.7.3 Hyperspectral imaging

AEs find wide-ranging applications in hyperspectral image analysis due to their ability to learn concise representations of high-dimensional data. Hyperspectral imaging is a potent technique for capturing detailed spectral information about objects or scenes. It involves multi-dimensional data where each pixel contains a spectrum of reflectance or radiance values across numerous narrow, contiguous spectral bands (Jaiswal et al. 2023).

AEs are employed for various tasks in managing hyperspectral data, including hyperspectral data compression (Minkin et al. 2021), hyperspectral unmixing (Książek et al. 2022), blind hyperspectral unmixing (Palsson et al. 2022), and dimensionality reduction (Zabalza et al. 2016). In data compression, AEs condense hyperspectral data while retaining crucial information, facilitating subsequent analysis and processing. Hyperspectral unmixing entails decomposing a hyperspectral image into its constituent parts, referred to as endmembers. AEs play a pivotal role in reconstructing the spectral profiles of these identified components (endmembers) and determining their proportional mixing amounts (abundances). This is indispensable for enhancing the efficiency of hyperspectral analysis and classification tasks (Su et al. 2019). Blind hyperspectral unmixing involves deconstructing the recorded spectrum of a pixel into a mixture of endmembers while simultaneously discerning the proportions or fractions of these endmembers within the pixel. Training an AE on hyperspectral images results in a lower-dimensional representation of the data, rendering it more manageable for subsequent analysis (Petersson et al. 2016).

5 Autoencoder libraries and practical applications

The development and availability of open-source libraries for various versions of AEs have greatly facilitated research in this field. Three popular libraries that are widely used for building and training autoencoder models are TensorFlow, PyTorch, and Keras. Each of these libraries has its strengths and is preferred by different segments of the machine learning and deep learning community. Table 5 provides a comprehensive overview of the source code for our proposed category of AE variants. Researchers can access these code repositories to implement and test different versions of AEs, and to compare their performance on various tasks. For instance, one could use the available code to train a variational AE for image reconstruction or a graph attention AE for node embedding. These libraries are not only useful for research but also for practical applications, as they enable practitioners to easily deploy pre-trained models on their own datasets.

Table 6 presents a comprehensive overview of various AE models and their diverse applications in machine learning. Each model is associated with specific applications, datasets, methodology, evaluation metrics, and performance results. Notable applications include feature learning, dimensionality reduction, graph-based data representation, generative modeling, anomaly detection, and sequential data analysis. The evaluation metrics vary depending on the application but commonly include error rates, accuracy, precision, recall, F1 score, Area Under the Curve (AUC), and more. These AEs demonstrate their effectiveness in tasks ranging from image classification and sentiment analysis to graph representation learning and acoustic novelty detection, showcasing their versatility in addressing a wide array of machine learning challenges across various domains.

Table 5 AE Models and their corresponding years of publication, programming languages, and code repositories
Table 6 AE Models and their corresponding applications

6 Future directions

Despite in-depth research on autoencoders and their improved algorithms in recent years, the following issues still need to be addressed.

6.1 Semi-supervised and self-supervised learning in autoencoder

Autoencoders, a prominent tool in unsupervised learning, primarily function without the need for labeled data. However, a significant research gap lies in exploring their adaptability to semi-supervised learning paradigms. This entails investigating methodologies for integrating labeled information into the training process, potentially enhancing their performance when only limited labeled data is available. Additionally, another intriguing avenue for exploration is the incorporation of self-supervised learning techniques within autoencoder frameworks. Such an endeavor aims to allow autoencoders to autonomously learn meaningful representations from unlabeled data, reducing their reliance on extensive labeled datasets. Addressing these aspects could significantly expand the applicability and effectiveness of autoencoders across various real-world scenarios with limited labeled data resources.

6.2 Hypergraph autoencoder

Autoencoders have proven effective in preserving the non-linear structure of data due to their deep learning capabilities. However, they face a challenge in preserving higher-order neighbors in complex datasets. While autoencoders can address the former concern, they may not inherently handle the latter. To bridge this gap, integrating hypergraph-based representations of data into the autoencoder framework emerges as a potential solution. By transforming the data into a hypergraph and feeding it as input to the autoencoder, it may be possible to preserve the critical high-order neighbor relationships. This approach holds promise for enhancing the utility of autoencoders in scenarios where preserving intricate data dependencies is crucial, potentially leading to improved performance across various applications.

6.3 Tuning parameter with reinforcement learning

Constructing an autoencoder involves crucial decisions about parameters such as the number of hidden layers and nodes, which significantly influence the model’s final performance. While parameter selection is essential, the process of identifying the most suitable configuration can be challenging. In current research efforts, some studies have explored leveraging reinforcement learning techniques in conjunction with autoencoder construction. This approach aims to optimize autoencoder parameters efficiently, potentially enhancing model performance. The integration of reinforcement learning into parameter tuning represents an evolving research gap that holds promise for automating and improving the autoencoder design process.

6.4 Handling multi-modal and heterogeneous data with autoencoders

Autoencoders are proficient at capturing patterns in data, but scenarios involving multiple data sources or modalities, such as text, images, and numerical features, make data structures considerably more complex. The current challenge lies in effectively handling such multi-modal and heterogeneous datasets. Existing autoencoder models may struggle to efficiently capture and integrate the information present in these intricate datasets. As a result, there is a research gap in developing autoencoder variants or techniques that can adeptly manage multi-modal and heterogeneous data, leading to more comprehensive and valuable data representations. Addressing this gap has the potential to significantly enhance the applicability of autoencoders in various real-world applications.

7 Conclusion

Autoencoders have become a focal point in unsupervised learning due to their remarkable ability to uncover data features and serve as a valuable dimensionality reduction tool. This paper has conducted a thorough examination of autoencoders, covering their fundamental principles and a detailed classification of models based on unique characteristics. We have also explored their use in various areas, from computer vision to natural language processing, highlighting their adaptability. During this study, we have recognized both the advantages and occasional drawbacks of autoencoders. By classifying and summarizing these models based on their unique traits, we have revealed possible directions for future enhancements and innovations. This insight paves the way for further progress in the field.

In summary, autoencoders have an important role in the field of machine learning, and their significance is continuously growing. They have the remarkable ability to find valuable insights in data and create smart results, which can greatly impact various areas. We expect an ongoing journey of progress and important developments in the field of autoencoders, ultimately leading to the creation of even more powerful and intelligent solutions that benefit society as a whole. Autoencoders are positioned to foster innovation and shape the future of machine learning.