Statistical analysis
To conduct the prescreening procedure and obtain the methylation sites with the most differential methylation expression, a preliminary statistical analysis of the CpG methylation data was carried out. First, a hypothesis test was performed to analyze the level of independence between pairs of variables. For this purpose, the correlation coefficient \(\rho\) and the p-value of the correlation matrix were calculated to remove those variables satisfying both p-value \(\le\) \(\alpha\) and \(|\rho |\) \(\ge\) 0.90, where \(\alpha\) is the significance level, set to 0.05 for this application. After that, we performed different hypothesis tests to analyze the discriminatory ability of each variable with respect to the class. Depending on whether or not a variable followed a normal distribution, the test applied was the Student's t-test or the Wilcoxon rank-sum test, respectively. After this statistical analysis, the 27,578 DNA methylation features of the GSE32393 series were reduced to 10,153. These features were the input for the following stage.
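To make the prescreening procedure concrete, the following sketch reproduces the two filters described above with standard SciPy tests. It assumes the beta values are stored in a pandas DataFrame `X` (samples × CpG sites) and the class labels in a binary array `y`; the variable names and the way the correlation filter removes one variable of each redundant pair are illustrative assumptions, not the authors' exact code.

```python
# Sketch of the prescreening stage described above (not the authors' code).
# Assumptions: X is a pandas DataFrame of beta values (samples x CpG sites),
# y is a binary numpy array of class labels, alpha = 0.05, correlation cutoff 0.90.
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05
RHO_THR = 0.90

def correlation_filter(X: pd.DataFrame) -> pd.DataFrame:
    """Remove one variable of every pair that is significantly and strongly correlated."""
    # Didactic pairwise loop; in practice the full correlation matrix would be vectorized.
    to_drop = set()
    cols = list(X.columns)
    for i, ci in enumerate(cols):
        if ci in to_drop:
            continue
        for cj in cols[i + 1:]:
            if cj in to_drop:
                continue
            rho, pval = stats.pearsonr(X[ci], X[cj])
            if pval <= ALPHA and abs(rho) >= RHO_THR:
                to_drop.add(cj)
    return X.drop(columns=list(to_drop))

def class_discrimination_filter(X: pd.DataFrame, y: np.ndarray) -> pd.DataFrame:
    """Keep CpG sites whose values differ significantly between the two classes."""
    keep = []
    for c in X.columns:
        a, b = X.loc[y == 0, c], X.loc[y == 1, c]
        if stats.shapiro(X[c])[1] > ALPHA:            # variable looks normal
            pval = stats.ttest_ind(a, b)[1]           # Student's t-test
        else:
            pval = stats.ranksums(a, b)[1]            # Wilcoxon rank-sum test
        if pval <= ALPHA:
            keep.append(c)
    return X[keep]

# X_reduced = class_discrimination_filter(correlation_filter(X), y)
```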
Dimensionality reduction
To explore well-known unsupervised deep learning algorithms for dimensionality reduction, we tested both the conventional and the variational autoencoder. In this section, we detail the characteristics of both algorithms and their main differences.
Conventional autoencoder
The autoencoder (AE) is one of the most significant algorithms in unsupervised data representation. The objective of this method is to train a mapping function that ensures the minimum reconstruction error between input and output [22]. As can be observed in Fig. 2, the conventional autoencoder architecture is mainly composed of two stages: the encoder and the decoder. The encoder transforms the input data \(\mathbf{X}\) into a latent representation \(\mathbf{Z}\) through a nonlinear mapping function, \(\mathbf{Z} =f_\phi (\mathbf{X} )\), where \(\phi\) denotes the learnable parameters of the encoder. The dimensionality of the latent space \(\mathbf{Z}\) is much smaller than that of the input data to avoid the curse of dimensionality [32]. Since the latent space is a nonlinear combination of the input data with smaller dimensionality, it can represent the most salient features of the data. The decoder stage reconstructs the data from the features embedded in the latent space, \(\mathbf{R} =g_\theta (\mathbf{Z} )\). The reconstructed representation \(\mathbf{R}\) is required to be as similar to \(\mathbf{X}\) as possible. Therefore, given a set of data samples \(\mathbf{X} =\left\{ x_1,\ldots ,x_n \right\}\), where n is the number of available samples, the autoencoder model is optimized by minimizing the following objective:
$$\begin{aligned} \min _{\theta ,\phi }L_{rec}=\min _{\theta ,\phi }\frac{1}{n}\sum _{i=1}^{n} ||x_{i}-g_\theta (f_\phi (x_i))||^2 \end{aligned}$$
(1)
where \(\phi\) and \(\theta\) denote the parameters of the encoder and decoder, respectively.
The autoencoder architecture can vary between a simple multilayer perceptron (MLP), a long short-term memory (LSTM) network or a convolutional neural network (CNN), depending on the use case. When the input data are 1-D and have no temporal structure, both the encoder and decoder are usually constructed with a multilayer perceptron.
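As an illustration, a minimal MLP-based autoencoder of the kind described above can be written in PyTorch as follows. The layer sizes and the latent dimensionality are illustrative assumptions; only the input dimension (the 10,153 prescreened CpG sites) comes from the text.

```python
# Minimal MLP autoencoder sketch in PyTorch. Layer sizes and the latent
# dimensionality are illustrative assumptions; the input dimension corresponds
# to the 10,153 CpG sites kept after prescreening.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=10153, latent_dim=10):
        super().__init__()
        # Encoder f_phi: X -> Z
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 500), nn.ReLU(),
            nn.Linear(500, 100), nn.ReLU(),
            nn.Linear(100, latent_dim),
        )
        # Decoder g_theta: Z -> R
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 100), nn.ReLU(),
            nn.Linear(100, 500), nn.ReLU(),
            nn.Linear(500, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training minimizes the reconstruction error of Eq. (1) (mean squared error).
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

def train_step(x_batch):
    optimizer.zero_grad()
    recon, _ = model(x_batch)
    loss = criterion(recon, x_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```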
Variational autoencoder
The variational autoencoder (VAE) is an unsupervised approach that, like the conventional autoencoder described above, is composed of an encoder-decoder architecture [29]. However, the main difference between a conventional and a variational autoencoder is that the VAE introduces a regularization of the latent space to improve its properties. With a VAE, each input sample is encoded as a multivariate normal distribution p(z|x) around a point in the latent space. In this way, the encoder is optimized to output the mean and covariance matrix of a multivariate normal distribution; see Fig. 3.
The VAE algorithm assumes that there is no correlation between any latent space dimensions and, therefore, that the covariance matrix is diagonal. In this way, the encoder only needs to assign each input sample to a mean vector and a variance vector. In addition, the logarithm of the variance is used as the encoder output, since it can take any real number in the range \((-\infty , \infty )\), matching the natural output range of a neural network, whereas variance values are always positive; see Fig. 4.
In order to provide continuity and completeness to the latent space, it is necessary to regularize both the logarithm of the variance and the mean of the distributions returned by the encoder. This regularization is achieved by matching the encoder output distribution to the standard normal distribution (\(\mu =0\) and \(\sigma =1\)).
After obtaining and optimizing the mean and variance parameters of the latent distributions, it is necessary to sample from the learned distributions to reconstruct the original input data. Samples of the encoder output distribution are obtained as follows:
$$\begin{aligned} z \sim p(z|x), \quad z=\mu + \sigma \cdot \epsilon \end{aligned}$$
(2)
where \(\epsilon\) is randomly sampled from a standard normal distribution and \(\sigma =\exp \left( \frac{\log (\sigma ^2)}{2}\right)\).
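A minimal sketch of this sampling step (the reparameterization trick) is shown below, assuming the encoder outputs a mean vector `mu` and a log-variance vector `log_var`:

```python
# Reparameterization-style sampling of Eq. (2), assuming the encoder outputs
# a mean vector `mu` and a log-variance vector `log_var`.
import torch

def sample_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    sigma = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    epsilon = torch.randn_like(sigma)  # epsilon ~ N(0, I)
    return mu + sigma * epsilon        # z = mu + sigma * epsilon
```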
The loss function minimized in a variational autoencoder is composed of two terms: (1) a reconstruction term that compares the reconstructed data to the original input in order to make the encoding–decoding as effective as possible, and (2) a regularization term in charge of organizing the latent space, as shown in Fig. 4. The regularization term is expressed as the Kullback–Leibler (KL) divergence, which measures the difference between the predicted latent probability distribution of the data and the standard normal distribution in terms of the mean and variance of the two distributions [9]:
$$\begin{aligned} D_{KL}[N(\mu ,\sigma )||N(0,1)]=-\frac{1}{2}\sum (1+\log (\sigma ^2)-\mu ^2-\sigma ^2) \end{aligned}$$
(3)
The Kullback–Leibler divergence is minimized to 0 when \(\mu = 0\) and \(\log (\sigma ^2)=0\) for all dimensions. As these two terms deviate from 0, the variational autoencoder loss increases. The trade-off between the reconstruction error and the KL divergence is controlled by a hyperparameter that must be adjusted for this type of architecture.
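The resulting VAE objective can be sketched as follows; the weight `kl_weight` that trades off the two terms is the hyperparameter mentioned above, and its default value here is an illustrative assumption:

```python
# Sketch of the VAE objective: reconstruction term plus the KL term of Eq. (3).
# `kl_weight` is the balancing hyperparameter mentioned in the text; its value
# and the batch-wise averaging are illustrative assumptions.
import torch
import torch.nn.functional as F

def vae_loss(x, recon, mu, log_var, kl_weight=1.0):
    rec = F.mse_loss(recon, x, reduction="mean")                      # reconstruction term
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())   # KL[N(mu, sigma) || N(0, 1)]
    return rec + kl_weight * kl
```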
Proposed method: deep embedded refined clustering
Once the data dimensionality is reduced, we classify the samples into cancerous and non-cancerous. Reducing the data dimensionality without information about the underlying data distributions weakens the representativeness of the embedded features with respect to the class, and thereby the performance of the subsequent classification worsens. For this reason, we consider that dimensionality reduction and classification should be optimized at the same time. In this context, we propose a deep embedded refined clustering (DERC) approach for classifying the DNA methylation data (see Fig. 5). It is composed of an autoencoder in charge of the dimensionality reduction and a cluster assignment corresponding to the unsupervised classification stage (clustering layer in Fig. 5). This approach is trained end-to-end, optimizing the dimensionality reduction and the unsupervised classification in a single step rather than in two different steps, as do all the algorithms proposed for DNA methylation analysis in the literature.
During the training process, the encoder and decoder weights of the autoencoder, W and \(W'\), respectively, are updated in each iteration in order to refine the latent features of the encoder output Z. The proposed clustering layer (attached to the encoder output) computes the soft-assignment probabilities \(q_{i,j}\) between the embedded points \(z_i\) and the cluster centroids \(\left\{ \mu _j\right\} _{j=1}^k\) every T iterations, where k is the number of cluster centroids. The soft-assignment probabilities \(q_{i,j}\) are obtained with the Student's t-distribution proposed in [32]. Using \(q_{i,j}\), the target probabilities \(p_{i,j}\) are updated (see Algorithm 1). These target probabilities allow the refinement of the cluster centroids by learning from the current high-confidence assignments. To account for the refinement of the latent space carried out by the autoencoder while the samples are assigned to one of the two clusters (cancer and non-cancer), the proposed model is trained end-to-end by minimizing both the reconstruction loss \({L_{\mathrm{{rec}}}}\) and the clustering loss \({L_{\mathrm{{cluster}}}}\):
$$\begin{aligned} L=L_{\mathrm {cluster}}+ \beta L_{\mathrm {rec}} \end{aligned}$$
(4)
where \(\beta\) balances the importance of the losses due to the reconstruction of the data. The term \(L_{\mathrm{{rec}}}\), defined in Eq. (1), is minimized to obtain the maximum similarity between the input and the output data, improving the representation of the latent space. \(L_{\mathrm{cluster}}\) is defined as the Kullback–Leibler (KL) divergence between the soft-assignment and target probabilities, \(q_{i,j}\) and \(p_{i,j}\), respectively:
$$L_{\mathrm {{cluster}}}=\sum _{i} \sum _{j} p_{i,j} {\mathrm{log}} \frac{p_{i,j}}{q_{i,j}}$$
(5)
The clustering term is minimized so that the soft-assignment probabilities \(q_{i,j}\) and the target probabilities \(p_{i,j}\) become as similar as possible. In this way, the centroids are refined and the latent space obtained by the autoencoder is regularized to achieve a correct distinction between breast cancer and non-breast cancer samples. As discussed above, the hyperparameter \(\beta\) balances the importance of the losses due to the data reconstruction. If \(\beta\) is too high, the data reconstruction term predominates and the classification between cancerous and non-cancerous samples worsens. Conversely, if \(\beta\) is too low, the reconstruction loss becomes marginal and the features of the latent space are not optimized correctly; consequently, the latent features end up very different from the input data, decreasing the accuracy of the unsupervised classification. Therefore, \(\beta\) is a hyperparameter that needs to be properly adjusted. Step 2 of Algorithm 1 details the methodology used to optimize the proposed DERC algorithm.
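For clarity, the soft assignments, the target distribution and the combined DERC loss of Eqs. (4) and (5) can be sketched as follows, using the Student's t-distribution assignment of [32]. The tensor shapes and the batch-wise averaging of the reconstruction term are implementation assumptions:

```python
# Sketch of the soft assignments q_ij, the target distribution p_ij and the
# combined DERC loss of Eqs. (4) and (5), following the Student's t-distribution
# assignment of [32]. `centroids` is a (k, latent_dim) tensor of trainable
# cluster centers (k = 2 for cancer vs. non-cancer).
import torch
import torch.nn.functional as F

def soft_assign(z, centroids, alpha=1.0):
    # q_ij: similarity between embedded point z_i and centroid mu_j
    dist2 = torch.sum((z.unsqueeze(1) - centroids.unsqueeze(0)) ** 2, dim=2)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # p_ij: sharpened targets built from the current high-confidence assignments,
    # recomputed every T iterations during training
    weight = q ** 2 / q.sum(dim=0)
    return (weight.t() / weight.sum(dim=1)).t()

def derc_loss(x, recon, q, p, beta=1.0):
    # L = L_cluster + beta * L_rec
    l_cluster = torch.sum(p * torch.log(p / q))        # KL(P || Q), Eq. (5)
    l_rec = F.mse_loss(recon, x, reduction="mean")     # Eq. (1)
    return l_cluster + beta * l_rec
```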
Note that training the proposed method requires a previous initialization of the centroids with the latent features (Step 1 of Algorithm 1). In the experimental section, we present an experiment (Sect. 4.1.1) aimed at determining which of the dimensionality reduction models is optimal for this initialization.
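A common way to perform this kind of initialization in DEC-style models is to run k-means on the latent features produced by the pretrained encoder; the sketch below follows that convention, although whether Step 1 of Algorithm 1 uses exactly this procedure is an assumption here.

```python
# Hypothetical sketch of the centroid initialization (Step 1): in DEC-style
# models the centroids are commonly initialized by running k-means on the
# latent features of the pretrained encoder. Whether Algorithm 1 uses exactly
# this procedure is an assumption.
import torch
from sklearn.cluster import KMeans

def init_centroids(encoder, X, k=2):
    with torch.no_grad():
        z = encoder(torch.as_tensor(X, dtype=torch.float32)).numpy()
    kmeans = KMeans(n_clusters=k, n_init=20).fit(z)
    return torch.tensor(kmeans.cluster_centers_, dtype=torch.float32)
```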