A deep embedded refined clustering approach for breast cancer distinction based on DNA methylation

Epigenetic alterations have an important role in the development of several types of cancer. Epigenetic studies generate a large amount of data, which makes it essential to develop novel models capable of dealing with large-scale data. In this work, we propose a deep embedded refined clustering method for breast cancer differentiation based on DNA methylation. In concrete, the deep learning system presented here uses the levels of CpG island methylation between 0 and 1. The proposed approach is composed of two main stages. The first stage consists in the dimensionality reduction of the methylation data based on an autoencoder. The second stage is a clustering algorithm based on the soft assignment of the latent space provided by the autoencoder. The whole method is optimized through a weighted loss function composed of two terms: reconstruction and classification terms. To the best of the authors’ knowledge, no previous studies have focused on the dimensionality reduction algorithms linked to classification trained end-to-end for DNA methylation analysis. The proposed method achieves an unsupervised clustering accuracy of 0.9927 and an error rate (%) of 0.73 on 137 breast tissue samples. After a second test of the deep-learning-based method using a different methylation database, an accuracy of 0.9343 and an error rate (%) of 6.57 on 45 breast tissue samples are obtained. Based on these results, the proposed algorithm outperforms other state-of-the-art methods evaluated under the same conditions for breast cancer classification based on DNA methylation data.


Introduction
Epigenetic mechanisms are crucial for the normal development and maintenance of tissue-specific gene expression profiles in mammals.Recent advances in the field of cancer epigenetics have shown extensive reprogramming of every component of the epigenetic machinery including DNA methylation, histone modifications, nucleosome positioning and non-coding RNAs [19].In concrete, several studies demonstrate that DNA methylation (DNAm) plays a crucial role in the tumorigenesis process [1,26].
In mammalian cells, DNA methylation is based on the selective addition of a methyl group to the cytosine nucleotide under the action of DNA methyltransferases [23], Fig. 1.Specifically, DNAm takes place in cytosines that precede guanines, known as CpG dinucleotides [5].CpG sites are not randomly distributed throughout the genome but there are CpG-rich areas known as CpG islands often located in the gene promoting regions.CpG islands are usually largely unmethylated in normal cells.The methylation of these CpG sites silences the promoter activity and correlates negatively with the gene expression.The methylation of the promoter regions in some vital genes, such as tumor suppressor genes, and therefore their inactivation, has been firmly established as one of the most common mechanisms for cancer development [3,14].Because the methylation patterns can be observed in the early stages of cancer [19], DNA methylation analysis becomes a powerful tool in the early diagnosis, treatment and prognosis of cancer.
The DNA methylation analysis has experienced a revolution during the last decade, especially due to the adaptation of microarray technology to the study of methylation and the emergence of Next-Generation Sequencing (NGS) [2,16].These technological advances combined with the development of techniques such as reduced representation bisulfite sequencing (RRBS), which is an efficient and high-throughput approach for analyzing the genome-wide methylation profiles, have allowed the DNA methylation analysis at the molecular level [13].This is the reason why, current methylation studies generate a large amount of data.Additionally, since the study of DNA methylation is still a bit expensive, the number of available samples is relatively low.The extremely high dimensions of the methylation data compared to the generally small number of available samples is the main limitation in the development of appropriate methods for DNAm data analysis.This fact makes a dimensionality reduction necessary before implementing any algorithm that identifies the presence of cancer using methylation profiles [25].In this context, Yuvaraj et al. presented different algorithms for dimensionality reduction based on Principal Component Analysis and Fisher Criterion [25].However, the DNA methylation datasets cannot be efficiently described by these dimensionality reduction methods due to its non-Gaussian character.Jazayer et al. used a non-negative matrix factorisation (NMF) for the dimensionality reduction of breast methylation data following by ELM and SVM classifiers for cancer identification.However, with the NMF algorithm, it is not possible to directly transfer the input to a smaller dimensional space than the number of samples because this method transfers the data to an output space with a dimension equal to the minimum of {samples, DNAm dimension}.That is the reason why, in this study, the authors use a columnsplitting method to overcome the curse of the dimensionality problem [11].
Recent advances in the field of artificial intelligence have allowed the development of deep learning algorithms that perform an embedding of CpG methylation states to extract biologically significant lowerdimensional features [12,20,22].Zhongwei et al. presented a stack of Random Boltzmann Machine (RBM) layers with the aim of reducing the dimensionality of a breast DNAm set composed of cancer and non-cancer samples.The proposed model first selected the best 5,000 features based on variance from over 27,000 features and subsequently used four RBM layers to reduce the number of features to 30.After reducing the data dimensionality, they carried out a binary classification of the generated features using unsupervised methods [20].Khwaja et al. proposed a deep autoencoder system for differentiation of several cancer types (breast cancer, lung carcinoma, lymphoblastic leukemia and Urological tumours) based on the DNA methylation states.After a statistical analysis, in which the features providing non-useful information for differentiation between cancer classes are eliminated, the authors used a Deep Belief Network for dimensionality reduction with a posterior supervised classification [12].Titus et al. proposed an unsupervised deep learning framework with variational autoencoders (VAEs) to learn latent representations of the DNA methylation from three independent breast tumour datasets.They demonstrate the feasibility of VAEs to track representative differential methylation patterns among clinical sub-types of breast tumors but they do not perform any classification with the extracted characteristics [22].
Several state-of-the-art methods propose different unsupervised and supervised classification algorithms for cancer identification after performing a dimensionality reduction of the DNA methylation data [11,12,14,20].However, to the best of the author's knowledge, no previous studies have been focused on the development of both tasks simultaneously.Novel deep learning algorithms have emerged optimizing the dimensionality reduction with unsupervised classification at the same time.These methods, called deep clustering algorithms, have outperformed the state-of-the-art results for different tasks as image classification [8,9,24], image segmentation [4], speech separation [10,18] or RNA sequencing [21].Therefore, our hypothesis is that since these algorithms perform well with high-dimensional data, they are likely to perform well for methylation data.
For all of the above, in this work, we proposed a deep embedded refined clustering to distinguish cancer thought DNA methylation data.In concrete, this work is developed using two public databases containing DNAm data from breast tissues with and without cancer.The proposed method is composed of an autoencoder to carry out the dimensionality reduction followed by a soft-assignment algorithm to perform an unsupervised classification.This algorithm is end-to-end trained to accomplish the data classification while optimising the dimensionality reduction.As the main novelty, the method is optimized through a weighted loss function.This loss function is composed of two terms: (1) a reconstruction term in charge of optimizing the latent features provided by the autoencoder algorithm and (2) a clustering term used to improve the classification based on the latent features of the autoencoder.To the best of the authors' knowledge, no previous studies have addressed the distinction of cancer based on DNAm using an end-to-end trained dimensionality reduction and classification method.The proposed method is widely validated and compared to the use of autoencoder and variational autoencoder for dimensionality reduction with a subsequent unsupervised classification.
The rest of the paper is organised as follows: in Section 2 we introduce the databases used in this work, DNAm sets containing the methylation level (between 0 and 1) of different CpG region related to cancer.In Section 3, we describe the methodology.In particular, Section 3.1 describes the statistical analysis performed on the DNAm data, Section 3.2 presents the dimensionality reduction algorithms used in this work, conventional and variational autoencoder, and Section 3.3 describes the details of the proposed deep clustering method.In Section 4, we describe the performed experiments in order to validate our method and in Section 5 we discuss the results obtained.Finally, Section 6 summarises the conclusions extracted with the carried out experiments.

Materials
For this study, we used two methylation datasets obtained from Gene Expression Omnibus (GEO) website [7].GEO is a public functional genomics data repository supporting MIAME-compliant data submissions.Specifically, we used the GSE32393 and the GSE57285 series to evaluate the proposed method.Note that the methodology proposed in this work was applied on the first series.The second series was used as an external database to perform an additional test and demonstrate the robustness of the proposed methodology.The GSE50220 series is composed of breast tissue samples from 114 breast cancers and 23 non-neoplastic breast tissues.The breast cancer tissue samples come from women (mean age 59.4) who were diagnosed with breast cancer.Among the cancers, 33 were at stage 1 and 81 at stage 2/3/4.All 23 non-neoplastic samples are from healthy women (mean age 47.6).The GSE50220 series is composed of breast tissue samples from 36 breast cancer and 9 normal control.Among the breast cancer, 20 were non-irradiated breast cancer and the rest were irradiated tumors.In all cases, to obtain the methylation data, the Illumina Infinium 27k Human DNA methylation Beadchip v1.2 was used at approximately 27,000 CpGs from women with and without breast cancer.For each sample, 27,578 DNA methylation profiles were obtained.The methylation status of each CpG site varies from 0 to 1.Under ideal conditions, a value of 0 means the CpG site is completely unmethylated and the value of 1 indicates the site is fully methylated.

Statistical analysis
To conduct the prescreening procedure and obtain the methylation sites with the most differential methylation expression, a previous statistical analysis of the CpG methylation data was carried out.First, a hypothesis contrast to analyze the level of independence between pairs of variables was performed.For this purpose, the correlation coefficient ρ and the p-value of the correlation matrix were calculated to remove those variables that meet both p-value ≤ α and |ρ| ≤ 0.90, being α the level of significance with a value of 0.05 for this application.After that, we performed different contrasting hypotheses to analyze the discriminatory ability of each variable regarding the class.Depending on if the variables fit a normal distribution or not, the hypothesis test performed was the t-student or the Wilcoxon Rank-Sum, respectively.After the statistical analysis, we reduced the 27,578 DNA methylation features of the GSE32393 series to 10,153.These features were the input for the following stage.

Dimensionality reduction
In order to explore the well-known non-supervised algorithms to reduce the data dimensionality based on deep learning techniques, the conventional and the variational autoencoder were tested.In this section, we detail the characteristics of both algorithms as well as their main differences.
-Conventional autoencoder Autoencoder (AE) is one of the most significant algorithms in unsupervised data representation.The objective of this method is to train a mapping function to ensure the minimum reconstruction error between input and output [17].As it can be observed in Fig. 2, the conventional autoencoder architecture is composed mainly of two stages: the encoder and the decoder stages.The encoder step is in charge of transforming the input data X into a latent representation Z through a non-linear mapping function, Z = f φ (X), where φ are the learnable parameters of the encoder architecture.The dimensionality of the latent space Z is much smaller than the corresponding input data to avoid the curse of dimensionality [24].Since the latent space is a non-linear combination of the input data with smaller dimensionality, it can represent the most salient features of the data.The decoder stage produces the reconstruction of the data based on the features embedded in the latent space, R = g θ (Z).The reconstructed representation R is required to be as similar to X as possible.Therefore, given a set of data samples X = {x i , ..., x n }, being n the number of available samples, the autoencoder model is optimized with the following formula: where θ and φ denote the parameters of encoder and decoder, respectively.The autoencoder architecture can vary between a simple multilayer perceptron (MLP), a long short-term -Variational autoencoder Variational autoencoder (VAE) is an unsupervised approach composed also of an encoder-decoder architecture like the conventional autoencoder aforementioned [22].However, the main difference between a conventional and a variational autoencoder lies in the fact that the VAE introduces a regularisation into the latent space to improve its properties.With a VAE, the input data is coded as a normal multivariate distribution p(z|x) around a point in the latent space.In this way, the encoder part is optimized to obtain the mean and covariance matrix of a normal multivariate distribution, see Fig. 3.
The VAE algorithm assumes that there is no correlation between any latent space dimensions and, therefore, the covariance matrix is diagonal.In this way, the encoder only needs to assign each input sample to a mean and a variance vectors.In addition, the logarithm of the variance is assigned, as this can take any real number in the range (−∞, ∞), matching the natural output range from a neural network, whereas that variance values are always positive, see Fig. 4.
In order to provide continuity and completeness to the latent space, it is necessary to regularize both the logarithm of the variance and the mean of the distributions returned by the encoder.This regularisation is achieved by matching the encoder output distribution to the standard normal distribution (µ = 0 and σ = 1).
After obtaining and optimizing the parameters of mean and variance of the latent distributions, it is nec-  essary to take samples of the learned representations to reconstruct the original input data.Samples of the encoder output distribution are obtained as follows: where is randomly sampled from a standard normal distribution and σ = exp( log(σ 2 ) 2 ).The minimized loss function in a variational autoencoder is composed of two terms: (1) a reconstruction term that compares the reconstructed data to the original input in order to get as effective encoding-decoding as possible and (2) a regularisation term in charge of regularizing the latent space organization, Fig. 4. The regularisation term is expressed as the Kulback-Leibler (KL) divergence that measures the difference between the predicted latent probability distribution of the data and the standard normal distribution in terms of mean and variance of the two distributions [6]: The Kulback-Leibler function is minimised to 0 if µ = 0 and log(σ 2 ) = 0 for all dimensions.As these two terms begin to differ from 0, the variational autoencoder loss increases.The compensation between the reconstruction error and the KL divergence is a hyper-parameter to be adjusted in this type of architecture.

Proposed method: Deep embedded refined clustering
Once the data dimensionality is reduced, we classify the samples in cancerous and non-cancerous.Reducing the data dimensionality without information about the different subjacent data distributions weakens the representativeness of the embedded features concerning the class and thereby, the performance of the subsequent classification worsens.For this reason, we consider that dimensionality reduction and classification should be optimized at the same time.In this context, we propose a deep refined embedded clustering (DERC) approach for classifying the DNA methylation data, see Fig. 5.It is composed of an autoencoder in charge of the dimensionality reduction and a cluster assignment corresponding to the unsupervised classification stage (clustering layer in Fig. 5).This approach is trained end-to-end optimizing the dimensionality reduction and classification in the same step and not in two different steps as all the algorithms proposed for DNA methylation analysis in the literature.
During the training process, the encoder and decoder weights of the autoencoder, W and W respectively, are updated in each iteration in order to refine the latent features of the encoder output Z.The proposed clustering layer (linked to the encoder output) obtains the soft-assignment probabilities q i,j between the embedded points z i and the cluster centroids {µ j } k j=1 every T iterations, being k the number of cluster centroids.The soft-assignment probabilities (q i,j ) are obtained with the Student's t-distribution proposed in [24].Using q i,j , the target probabilities p i,j are updated, see Algorithm 1.These target probabilities allow the refinement of the cluster centroids by learning from the current high-confidence assignments.To take into account the refinement of the latent space carried out by autoencoder while the samples are classified in one of the two clusters (cancer and non-cancer), the proposed model is trained end-to-end minimizing both reconstruction L rec and clustering loss L cluster terms: where β balances the importance of the losses due to the reconstruction of the data.The term L rec was defined in Equation ( 1) and it is minimized to obtain the maximum similarity between the input and the output data improving the representation of the latent space.L cluster is defined by the Kullback-Leibler (KL) divergence loss between the soft-assignments and the target probabilities, q i,j and p i,j respectively: The clustering term is minimized to achieve the softassignments q i,j and the target p i,j probabilities to be as similar as possible.In this way, the centroids are refined and the latent space obtained by the autoencoder is regularized to achieve a correct distinction between breast cancer and non-breast cancer samples.As discussed above, the hyper-parameter β balances the importance of losses due to the data reconstruction.If β is high, the data reconstruction term will predominate and the classification between cancerous and noncancerous samples will worsen.Otherwise, if this term is too low, the reconstruction losses will be marginal and the features of the latent space will not be optimized correctly.Consequently, the latent features will be very different from the input data, decreasing the accuracy of the classification.Therefore, β is a hyperparameter that needs to be properly adjusted.In Step 2 of Algorithm 1, the methodology used to optimize the proposed DERC algorithm is detailed.Note that to train the proposed method, a previous initialization of the centroids with latent characteristics is necessary (Step 1 of the Algorithm 1).In the experimental section, we present an experiment (Section 4.1.1)aimed at determining which of the dimensionality reduction models is optimal for this initialization.
Algorithm 1: Proposed methodology for the DERC approach.
Input: Methylation data X; number of clusters k; update interval T ; batch-size bs; learning rate lr; number of samples n.Output: Cluster assignment {c i } n i=1 of each methylation sample {x i } n i=1 .
(2) Obtain the centroid initialisation {µ j } k j=1 by running K-means on the latent space Z of the pre-trained autoencoder.
Step 2: Clustering using the proposed DERC method End-to-end DERC optimization: , j = j ; update p i,j ←

;
Final prediction stage: for i ← 1 to n do c i = argmax j (q i,j ) Fig. 5 Architecture of the proposed method (DERC) to detect breast cancer using DNA methylation data.The proposed algorithm is trained minimizing both, clustering and reconstruction loss.

Experimental results
As we mentioned in Section 2, the DNA methylation databases used were obtained from the Gene Expression Omnibus (GEO) website.In this section, we used the dataset GSE32393 to evaluate the dimensionality reduction and the unsupervised deep clustering performance.The dataset GSE50220 was used as an external validation to demonstrate that the proposed method can generalise to other breast methylation databases.It should be noticed that all experiments were performed on an Intel i7 @ 3.10 GHz of 16 GB of RAM with a Titan V GPU of 12 GB of RAM.The proposed methods were executed in Python 3.5 using TensorFlow 2.0.

Dimensionality reduction and classification separately
As mentioned above, an initial latent space with a lower dimensionality than the input data for the cluster centroid initialisation is necessary.In this section, we detail a comparison between the latent space obtained using the conventional and the variational autoencoder and the classification results after applying the K-means algorithm on each latent space.In this way, it will be demonstrated which algorithm is the most suitable for dimensionality reduction in the end-to-end proposed method.
Ablation experiment.The 10,153 CpG sites obtained after statistical analysis of the raw methylation data were the input of the proposed dimensionality reduction algorithms, conventional and variational autoencoders.In both cases, the dimensionality reduction was carried out using an architecture composed of 4 stacks.The number of neurons (input, output) of the 3 top layers were set to {(10153, 2000), (2000, 500), (500, 70)}, respectively (see Fig. 6).These layers were composed of a dense layer with ReLU as activation function except for the last decoder layer that was constituted of the sigmoid function in order to obtain an output value between 0 and 1, range of the methylation data values.The kernel weights were initialised with random numbers drawn from a uniform distribution within [−l, l], where l = 3 • s/n input , being s = 1  3 and n input the number of input units.The top layer output (latent space dimension) was set to {10, 20, 30}.Note that, these settings were obtained from empirical evaluations with a wide range of settings and we use only the best parameters here.After intense experiments, the optimal dimension of the latent space for both algorithms turned out to be 10 neurons.
To show the performance of AE and VAE and to demonstrate that they are not over-adjusted to the data, a first experiment was performed using 10% of the GSE32393 database as a validation set and the rest 90% as a training set.Subsequently, both algorithms were trained using the whole database (Entire prediction).The optimal hyper-parameters combination was achieved by training both algorithms during 300 epochs, using the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1 and a batch-size of 8. Regarding the loss function, in the case of the conventional autoencoder, the mean square error (MSE) was used.However, the variational autoencoder loss function was composed of two terms: MSE weighted by 0.8 and Kullback-Leibler (KL) divergence.
After training the dimensional reduction algorithms with the entire GSE32393 series and obtaining the features in the embedding space (encoder output), the classification results were obtained using K-means.To achieve this classification, we ran K-means with 80 restarts and selected the best solution.
Qualitative and quantitative results.After training the proposed dimensionality reduction algorithms, the results in terms of reconstruction error for both autoencoders are shown in Table 1.
In order to visualise in a qualitative way the effect of the tested dimensionality reduction methods (AE and VAE) over the data distribution, we used the t-distributed stochastic neighbour embedding (t-SNE) method to represent the latent space into a twodimensional space.T-SNE is a nonlinear dimensionality reduction technique that embeds high-dimensional data into a space of two or three dimensions, which can then be visualised by a scatter plot [15].In Fig. 7, we show the representation of the data (latent space of the pre-trained autoencoder (a) and latent space of the pre-trained variational autoencoder (b)) in a twodimensional space.
To quantitatively evaluate the performance of the clustering assignments, several metrics were computed: the unsupervised clustering accuracy (ACC), the error rate (%) and the false positive (FP) and the false negative (FN) ratios.The ACC metric [17] is defined as follows: where y i is the ground-truth label, c i is the cluster assignment generated by the algorithm, and m is a mapping function which ranges over all possible one-to-one mappings between assignments and labels.The error rate (%) is calculated according to the following formula: In Table 2, the above-mentioned metrics were calculated for the input data + K-means clustering, the latent space of the pre-trained autoencoder(AE)+Kmeans and the latent space of the pre-trained variational autoencoder (VAE)+ K-means.Note that the input data is referred to the 10,153 features extracted after the statistical analysis.

Dimensionality reduction and classification jointly (DERC)
After initializing the centroids using the algorithm with the lowest losses and the better prediction when Kmeans was used, in this case the conventional autoencoder, the deep embedded refined clustering algorithm was trained.As we explained in Section 3.3, the β value, which weights the terms that composed the loss function of the DERC algorithm, is an important parameter  to adjust.For this reason, we develop in this section a comparison between different β values exposing their influence on the clustering assignment.Ablation experiment.After pre-training the conventional autoencoder model (with the parameters detailed in Section 4.1.1),we added the clustering layer to the output of the autoencoder latent space, see Table 3 for the layer layout in the final architecture.In order to evaluate the β value on the performance of the clustering algorithm, we kept the rest of the hyper-parameters constant during the different experiments.In particular, the entire deep embedding clustering method was optimised by stochastic gradient descent (SGD) with a learning rate of 0.01 and a momentum of 0.9.The proposed method was trained during 50 epochs using a batch-size of 8 samples and the target distribution of the clustering layer was updated every 10 iterations.Note that these hyper-parameters were obtained from empirical evaluations with a wide range of settings.β was a variable parameter and various experiments were conducted by setting its value with {0.95, 0.85, 0.75, 0.65}.
Quantitative results.In this case, we show the classification results provided by the proposed deep embedded refined clustering (DERC) method depending on the value of β (see Table 4).4.2 GSE50220 Series: Generalization ability of the DERC algorithm In this section, we expose the results for the prediction of an external test set, see Table 5.The goal of this section is to demonstrate that the proposed DERC method could be valid to perform the feature extraction and classification from methylation data.Therefore, we made use of the GSE50220 series as an external test set to check the behaviour of the proposed methods with new breast cancer samples.

Discussion
In this work, we present a deep embedded refined clustering approach to automatically detect patients suffering for cancer using DNA methylation data.In concrete, the proposed algorithm was evaluated using two breast methylation datasets.As it can be observed in Table 2, an optimal data dimensionality reduction is essential to improve the classification results when working with high-dimensional data.Using the K-means algorithm as non-supervised classifier, the dimensionality reduction carried out by the pre-trained AE on the input data (DNAm profiles obtained after statistical analysis) improves the ACC results from 0.6715 to 0.9343.However, with the VAE algorithm, the ACC results do not improve, obtaining a value of 0.5693.As it can be seen in Fig. 7, the latent space of the VAE is centered around 0 due to the regularisation effect.This fact makes it impossible to distinguish between the different classes.Additionally, the reconstruction losses obtained by the VAE are higher than those reached by the AE (see Table 1).Therefore, it can be concluded that the conventional autoencoder is the most suitable algorithm to reduce the DNAm dimensionality.
Moreover, regarding the comparison between classifying separately and jointly to the dimensionality reduction, ACC results show an improvement from 0.9343 to 0.9927 when the dimensionality reduction and the classification are optimized all at once.In table 4, it can be observed the effect of the β value on the classification results.In this way, it can be demonstrated that when the contribution of the reconstruction losses is too low (low β) or too high (high β), the ACC results are worse compared to a more balanced contribution.However, all the accuracy results shown in Table 4, joint optimization, are higher than those obtained when applying K-means in the autoencoder latent space, separate optimization.Therefore, it is proven that when the classification is carried out at the same time as the dimensionality reduction is optimized, the best results are obtained.
This fact can also be demonstrated when the results of the proposed method are compared to those obtained in the literature [11,20].As discussed in Section 1, in [20], the authors proposed a dimensionality reduction algorithm followed by several unsupervised classification algorithms.They applied their methods to the same breast cancer database used in this paper (GSE32393 series).Note that they obtained an error rate of 2.94 using a deep neural network (DNN) following a self-organizing feature map (SOM) compared to 0.73 obtained by the proposed method.Furthermore, their algorithm predicted 4 cancer examples as healthy, while our method only misclassifies one cancer sample.Therefore, it confirms that the proposed deep embedded refined clustering algorithm improves the results when it is applied to DNA methylation data.Additionally, the algorithm proposed in [11] used the same breast cancer database (GSE32393 series).In this case, they used a Non-negative matrix factorisation (NMF) for dimensionality reduction following by supervised algorithms for classification.Their main limitation is that, according to the authors, the NMF algorithm cannot directly reduce the number of features (27,578) to a lower dimension than the number of samples (137).Therefore, they used a method called column-splitting in which they separated the original data into different matrices.They could not reduce the original data to a single latent space because they had to reduce each data matrix independently.In this way, the overall information of all original features is not taken into account.They used a K-fold for the algorithm validation and obtaining a 100 % of accuracy when they used 900 and 2700 CpG sites.However, both resulting models were overfitted as it is demonstrated when they reduced the number of features to 540 and the accuracy dropped to 97.85 %, lower than achieved with the proposed method.Additionally, the authors of [11] claimed that it is important to reduce the number of features to a smaller space than the total number of examples.However, they were only able to reduce the features to 540 (due to NMF restrictions) which is about 5 times the number of examples they used to train their models.
Furthermore, the results obtained by our proposed method on the external database (GSE50220 series) reported closely similar values to those reached in the primary set (GSE32393 series).This fact indicates that the proposed deep clustering model is perfectly applicable to other breast tissue databases (see Table 5).

Conclusion
In this paper, a deep embedded refined clustering based on breast cancer classification using DNA methylation data has been presented.To the best of the authors' knowledge, no previous studies using DNA methylation are based on algorithms that can optimise the dimensionality reduction and the classification of the data at the same time.As demonstrated throughout the manuscript, the method proposed in this paper improves the results of algorithms using dimensionality reduction and subsequent classification.
The proposed method allows the breast cancer classification using a latent space of only 10 features, which means a reduction in the dimensionality of 99.9637 %.The technology used in this study for data acquisition is the Illumina Infinium 27k Human DNA methylation Beadchip v1.2 which uses probes on the 27k array target regions of the human genome to measure methylation levels at 27,578 CpG dinucleotides in 14,495 genes.As verified through this work, many of the CpG sites obtained in the DNA methylation analysis are not relevant in the breast cancer classification.After ensuring model viability with a larger breast cancer database, the CpG sites from which the level of methylation is obtained could be reduced decreasing the cost and time of methylation analysis.Therefore, this work could contribute to a faster and more effective diagnosis of breast cancer, improving cancer care and advancing the future of breast cancer research technologies.
From a technical perspective, future lines of work will focus on adapting and applying the proposed method to identify and appropriately classify other challenging disorders, such as melanocytic tumours.In this way, the general applicability of the model for the detection of different types of cancer could be demostrated.

Fig. 1
Fig. 1 DNA methylation process.Methylation at 5' position of the cytosine catalyzed by DNMT (DNA methyltransferases) in the presence of S-adenosyl methionine (SAM).

Fig. 2
Fig. 2 Architecture of the proposed conventional autoencoder used for the non-supervised dimensionality reduction.

Fig. 3
Fig. 3 Main differences between a conventional and a variational autoencoder.Instead of just learning a function representing the data (a compressed representation) like conventional autoencoders, variational autoencoders learn the parameters of a probability distribution representing the input data.

Fig. 4
Fig.4Architecture of a variational autoencoder.The proposed algorithm is optimized by minimizing two loss functions.One of them corresponding to the latent space regularisation and the other one corresponding to the input data reconstruction.

Fig. 6
Fig. 6 Final dimensionality reduction architectures.(a) Conventional autoencoder architecture composed of 4 stacks.(b) Variational autoencoder architecture composed of 4 stacks.Note that with the variational autoencoder algorithm, the latent space is obtained in two stages, two dense layers of 10 neurons representing the mean and the logarithm of the variance of the latent distribution and a sampling layer to obtain the points of the latent space.

Fig. 7
Fig. 7 Latent space of the dimensionality reduction algorithms.(a) Visualisation of 10-dimensional features extracted by the latent space of the pre-trained variational autoencoder.(b) Visualisation of 10-dimensional features extracted by the latent space of the pre-trained conventional autoencoder.

Table 2
Comparison of the K-means clustering effect based on different feature extraction.

Table 3
Architecture of the proposed deep embedded refined clustering model.

Table 4
Comparison of the clustering effect of the proposed DECR based on different β values.

Table 5
Results obtained over the external dataset.Note that in this case, the input data corresponds to the GSE50220 series after selecting the CpG sites extracted with the statistical analysis in the GSE32393 series.