1 Introduction

Cancer results from highly expressed genes due to mutations or alterations in gene regulations that control cell division and cell growth. In such cases, a set of genes called oncogene contribute to conversion of normal cells into a cancerous cells. The change in the structure of occurring genetic aberrations, such as somatic mutations, copy numbers (CNs), profiles, and different epigenetic alterations, is unique for each type of cancer [8, 45]. As a result, gene expression (GE) can be disrupted by cell division, environmental effects, or genetically inherited from parents. Changes in GE sometimes change the production of different proteins, affecting normal cell behavior. The damaged cells start reproducing more rapidly than usual and gradually increase in the affected area by forming a tumor. Intermittently, such tumors turn into a type of cancer [42, 54]. This is one of the reasons cancer cases are gradually increasing every year and have become the second leading cause of death worldwide.

Consequently, more than 200 types of cancers have been identified, in which each type can be characterized with different molecular profiles requiring unique therapeutic strategies [45]. According to a statistic from the National Cancer Institute, there were around 14.1 million cancer cases in 2012 in which as many as 8.8 million people died of five leading cancers of lung, liver, colorectal, stomach, and breast [3]. In 2018, an estimated 17.35 million new cases of cancer have been diagnosed in the USA in which 609,640 people died. The number of new cancer cases per year is expected to rise to 23.6 million by 2030, which is anticipated to increase further to 70% by 2035 [46].

With the high-throughput next-generation sequencing technologies, cancer-specific genetic profiling is now possible. The entire genome sequencing data can be used to identify similar genetic mutations and genetic variations associated with different tumors. In particular, copy number variations (CNVs) signify gene or genomic regions that appear in different numbers of copies in different individuals or even in different cells of the same individual, where CN can vary from one person to another by several thousand. About 5–12% of the human genome, including thousands of genes, may be variable in CN, and this variation can be new deletions or duplications of the genome or inherited from the parents by healthy individuals [39], ranging in size from 100 bp to 3 Mb [52].

Although the significance is not fully understood, it is likely that CNVs are responsible for a considerable proportion of phenotypic variation [39]. Such variations may lead to changes in gene dosage and expression [12]. Approximately 179,450 human CNVs were reported in the Database of Genomic Variants (DGV) [20, 52]. Although there are substantially fewer reported CNVs than SNPs, it is estimated that more than 30% of the human genome is covered by at least one CNV compared to the \(<1\)% covered by SNPs. Due to the CN changes in DNA segments, GE is changed by disrupting coding sequences, perturbing long-range gene regulations, or altering gene dosages [50]. CNVs result in variations in GE and abnormalities in the human phenotypes [7]. Thus, CNVs are hypothesized to be of functional significance. These changes in GE are responsible for different phenotypic variations or diseases (e.g., disabilities, diabetes, schizophrenia, cancer, and obesity) or envisaged to be associated with other diseases, e.g., autism spectrum disorder [4, 34, 37].

CNVs are also associated with other complex disease susceptibility, which becomes more obvious while analyzing different tumor types, which became more obvious as CNVs are associated with the risk of individual cancer [27, 33, 48] (e.g., risk of pancreatic cancer [19]). For example, CNVs that play a role in genetic predisposition to pancreatic adenocarcinoma [48] are associated with breast cancer risk and prognosis [27] and are responsible for the spatial pattern change in colorectal cancer [33]. The presence of CNVs in cancer patients’ cells is abundantly high, which is very different than healthy cells. Consequently, many clinically important CNVs are outcomes of duplication or deletion of a genomic region with 1 kb (or shorter) length. Therefore, an accurate identification of cancer based on CNVs can help to reveal the genetic predisposition for cancer before cancer actually occurs, so that vigilant prevention and rigorous monitoring may be practiced by those who are highly predisposed.

Previously, CNVs were used to identify minimum redundancy maximum relevance feature selection (mRMR) and incremental feature selection (IFS) methods [53] such that only one sample could be covered at a time or using microarray-based comparative genomic hybridization (aCGH) methods. Then, the extracted CNVs data were used to train machine learning (ML) models for cancer identification and type prediction. These approaches, however, are not capable of simultaneous analysis of multiple samples and recurrent CNVs [32]. In particular, using deep learning (DL) models based on deep neural network (DNN) architectures, such as autoencoders (AEs), more complex and higher-level features can be embedded from the input data and contextual information can be captured [1, 24]. Eventually, learning nonlinear mappings allows transforming and mapped input data space into a lower-dimensional feature space [21]. To emphasize DL-based cancer diagnoses, first we use the MSeq-CNV [32] tool to extract the recurrent CNVs. Since MSeq-CNV uses mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes, common CNVs across multiple samples can be detected easily.

Since CNVs data are very high dimensional and dynamic, we hypothesize that a neural ensemble method can be effective at analyzing such high-dimensional data through the hierarchical learning process. We train and evaluate Conv-LSTM and convolutional autoencoder (CAE) networks separately, which follow the model averaging ensemble of the classifiers to combine the predictive power of multiple model snapshots from these networks. Further, we provide interpretations (using guided-gradient class activation map (GradCAM++) [6]) about the prediction made by the neural network in order to make the cancer diagnosis more effective and transparent. The contributions of this paper can be summarized as follows:

  • We employed MSeq-CNV tool to extract CNV features based on protein-coding genes and oncogenes in order to prepare two rich-labeled training datasets.

  • We trained Conv-LSTM and CAE networks to capture both locally and globally important features, where unsupervised pretraining is performed to handle the data sparsity.

  • We aggregated the snapshot models from both architectures using MEA to make single predictions.

  • We developed a novel approach to generate heat maps (HM) for all the classes sharing prominent pixels across CNV samples and computed the feature importance in terms of mean absolute impact (MAI).

  • We identified important and common biomarkers across cancer types and provided interpretations of the predictions made by the models.

  • We provided in details the analyses of the outcomes and comparisons with the state-of-the-art approaches.

The rest of the paper is structured as follows: Sect. 2 discusses related works and their potential limitations. Section 3 chronicles the details of data collection and feature engineering process before the network construction and training. Section 4 demonstrates the experimental results and discusses key findings of the study. Finally, explanations of the importance and relevance of the study are reported by highlighting the limitations and discuss some future works before concluding the paper in Sect. 5.

2 Related works

For analyzing genomics data and decision making for cancer treatment, different ML and DL algorithms were trained using mixed data types: genomic data, bioimaging data, and clinical outcomes, as shown in Table 1. These approaches are not only proven to be useful at improving cancer prognosis, diagnosis, and treatments, but also revealed the subtypes information about different cancer types [19]. For example, RNA-seq data are used more widely [9, 31, 36] to identify rare and common transcripts, isoforms, and noncoding RNAs in cancer. Other forms of genomic data like single-nucleotide polymorphism (SNP) (i.e., identifies segmental variations across multiple cancer genomes) and array-based DNA methylation sequencing (i.e., identifies epigenetic changes in the genome) are useful for early genetic changes in cancer, e.g., early-stage and metabolomic detection of ovarian cancer is now possible [14, 45].

Unlike conventional cancer typing methods that work based on morphological appearance analysis, GE levels of the tumor are used to differentiate tumors that have similar histopathological appearances giving more accurate tumor typing results for certain cancer types, e.g., colorectal cancer diagnosis [41]. Further, different types of somatic mutations data such as point mutation, single-nucleotide variation (SNV), small insertion and deletion (INDEL), copy number aberration (CNA), translocation, and CNVs are used. Literature [51] has observed that somatic mutation data are not only associated with complex diseases but also contribute to the growth of different types of cancers. In particular, literature [5] studied CN changes by comparing healthy and cancer-affected patients, which showed that amplification and deletion of certain genes are more common in certain cancer patients than in healthy people.

Table 1 Different cancer detection methods, data types, and performance

CNVs analysis based on different statistical methods is also used to identify significant CN associated with different types of cancers. For example, Fisher’s exact (FE) test is applied on patient and control groups to identify CN for hereditary breast and ovarian cancer [28]. Although FE is mainly used for CNV analysis [17], ML-based approaches are trending to improve the accuracy of cancer susceptibility, recurrence, and survival prediction [26].

The main challenges are, however, accurate extraction of CNVs and dealing with dimensionality. ML algorithms such as Bayesian networks, SVM, and decision trees are applied effectively to extract the most significant CNVs features for high-dimensional data. In comparison with ML-based approaches, recent DL techniques have shown more accurate and promising results for cancer identification in some studies. In particular, CNN is widely applied [8] on whole slide images in order to detect cancer regions with a very high degree of precision, which is mainly because CNN can extract deep features from different cohorts simultaneously. Literature [9] used a stacked denoising autoencoder to extract features from the RNA-seq data, which are then feed into SVM and shallow ANN to classify malignant or benign tumor of breasts [7]. DeepCNA is another approach proposed for cancer-type prediction based on CNVs and chromatin 3D structure with CNN [51].

Apart from these works, restricted approaches have been proposed based on CNVs for cancer risk and type predictions [11, 13, 53], and literature [11] used recurrent CNVs from non-tumor blood cell DNAs of non-cancer subjects about hepatocellular carcinoma, gastric cancer, and colorectal cancer patients. They were found to reveal the differences between cancer patients and controls with respect to CN losses and CN gains. Although their study can make predictions on the cancer predisposition of an unseen test group of mixed DNAs with high confidence, it is limited to only Caucasian and Korean cohorts.

Ning et al. [53] used CNVs at a level of 23,082 genes for 2916 instances from cBioPortal for Cancer Genomics to classify six different types of cancers, such as breasts, bladder urothelial, colon, glioblastoma, kidney, and head and neck squamous cell. They construct a dagging-based classifier in which the feature space was reduced to CNVs of 19 genes using mRMR and IFS methods. They managed to achieve an accuracy of 75%. Their study indicates that only a few genes may play important roles in differentiating cancer types. Then, Sanaa et al. [13] used the same dataset and extended it by training seven ML classifiers in which random forest algorithm shows 86% accuracy.

In a previous approach [22], we considered segmentation as an important feature because it represents number of CNVs at a DNA location, and the higher the segmentation mean, the higher the CN would be in that region. Then, we calculated the length of a CN and its value based on the difference between the start and end positions of a CNV to extract CNV features. We represented CN loss with negative segmentation means and amplifications of CN with positive segmentation means. CN with segmentation values between a certain range was considered as noise and discarded from rest of the calculation. However, a manual approach for CNV extraction like this often fails to extract non-trivial, high-quality recurrent CNV features in the case of simultaneous analysis of multiple samples [32]. Consequently, MSeq-CNV is used in our study for more efficient extraction of CNVs.

3 Materials and methods

In this section, we discuss the data collection, preprocessing, and feature engineering in detail, followed by the preparation of training, validation, and test sets to train the neural networks.

3.1 Data collection

We used the CNVs data from TCGA in our study in which both CN and gene coordinates of different cancer types are considered. These data are hybridized by genome sequencing Affymetrix SNP 6.0 technology, which allows us to examine the largest number of cases along with the highest probe density [40]. Two different types of samples, such as tumor tissue and healthy tissue, are collected from each cancer patient, which are curated from the blood, bone marrow, buccal cell, EBV-immortalized, and solid tissues. For consistency across samples, all data were downloaded from the same platform but with different project ID and gender. However, blood-derived healthy samples of only 14 cancer types were downloaded having at least 400 samples, as shown in Table  2.

Table 2 Number of CNV samples across 14 different tumor types used in this study

We consider gender an important feature for cancer sub-typing, since tumors like BRCA and OV are not common in males, whereas PRAD is only common in males. TCGA do not specify if a sample is curated from healthy tissue or tumor tissue, so we downloaded CNV files in separate groups, where each cancer sample is grouped as primary and blood-derived samples during filtering before grouping into male and female. Gene locations in combination with the CN data from all patients were used, where the gene coordinates were collected from cytobands using the ensemble library of biomaRt packageFootnote 1. Additionally, oncogenes information was downloaded from the TCGA portal.

3.2 Data preprocessing

Some initial reprocessing was required to remove noise and empty string values from both CNV and gene data. For each cancer type, samples were downloaded into four different groups: healthy samples from male, healthy samples from female, tumor samples from male, and tumor samples from female. Then, we preprocessed CNV data for 15,699 separate samples. Then, all the cancer samples and healthy tissue samples were combined. We found that the distribution of CNs across tumor samples is very different, e.g., BRCA has almost double the samples of most other tumors.

STAD and BLCA have fewer samples compared to other tumor samples except for those having samples between 500 and 600. On the other hand, more than 400 THCA samples have 50–100 CNVs per sample and around 50 OV tumor samples have less than 200 CNVs per sample. Although most tumor samples have CNs between 50 and 400, there are a few samples having more than 1500, giving on average 226.80 CNs per sample. On average, 106.92 CNs exist per healthy sample, which is less than half of tumor samples. Usually, healthy human cells have very different CNs in terms of location, length, and number and vary in humans. This makes the CNVs of the human genome an utterly complex and dynamic structure.

For a cancer-affected patient, this structure becomes even more complex and dynamic as the tumor grows. We solve this dynamic dimension problem by using gene locations of the human genome since DL models expect fixed dimension inputs only. Using MSeq-CNV, we selected a fixed number of genes and extracted the CN that overlapped with the gene locations, removing them from the protein noncoding gene because arguably more than 80% of human genes do not encode any protein, i.e., CNs from these regions have little-to-no effect on the tumor growth. Then inspired by literature [2], we set four thresholds, which divide the CNV spectrum into five regions:

  • \(-2\) signifies the deletion of both copies,

  • \(-1\) signifies the deletion of one copy,

  • 0 has exactly two copies, i.e., no CN gain/loss,

  • 1: denotes a low-level CN gain,

  • 2 denotes a high-level CN gain.

Then, we prepared two different training (i.e., representations) sets based on oncogenes and protein-coding genes used to train the Conv-LSTM and CAE networks.

3.3 Feature extraction based on protein-coding genes

We used CNVs about 20,308 protein-coding genes. Since not all of these genes are responsible for tumor growth, weak features from irrelevant genes make the data unnecessarily complex, which might make the neural networks to fail to converge the training, resulting in poor performance. Although using protein-coding genes makes the dimension fixed, 95% of the features will be empty across the samples, e.g., a sample with 500 CNVs has 19,808 empty features making empty features for 19,808 cases. Using gene locations introduced two additional complexities: curse of dimensionality and sparsity.

To solve these problems, we performed CNV analysis in combination with 568 oncogenes responsible for the majority of cancer types. While oncogenes are found to be related to tumor growth, researchers have not confirmed which other genes are also responsible for its growth. In such a setting, consecutive genes with nonzero values represent CN length, and negative or positive segmentation values represent loss and gain, respectively. We hypothesize that Conv-LSTM and CAE networks will be able to identify those hidden nonlinear features.

3.4 Feature extraction based on oncogenes

There are structural variations in DNA where protein-coding genes are not present, and these genes have no effect on tumor growth. One way to remove such unnecessary segments is by keeping only those CNs that fall within gene regions and discarding the rest using oncogenes, because oncogenes not only control the cell division but also are responsible for GE changes due to external or natural causes and could lead to the development of tumor at some point. Thus, by considering only 568 oncogenes, we prepared another CNV representation instead of taking the whole 20,308 genes.

A manual CNV feature extraction is inefficient since recurrent CNVs from multiple samples are often ignored. Thereupon, we used the MSeq-CNV tool [32], which is an efficient CNV extraction method. In order to model the number of mate pairs with aberration in the insertion size and Poisson distribution for emitting the read counts, MSeq-CNV applies the binomial distribution-based mixture density on each genomic position. This applies a mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes. After applying MSeq-CNV, genome-wide copy gain and loss regions are saved in a matrix form, where each row represents a different CNV call with its sample number, start position, end position, and CN estimation.

Fig. 1
figure 1

An example CNV selection process from a set of candidate CNs from a reference segment of chromosome 1 that extends from one DNA location to another location

An example of the CNV selection process is shown in Fig. 1, where a reference segment of chromosome 1 on top extends from DNA location 109–1145 spanning over two genes: A and B. Below this reference, there are six candidate CNs (C1–C6) at this location. C1 is a candidate CN that extends between genes A and B, but C1 does not overlap with any one of them, which is why we remove it from the calculation. Candidates C2 and C3 are CNs, too, which slightly overlap with A and B, respectively. We still consider them as a valid CN because gene positions are not always fixed. C4 and C5 both fully overlap with genes A and B; hence, they are also valid CNs. C6 is not only a big segmental CN, but it also overlaps with both genes. However, since we consider only the part of genes that overlap, we discard rest of the regions from the calculation. The calculation is based on the hypothesis that if a candidate CN overlaps with a gene even just by one base pair, we still consider it as a valid CN [35].

Table 3 Example of training data prepared based on CNVs representation of oncogenes

Each gene has a start and end positions in DNA, which is similar to CNs having starting and ending positions. For each gene, MSeq-CNV checks whether a sample has any CNV region overlapping with that gene. Then, it assigns the segment value for that gene; otherwise, 0 is assigned, i.e., a tumor sample has different CN variations than a reference, which is measured by the segment value. If the value is positive, CN in that gene is considered to be more increased than the reference. A negative value, however, signifies that the gene was deleted from that DNA location. A sample of the final training data is shown in Table 3. The first row shows a male patient’s GBM tumor sample, where genes PRDM16 and RPL22 have CN losses with segmentation values of 0.3195 and 0.2154, respectively, whereas gene CAMTA1 has no CNVs. Eventually, our training data dimension reduced from 20,308 to 568 and the label column.

3.5 Network constructions and training

We construct and train the Conv-LSTM and CAE networks using two different CNVs representations based on oncogenes and protein-coding genes, in which multiple snapshot models are created. We evaluate each trained model independently, which is followed by the model averaging ensemble (MAE) to combine the snapshot models of both networks.

However, to apply the convolutional (conv) operations and to find important biomarkers using a guided-gradient class activation map (aka. GradCAM), we embedded each CNVs sample into a 2D image. Each protein-coding gene-based CNV sample was reshaped from a \(20,308 \times 1\) array into a \(144 \times 144\) image, whereas each oncogene-based CNV sample was reshaped from a \(568 \times 1\) array into \(24 \times 24\) image by adding zero padding around the edges. Then, all the images are normalized to [0,255] without losing any information.

3.5.1 Conv-LSTM network

We construct and train the Conv-LSTM network by combining CNN and LSTM layers in a single architectures as shown in Fig. 2. While the CNN uses conv filters to capture local relationship values, an LSTM network can carry overall relationship of a whole CNV sequence more efficiently. This turns Conv-LSTM into a powerful architecture to capture long-term dependencies between features extracted by CNN. To do so, all the input \({\mathcal {X}}_{1},{\mathcal {X}}_{2},\ldots ,{\mathcal {X}}_{t}\) cells in Conv-LSTM output \({\mathcal {C}}_{1},{\mathcal {C}}_{2},\ldots ,{\mathcal {C}}_{t}\), hidden states \({\mathcal {H}}_{1}, {\mathcal {H}}_{2},\ldots , {\mathcal {H}}_{t}\), and gates \(i_t\),\(f_t\),\(o_t\) of the network as 3D tensors whose last two dimensions are spatial dimensions.

Then, the Conv-LSTM determines the future state of a certain cell in the input hyperspace by the inputs and past states of its local neighbors, which is achieved by using a convolution operator in the state-to-state and input-to-state transition as follows [49]:

$$\begin{aligned} \left\{ \begin{array}{l} i_{t} =\sigma \left( W_{x i} * {\mathcal {X}}_{t}+W_{h i} * {\mathcal {H}}_{t-1}+W_{c i} \circ {\mathcal {C}}_{t-1}+b_{i}\right) \\ f_{t} =\sigma \left( W_{x f} * {\mathcal {X}}_{t}+W_{h f} * {\mathcal {H}}_{t-1}+W_{c f} \circ {\mathcal {C}}_{t-1}+b_{f}\right) \\ {\mathcal {C}}_{t} =f_{t} \circ {\mathcal {C}}_{t-1}+i_{t} \circ \tanh \left( W_{x c} * {\mathcal {X}}_{t}+W_{h c} * {\mathcal {H}}_{t-1}+b_{c}\right) \\ o_{t} =\sigma \left( W_{x o} * {\mathcal {X}}_{t}+W_{h o} * {\mathcal {H}}_{t-1}+W_{c o} \circ {\mathcal {C}}_{t}+b_{o}\right) \\ {\mathcal {H}}_{t} =o_{t} \circ \tanh \left( {\mathcal {C}}_{t}\right) \end{array}\right. \end{aligned}$$
(1)

In Eq. 1, ‘*’ denotes the conv operation and ‘o’ is the entrywise multiplication of two matrices of same dimensions. Intuitively, an LSTM layer treats an input feature space as timesteps and outputs arbitrary hidden units per timestep. The second LSTM layer emits an output ‘H,’ which is then reshaped into a feature sequence to feed into fully connected layers to predict the cancer types at the next step, which is again used as an input at the next time step. Once the input feature space is passed to the conv layer, the input is padded such that the output has the same length as the input.

Fig. 2
figure 2

The schematic representation of the Conv-LSTM network, which starts from taking input CNV samples and passing to CNN layers before getting a sequence vector representation from the LSTM layers to pass to dense, dropout, and softmax layers for the cancer-type prediction

Output of each conv layer is then passed to the dropout layer to regularize learning to avoid overfitting [25]. This involves the input feature space into lower-dimensional representation, which is then further down-sampled by two different max pooling layers and a max pooling layer (MPL) by setting the pool size, where the output of an MPL can be considered as an ‘extracted feature’ from each 2D CNV image. Since each MPL follows to ‘flatten’ the output space by taking the highest value in each timestep dimension, this helps produce a sequence vector (i.e., feature sequence) from the last LSTM layer, which will hopefully force the CNVs of specific genes that are highly indicative of being responsible for specific cancer type. Then, this vector is fed into the neural network after passing through another dropout layer and a fully connected Softmax for the probability distribution over the classes.

3.5.2 Convolutional autoencoder classifier

Using the Conv-LSTM-based network, we have seen how to extract both local and globally important features for classifying individual samples from a limited number of labeled samples. However, in cases of very small numbers of training samples, unsupervised pretraining has proven highly effective [10, 16, 21, 47]. Further, since both datasets are very sparse, we hypothesize that using a CAE-based representation learning (with reduced dimension) and classification scheme should further enhance the classification accuracy.

Fig. 3
figure 3

The schematic representation of the CAE classifier, which starts from taking input CNV samples and passing to CNN layers before getting a sequence vector representation to pass to dense, dropout, and softmax layers

The architecture of the CAE used in this work consists of a 19-layer CAE. A batch normalization layer is placed after every conv layer to normalize the inputs to the next layer in order to fight the internal covariate shift problem. The ReLU activation function is used in every layer except for the last layer that uses a softmax activation function. From the given CNV samples, a conv layer in the encoder calculates the feature map. The pooling layer is calculated by down-sampling the conv layer by taking the maximum value in each non-overlapping subregion. The CAE classifier contains two parts: (i) the CAE and (ii) the classifier. The CAE part has the following structure (Fig. 3):

  • Input: each CNVs sample is reshaped to \(144 \times 144\)

  • Convolutional layer: of size \(32 \times 144 \times 144\)

  • Batch normalization layer: of size \(32 \times 20,736\)

  • Convolutional layer: of size \(32 \times 20,736\)

  • Batch normalization layer: of size \(32 \times 10,368\)

  • Max pooling layer: of size \(32 \times 10,368\)

  • Convolutional layer: of size \(64 \times 10,368\)

  • Batch normalization layer: of size \(64 \times 10,368\)

  • Convolutional layer: of size \(64 \times 10,368\)

  • Batch normalization layer: of size \(64 \times 10,368\)

  • Convolutional layer: of size \(128 \times 10,368\)

  • Batch normalization layer: of size \(128 \times 10,368\)

  • Convolutional layer: of size \(128 \times 10,368\)

  • Batch normalization layer: of size \(128 \times 10,368\)

  • Convolutional layer: of size \(128 \times 10,368\)

  • Batch normalization layer: of size \(128 \times 10,368\)

  • Convolutional layer: of size \(64 \times 10,368\)

  • Batch normalization layer: of size \(64 \times 10,368\)

  • Upsampling layer: of size \(64 \times 20,736\)

  • Convolutional layer: of size \(1 \times 20,736\).

After training the CAE, we remove the decoder components by making the first 19 layers trainable false, since the encoder part is already trained. On top of these components, we add a flattening layer, followed by a fully connected dense layer of size 128, followed by the softmax output unit of 14, i.e., number of classes.

3.5.3 Ensemble of classifiers

The model ensemble technique helps the neural network to achieve improved performance compared to the predictions from a single model by reducing the generalization error. It can be achieved by training multiple model snapshots during a single training run and combining their predictions to make an ensemble prediction called snapshot ensemble [18]. A limitation of this approach, however, might be that the saved models will be similar, resulting in similar predictions and predictions errors. This will not offer much benefit from combining their predictions unless we have already introduced diversities during model training [18].

Fig. 4
figure 4

Training loss of Conv-LSTM network using standard learning rate (blue) and \(M = 50\) cosine annealing cycles (red). The intermediate models, denoted by the dotted lines, form an ensemble at the end of training

This issue can be addressed by changing the learning algorithm to force the exploration of different network weights during a single training run that will result, in turn, with models that have differing performance. One way to achieve this is using cyclic annealing cosine (CAC), which aggressively but systematically changes the learning rate (LR) over training epochs to produce very different network weights [30]. The CAC approach requires total training epochs, maximum learning rate, and number of cycles, as well as the current epoch number making the initial LR and the total number of training epochs as two hyperparameters. Therefore, CAC will have an effect of starting with a large LR, which is relatively rapidly decreased to a minimum value before being dramatically increased again to the following LR for the given epoch [18]:

$$\begin{aligned} \alpha (t)=\frac{\alpha _{0}}{2}\left( \cos \left( \frac{\pi \bmod (t-1,\lceil T / M\rceil )}{\lceil T / M\rceil }\right) +1\right) \end{aligned}$$
(2)
Fig. 5
figure 5

Model averaging ensemble of Conv-LSTM and CAE snapshots

In Eq. 2, \(\alpha (t)\) is the LR at epoch t, \(\alpha _0\) is the maximum LR, T is the total epoch, M is the number of cycles, mod is the modulo operation, and square brackets indicate floor operations. After training the network for M cycles, best weights at the bottom of each cycle are saved as the snapshot of each model, which gives us M model snapshots. The model converges to a minimum at the end of training. The model undergoes several LR annealing cycles, converging to and escaping from multiple local minima during snapshot ensembling.

The overall process is outlined in Fig. 4. Finally, all the snapshots are collected together and used in the final ensemble using MEA technique as shown in Fig. 5. Although each model’s weights are subjected to the dramatic changes during training for the subsequent LR cycle, CAC allows the learning algorithm to converge to a different solution.

3.5.4 Networks training

We train the networks in two steps: (i) First, we train each network independently and (ii) we create multiple snapshots of both networks before applying the MAE. While training the Conv-LSTM and CAE networks, network parameters were initialized with Xavier initialization [15] and trained using a first-order gradient-based optimization technique called AdaMax to optimize the categorical cross-entropy loss shown in Eq. 3 of the predicted cancer type P vs. actual cancer type T:

$$\begin{aligned} E_{c} = \sum _{i, j, k} T_{i, j, k} \log P_{i, j, k}+\left( 1-T_{i, j, k}\right) \log \left( 1-P_{i, j, k}\right) \end{aligned}$$
(3)

On the other hand, the Softmax activation function is used in the output layer for the probability distribution over the classes. The hyperparameters are defined in grid search and tenfold cross-validation by varying learning rates and different batch sizes. Further, we observe the performance by adding the Gaussian noise layers followed by Conv and LSTM layers to improve model generalization and reduce overfitting. When we train the ensemble model, we set total epochs, maximum learning rate, number of cycles, and the current epoch number, but we make the initial learning rate and the total number of epochs two hyperparameters as expected by the CA approach.

$$\begin{aligned} h_{\mathrm {MAE}}=\frac{1}{m} \sum _{0}^{m-1} h_{M-i}(\mathrm {x}) \end{aligned}$$
(4)

After training each network for certain cycles, the best weights are saved, giving model snapshots. The ensemble prediction at test time is the average of the last m model’s outputs, where \(m \le M\). If \({\mathbf {x}}\) is a test sample and \(h_i \left( {\mathbf {x}} \right)\) is the softmax score of snapshot i, the final output is the simple mean of the last m models shown in Eq. 4, giving the lowest test error [18].

3.5.5 Finding and validating important biomarkers

Since the data are very high dimensional, we chose not to go for manual feature selection. Rather, we let both CAE and Conv-LSTM networks extract the most important features inspired by literature [23]. The guided backpropagation helps to generate more human-interpretable but fewer class-sensitive visualizations than the saliency maps (SM) [38]. Since SM use true gradients, the trained weights are likely to impose a stronger bias toward specific subsets of the input pixels. Accordingly, class-relevant pixels are highlighted rather than producing random noise [38]. Therefore, GradCAM++ is used to draw the HM to provide attention to most important genes. Class-specific weights of each FM are collected from the final conv layer through globally averaged gradients (GAG) of FMs instead of pooling [6]:

$$\begin{aligned} \alpha _k^c=\frac{1}{Z}\sum _{i}\sum _{j}\frac{\partial y^c}{\partial A_{ij}^k} \end{aligned}$$
(5)

where Z is the number of pixels in a FM, c is the gradient of the class, and \(A_{ij}^k\) is the value of kth FM at (ij). Having gathered relative weights, the coarse SM, \(L^c\) is computed as the weighted sum of \(\alpha _k^c*A_{ij}^k\) of the ReLU activation function and employ the linear combination to the FM, since only the features with positive influence on the class are of interest [6].

$$\begin{aligned} L^{c}= {ReLU}\left( \sum _{i} \alpha _{k}^{c} A^{k}\right) \end{aligned}$$
(6)

The GradCAM++ replaces the GAG with a weighted average of the pixel-wise gradients (Eq. 7), since the weights of pixels contribute to the final prediction (Eq. 8) by aggregating Eq. 7 and \(\alpha _{ij}^{kc}\) (Eq. 9). In summary, it applies the following iterators over the same activation map \(A^k\), (ij) and (ab):

$$\begin{aligned} w_{k}^{c}= & {} \sum _{i} \sum _{j} \alpha _{i j}^{k c} \cdot {ReLU}\left( \frac{\partial y^{c}}{\partial A_{i j}^{k}}\right) \end{aligned}$$
(7)
$$\begin{aligned} y^c= & {} \sum _{k}w_k^c\cdot \sum _{i}\sum _{j}A_{ij}^k \end{aligned}$$
(8)
$$\begin{aligned} \alpha _{ij}^{kc}= & {} \frac{\frac{\partial ^2y^c}{(\partial A_{ij}^k)^2}}{2\frac{\partial ^2y^c}{(\partial A_{ij}^k)^2}+\sum _{a}\sum _{b}A_{ab}^k{\frac{\partial ^3y^c}{\{(\partial A_{ij}^k)^3\}}}} \end{aligned}$$
(9)
figure a
figure b
Fig. 6
figure 6

Schematic representation of important biomarkers identification, which starts from taking a raw CNV sample and passing to conv layers before obtaining rectified conv feature maps to be used to set and record the activation maps in the forward pass, and gradient maps in the backpropagation to put attention using heatmap

Consequently, they can highlight class-relevant pixels rather than produce random noise [38]. This further motivated us to use the GradCAM++ [6] to draw the heat map to provide attention to the most important genes and related CNV in their locations. Since GradCAM++ requires all the samples to run through the neural network once, we let the trained CAE model set and record the activation maps in the forward pass, and the gradient maps in the backpropagation collect the heat map for each sample from the trained model.

The idea is also to find most important biomarkers by ranking them based on mean absolute impact (MAI) threshold. For a batch of samples, first we compute the loss for a column across the samples. Then, we shuffled that column and compute the loss again. The feature importance is then given by the difference between the loss obtained when the respective column is shuffled and the original loss. Then, the MAI is computed as the mean of the \(F_i\) across the batches. The idea is that when the column is shuffled, the relationship between that feature and the output of the model is broken. Therefore, if the loss increases a lot, making the model predictions are less accurate, then there is high chance that that variable could be an important predictor. Conversely, if the loss remains almost equal, that predictor is not contributing significantly toward increasing or decreasing the accuracy. Hence, it can be considered as a less important feature.

The second approach we could employ is by first computing the gradient for a class for the last conv layer and using the ReLU activation function to normalize the gradient between 0 and 1, which is the importance. However, if the gradient is negative, we would have to change it to 0, which would force to keep track of weak CNV features. Algorithms 1 and 2 show the pseudocode used for computing feature importance with ranking genes and for identification of the important areas on the heat maps, while Fig. 6 shows the schematic representation of both step.

3.5.6 Hyperparameter tuning

Since an appropriate selection of hyperparameters can have a huge impact on the performance of deep architecture, we perform the hyperparameter optimization through a random search and fivefold cross-validation tests. First, we optimize the parameters for the individual model, followed by the optimization of the ensemble model. While training the Conv-LSTM and CAE, in each of five runs, 70% of the data are used for the training, 30% of the data are used for the evaluation, and 10% of data from the training set are used for the validation of networks to optimize the learning rate, batch size, number of filters, kernel size, dropout probability, and Gaussian noise probability.

However, for pretraining the CAE, 90% of data are used for the training and 10% of the data are used for validation. When we train the ensemble model, we set number of epochs (NE) to 500, maximum learning rate to 1.0, and number of cycles (NC) to NE/10, giving 50 snapshots for each model for 50 cycles through a grid search, since an exhaustive optimization would be too computationally expensive. Further, we only use a single object property to test how results change with each choice of parameter, due to computational constraints.

However, to get the class-specific heat map, we average all the normalized heat maps from the same class, as suggested in the literature [44]. In the heat map, a higher intensity pixel represents a higher significance to the final prediction, which indicates higher importance of its corresponding gene and its CNV. Finally, top genes are selected based on the intensity rankings in the heat maps as described in Algorithms 1, 2, and 6.

4 Experiment results

We carried out several experiments based on protein-coding genes and oncogenes. By using each network and CNV representation, cancer-type predictions were experimented separately. Results of each experiment with a comparative analysis will be discussed in this section.

Table 4 Performance on cancer-type prediction using noncoding genes

4.1 Experiment setup

All programs were implementation in PythonFootnote 2. The software stack comprised Scikit-learn and Keras with the TensorFlow backend. The network was trained on a computer with an i7 CPU core, 32GB of RAM, Ubuntu 16.04 OS, and Nvidia GTX 1050i GPU with CUDA and cuDNN enabled to make the overall pipeline faster. Results based on the best hyperparameters produced through random search are reported empirically.

We report the results using macro-averaged precision and recall since classes are imbalanced. We did not report F1-scores since it is significant only when the value of precision and recall is very different. Further, for cancer diagnosis, it is important to have both high precision and recall. Hence, results with very different precision and recall are not useful in cancer diagnosing and tumor-type identification. Finally, the MAE is used to report the final predictions.

4.2 Performance analysis of individual model

Classification accuracies using oncogenes and all noncoding genes vary. In particular, using protein-coding genes, classifiers perform moderately well, giving accuracies of 72.96% and 76.77% with Conv-LSTM and CAE network, respectively. Since the classes are imbalanced, only the accuracy will give a very distorted estimation of the cancer types. Class-specific classification reports are thus shown in Tables 4 and 5.

As shown in Table 4, precision and recall for the majority of cancer types were moderately high in which CAE performed mostly better than Conv-LSTM network. In particular, CAE classifier can classify COAD, GBM, KIRC, BRCA, LUSC, and PRAD more confidently (at least 82.50% of the cases). On the contrary, Conv-LSTM classified the OV tumor samples more accurately than CAE classifier (83.56% vs 76.13%). The downside is that both classifiers have made a substantial amount of mistakes, e.g., CAE can classify STAD and UCEC tumor cases only 65% and 68% accurately, whereas the Conv-LSTM network made more mistakes particularly for the STAD, BLCA, THCA, UCEC, LUAD, and LGG tumor samples.

Table 5 Performance on cancer-type prediction using oncogenes

Although the tumor-specific classification accuracies varied across classes, the overall accuracies increased moderately and reached 74.25% and 78.32% using Conv-LSTM and CAE, respectively. As shown in Table 5, precision and recall scores for most of the cancer types are higher than that of protein-coding gene-based CNVs. Similar to the previous experiment, CAE performed mostly better than the Conv-LSTM network across tumor types. In particular, the CAE classifier can correctly classify COAD, GBM, OV, UCEC, PRAD, and BLCA in at least 80% of the cases showing higher confidence.

Fig. 7
figure 7

The ROC curves for Conv-LSTM and CAE models across folds

On the contrary, Conv-LSTM classified the BRCA and THCA tumor samples more accurately than the CAE classifier. The downside is that both classifiers have made substantial mistakes, too. For example, CAE could classify HNSC and LUSC tumor samples accurately in only 69% and 72% of the cases, respectively, whereas the Conv-LSTM network made more mistakes, particularly for the STAD, HNSC, LUSC, and LGG tumor samples. In summary, both classifiers performed moderately well except for certain types of tumor cases such as STAD, HNSC, BLCA, THCA, UCEC, LUAD, LUSC, and LGG. The ROC curves in Fig. 7 based on CNVs from oncogene show that AUC scores generated by both the Conv-LSTM model and CAE classifier are consistent across folds, and AUC scores generated by the CAE classifier are about 4% better than that of the Conv-LSTM network.

4.3 Performance analysis of the ensemble model

The ensemble classifier shows about 2% and 4% performance boost across classes and overall in the cases of oncogenes and protein-coding gene, respectively. The accuracy looks easy enough but makes no distinction between classes, e.g., when a doctor makes a medical diagnosis that a patient has cancer, in reality, the patient does not (i.e., false positive). This has very different consequences than making the call that a patient does not have cancer when he actually does (i.e., false negative). A more detailed breakdown of correct and incorrect classifications for each class using the confusion matrices is shown in Fig. 8. The rows of the matrix correspond to ground truth labels, and the columns represent predictions.

Fig. 8
figure 8

Confusion matrices of the ensemble classifier

There have been significant accuracy improvements in the case of STAD, HNSC, BLCA, THCA, UCEC, and LUSC tumor types, e.g., for the UCEC and LUSC cancer types, not only the accuracies increased slightly, but also the false positive rates reduced significantly. On the contrary, using oncogenes, classifiers were less confused among tumor types, except for the LGG, HNSC, and THCA tumor types. Despite these improvements, there are still high confusion among LUAD, HNSC, and LGG, which is because features from these tumor samples are highly correlated. We could not find an exact reason for such behavior, and a more gene-specific CNV analysis is required to confirm such correlations.

4.4 Validation of the top biomarkers

We report the validation and feature ranking only for the protein-coding gene-based CNVs only. The reason is that the number of protein-coding genes is much higher than that of oncogenes and finding important biomarkers from this large feature space is more challenging. Figure 9 shows some heat maps generated for each class showing similarities across fivefold and displaying a distinct pattern when comparing different cancer types, where each row represents heat maps of one cancer type and columns represent heat maps across five different folds. Circled regions show the similar pattern in all the folds. Although there are some differences among different folds, some patterns are clearly visible.

Table 6 Top five genes and their importance using protein-coding gene-based CNVs
Fig. 9
figure 9

Heat map examples for selected cancer types. Each column represents the result from one fold. Rows represent the heat maps of BRCA, KIRC, COAD, LUAD, and PRAD cancer types (from top-down)

Since we already identified top genes for which the change in CNV to have significant impacts, only five tumor types (i.e., BRCA, KIRC, COAD, LUAD, and PRAD) have at least five genes having feature importance of at least 0.5. The related genes in the top 20 can then be viewed as tumor-specific biomarkers, which contribute most toward making the prediction by the neural networks. Then, top five biomarkers and their importance, MAI higher than 0.5, are shown in Table 6. As for the other nine tumor types, only three genes managed to make in the list and are not reported here.

To further validate our findings, the saturation analysis of cancer genes across 13 tumor types (except for COAD) is obtained from the TumorPortalFootnote 3 [29]. Validation for the COAD cancer is done based on the signature-based approach [54], which was used for predicting the progression of colorectal cancer. However, our approach makes some false identifications, as 21 out of 25 genes are validated to be correct.

4.5 Analysis of the common biomarkers

We report the top ten common biomarkers while extracting the feature importance in Fig. 10 across 14 cancer types. As seen, KRTAP1-1, INPP5K, GAS8, MC1R, POLR2A, BET1P1, NAT2, PSD3, KAT6A, and INTS10 genes are common across all the cancer types giving INTS10, a protein-coding gene, the highest feature importance of 0.6.

Fig. 10
figure 10

Common genes across 14 cancer types

Importantly, identifying all significant common gene sets for a specific cancer type, e.g., BRCA, will help to understand various aspects of BRCA carcinogenesis. Thus, these top genes have very close relations to the corresponding tumor types, which could be viewed as potential biomarkers. However, a more detailed analysis of biological signaling pathways is needed to further validate these findings.

4.6 Comparisons with related works

Two previous studies focused on using CNVs datasets [13, 53] from cBioPortal for Cancer Genomics. They applied incremental feature selection methods to extract CNV features to train classic ML models and managed to achieve good classification accuracy, i.e., 75% and 85% by Zhang et al. and Sanaa et al., respectively.

Since our datasets are collected from TCGA and have more samples that are used to train DL algorithms, a one-to-one comparison was not viable. Our proposed approach achieved 80% classification accuracy, which is about 6% better than our previous approach [22] in terms of every metric. Nevertheless, we attempted to provide more human-interpretable interpretation by identifying statistically significance biomarkers. These analyses are further validated with scientific literature, which confirms that the identified genes are biologically relevant and significant CN changes to these gene’s location may have significant impact on cancer patients.

5 Conclusion

In this paper, we proposed a snapshot neural ensemble method for cancer-type prediction based on CNVs data using two deep architectures called Conv-LSTM and CAE. Our ensemble method is based on cosine annealing techniques, which create multiple model snapshots of these networks. Then, we apply the MAE technique to combine the predictive power of these architectures. Experimental results show that CNVs are not only useful for predicting certain types of cancer but also show an obvious association with cancer growth.

Considering an accuracy of 80% for cancer-type prediction, however, we cannot claim that our model’s confidence is very high. This is for several reasons, such as lack of training samples, poor training, and data sparsity. The latter reason caused each network to produce high training and validation loss of error optimization during the training phase, giving lower accuracy. Data sparsity evolved because, on average, only 422 samples per tumor were used for the training. This is fairly low for such high-dimensional complex data. As a result, models were not trained well enough to identify subtle differences among tumors types.

We lacked enough training samples compared to the complex dataset, as deep architectures usually expect more samples to get trained well. For the same reason, there was a lack of samples to perform model validation during fivefold cross-validation. Another reason for poor training might be an imbalanced dataset. As we can see, some cancer types, such as BRCA, have 2.5 times more training samples than other tumors. For example, STAD and BLCA have at least 15% sample difference with other tumor types. These two tumors have very low classification accuracy, where about half of the samples were misclassified.

Further, depending on the genome sequencing and CNV calling algorithms, different CNs with different CNV lengths and segmentation means might be produced and the results may change significantly. Although we attempted opening the black-box model through the gene validation and feature ranking, a more detailed analysis on biological signaling pathways is needed to further validate these findings. Since multiple factors are involved in cancer diagnosis, e.g., estrogen receptor (ER), progesterone receptor (PGR), and human epidermal growth factor receptor 2 (HER2/neu) statuses for breast cancer, providing AI-based diagnoses might not be accurate solely based on CNVs. This requires using multimodal features based on DNA methylation, GE, miRNA expression, and CNVs data by creating a multiplatform network to support each data type.

Lastly, a DL approach using CNV data along with other types of genomics data from different cohorts such as DNA methylation, GE, and somatic mutations will be more reliable. In future, we intend to extend this work by alleviating the above limitations, such as increasing the number of samples for training, testing, and validation, by combining samples from other sources, such as ICGC and COSMIC.