1 Introduction

Many notable works have so far been carried out to classify malignant tumors using different machine learning techniques [45]. Most of these studies have tried to predict whether a tumor is benign or malignant [62], and their subjects have been homogeneous in terms of tumor origin and scanner modality [41]. Thus, domain-centric prediction of tumors has been exercised far more often than algorithms that predict a tumor accurately irrespective of its originating organ. The real-life prognosis of a tumor is far more complicated than merely labelling it benign or malignant. The American Joint Committee on Cancer (AJCC) [18] has propounded the popular Tumor-Node-Metastasis (TNM) staging system, which depicts how far a tumor has already spread in the body. Table 1 shows an example of how the TNM staging of renal cancers is accomplished.

Table 1 Definition of TNM staging of renal tumors as per AJCC 7th Edition manual

The TNM stage definitions may vary from one tumor-type to another. The overall pathological staging brings the TNM stages under a uniform prognostic group. Table 2 shows the overall AJCC staging of renal tumors. The present study proposes a model that can predict the overall pathological stage of a tumor irrespective of its genre.

Table 2 Overall AJCC staging of renal tumors as per AJCC 7th Edition manual

The deep neural network (DNN) has become a prevalent technology in computer vision and biomedical image processing, with applications ranging from the estimation of blood pressure [52, 70] to the detection of COVID-19 infections [1, 2]. Unlike traditional machine learning algorithms, deep learning does not require explicit image pre-processing [37], segmentation [38], or manual feature crafting. Thus, image processing and re-generation become easier and more effective with deep learning techniques [44]. However, it has been observed that traditional sequential models may lose important image features during the down-sampling phase. Sequential models also suffer from high variance, which may prevent them from producing consistent accuracy. As the problem at hand is a multiclass problem with heterogeneous imagery, the present study adopts a non-sequential paradigm of deep learning. The model has been developed by combining branched, re-injected, and bidirectional recurrent layers [13]. The success of the model would pave the way for a new direction in cancer treatment, as there would be no need to rely on different models for staging different cancers. A single model would work as a decision support system for staging tumors of different genres and would help radiologists decide on a treatment plan with confidence.

2 Objective

The study aims first to prepare an image collection containing eight different cancers: bladder, liver, renal, head & neck, breast, thyroid, uterine, and lung. These cancers have been selected because they are among the leading causes of cancer-related deaths in developed, developing, and under-developed countries alike [10]. The resulting image dataset thus has a varied mix of tumor images in terms of originating organ, imaging modality, subject demography, and treatment strategy. The next aim is to develop a deep neural network model capable of classifying the AJCC stage of such a varied mix of tumors. The problem at hand is complex relative to other contemporary efforts, in which homogeneous image collections have been classified. The present study conducts experiments with both sequential and non-sequential models. The final aim is to compare the results obtained from the different models and to select the best one. The rest of the study is divided into the following sections: related work, data acquisition, methodology, discussion, and conclusion.

3 Related work

Recent notable studies have been included in the review of the literature to compare their limitations and to identify the research gap this study bridges.

Tables 3, 4, 5, 6, 7, 8, 9 and 10 reveal that, on a considerable number of occasions, non-invasive approaches have successfully overshadowed the in-vitro diagnosis of tumors. Machine learning, especially deep learning, has emerged as a seminal technique for CAD-based tumor prognosis. It has also been found that a model ensemble performs better than a single model. Most researchers have so far concentrated on the classification of a single tumor genre. This has elevated the performance of domain-specific classification of tumors; however, initiatives to automate pathological staging have not been seen very often. Existing studies have mostly been engaged in distinguishing between benign and malignant tumors, and accuracy levels dropped whenever the problem at hand went beyond simple binary classification. Many of the studies have been semi-automated, where manual feature extraction created significant processing overhead. The use of transfer learning has created resource-consuming architectures in many of the recent studies. Many research works relied on a single database for carrying out the learning process and ended up with less trustworthy results. Although efforts to classify histological subtypes or grades have been identified in some cases, they too are confined to particular tumor-types. It has also been observed that many of the studies have considered a single scanner modality. As a result, the existing studies have created different models, each of which may detect a particular type of tumor imaged with a certain scanner modality. Thus, there is great scope for developing a model that can identify different tumors across dissimilar scanner modalities. The present study bridges the research gaps found in the related studies and proposes a new model for the automated detection and staging of malignant tumors of different genres.

Table 3 Recent significant studies on bladder cancer
Table 4 Recent significant studies on liver cancer
Table 5 Recent significant studies on renal cancer
Table 6 Recent significant studies on head & neck cancer
Table 7 Recent significant studies on breast cancer
Table 8 Recent significant studies on thyroid cancer
Table 9 Recent significant studies on uterine cancer
Table 10 Recent significant studies on lung cancer

4 Data acquisition

A dataset is prepared from eight different collections in The Cancer Imaging Archive (TCIA) [14], each representing a different tumor. The TCGA-BLCA dataset represents Urothelial Bladder Carcinoma (BLCA); it comprises 111,781 images from 120 patients, with Computed Tomography (CT), Magnetic Resonance (MR), Computed Radiography (CR), Positron Emission Tomography (PET), and Digital Radiography (DX) as the major imaging modalities. TCGA-KIRP depicts kidney renal papillary cell carcinoma; it has 33 cases comprising 376 series and 26,667 images, with CT, MR, and PET as the major modalities. TCGA-LIHC is the Liver Hepatocellular Carcinoma (LIHC) image dataset; it has 97 cases with 1688 series and a total of 125,397 images, again with CT, MR, and PET as the major modalities. The Non-Small Cell Lung Cancer (NSCLC) Radiogenomics dataset has a cohort of 211 subjects and comprises CT and PET/CT images. TCGA-THCA represents thyroid cancer, having 6 cases in the image set with 28 series and 2780 images; the major modalities are CT and PET. TCGA-UCEC represents Uterine Corpus Endometrial Carcinoma; there are 65 cases including 912 series with 75,829 images, and the major modalities are CT, CR, MR, and PET. The Head & Neck Radiomics collection contains clinical data and computed tomography (CT) scans of 137 head and neck squamous cell carcinoma (HNSCC) patients treated by radiotherapy. TCGA-BRCA represents Breast Invasive Carcinoma; it has 164 cases with 1877 series containing 230,167 images, with MR and mammography (MG) as the imaging modalities.

The final image acquisition has been carried out by retrieving images from all the aforementioned collections (Fig. 1). Each subject with pre-surgical DICOM images stored in TCIA is identified by a Patient ID identical to that subject's Patient ID in TCGA. From each of the eight collections, the twenty best scans of every case having pathological data have been taken to form the final image collection. In this way, 717 cases with the supporting clinical and pathological data available have been considered, and from each such case the twenty best scans are extracted. Thus, 14,340 radiological images have been collected to form the new image dataset. This newly prepared image collection is heterogeneous in terms of imaging modalities, cancer types, cancer stages/grades, and the demographic characteristics of patients.
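For illustration, a minimal Python sketch of this retrieval step is given below; the directory layout, the .dcm file naming, and taking the first twenty files in place of the "best" scans are assumptions, not the exact acquisition procedure.

```python
# Hedged sketch: reading up to twenty DICOM scans from one case folder and
# resizing them to 64 x 64. Directory layout and file selection are assumed.
import os
import numpy as np
import pydicom
from PIL import Image

def load_case_scans(case_dir, n_scans=20, size=(64, 64)):
    """Read up to n_scans DICOM files from one case and resize each slice."""
    files = sorted(f for f in os.listdir(case_dir) if f.lower().endswith(".dcm"))
    arrays = []
    for fname in files[:n_scans]:
        pixels = pydicom.dcmread(os.path.join(case_dir, fname)).pixel_array
        img = Image.fromarray(pixels.astype(np.float32))  # mode "F" resizes safely
        arrays.append(np.asarray(img.resize(size)))
    return np.stack(arrays)  # shape: (n_scans, 64, 64)
```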

Fig. 1

Glimpse of the final image collection (1st row represents TCGA-BRCA, 2nd row represents Head & Neck Radiomics, 3rd row represents TCGA-BLCA, 4th row represents TCGA-KIRP, 5th row represents TCGA-LIHC, 6th row represents TCGA-UCEC, 7th row represents TCGA-THCA and 8th row represents NSCLC radiogenomics)

5 Methodology

Equations 1 and 2 represent two key techniques used in the study, namely, branching and re-injection [4], respectively.

$${F}^{(i)}\left(W,X\right)={O}_l^{(i)}=f\left({W}_l^{(i)}{X}_l+{b}_l^{(i)}\right)$$
(1)

Here X is the input vector; W is the weight vector; b is the bias; l is the corresponding layer number; O is the output; i is the branch number; and f(·) is the non-linear activation function.

$${O}_n^{(k)}={O}_l^{(i)}+{O}_m^{(j)}$$
(2)

Where i, j, k are different branches of different convolutional layers l, m, n (n > l ≥ m, and k ≠ i ≠ j).

The concatenation of all the branches [46] is done by using Eq. 3:

$$Y={F}^{\prime}\left(W,X\right)={g}_c\left(\left\{{F}^{(j)}\left(W,X\right)\right\}\right)$$
(3)

In Eq. 3, \(\{F^{(j)}(W,X)\}\) is the collection of output tensors emanating from the \(j\) branches (j = 1, 2, …, n; n > 0), and \(g_c\) is the concatenation operation along axis c of the tensor.
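A minimal Keras functional-API sketch of Eqs. 1 through 3 may look as follows; the filter counts and kernel sizes are illustrative assumptions only.

```python
# Illustrative sketch of branching (Eq. 1), re-injection (Eq. 2), and
# concatenation (Eq. 3) with the Keras functional API; sizes are assumptions.
from tensorflow.keras import Input, layers

x = Input(shape=(64, 64, 3))
o1 = layers.Conv2D(32, 3, padding="same", activation="relu")(x)   # branch i
o2 = layers.Conv2D(32, 3, padding="same", activation="relu")(x)   # branch j
o3 = layers.Add()([o1, o2])                    # Eq. 2: output re-injection
y = layers.Concatenate(axis=-1)([o1, o2, o3])  # Eq. 3: g_c along channel axis
```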

Bidirectional LSTM [69] is described in Eq. 4:

$$\begin{aligned}
{\overrightarrow{H}}_t &= \phi\left({X}_{t}{W}_{xh}^{(f)}+{\overrightarrow{H}}_{t-1}{W}_{hh}^{(f)}+{b}_h^{(f)}\right)\\
{\overleftarrow{H}}_t &= \phi\left({X}_{t}{W}_{xh}^{(b)}+{\overleftarrow{H}}_{t+1}{W}_{hh}^{(b)}+{b}_h^{(b)}\right)\\
{y}_t &= g\left({W}_y\left[{\overrightarrow{H}}_{t},{\overleftarrow{H}}_{t}\right]+{b}_y\right)
\end{aligned}$$
(4)

Where t is the time-step; Xt is the mini-batch input; h is the number of hidden units; \({\overrightarrow{H}}_t\) and \({\overleftarrow{H}}_t\) are the forward and backward hidden states, respectively; and ϕ is the layer activation function.
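In Keras terms, Eq. 4 corresponds to wrapping an LSTM in a Bidirectional layer; a toy sketch with assumed shapes:

```python
# Toy sketch of Eq. 4: a bidirectional LSTM over an assumed (8 time-steps,
# 16 features) sequence, with a per-step softmax output y_t.
import numpy as np
from tensorflow.keras import Input, Model, layers

seq = Input(shape=(8, 16))                              # X_t mini-batches
h = layers.Bidirectional(layers.LSTM(4, return_sequences=True))(seq)
y = layers.Dense(3, activation="softmax")(h)            # y_t = g(W_y[...] + b_y)
model = Model(seq, y)
print(model.predict(np.random.rand(1, 8, 16)).shape)    # (1, 8, 3)
```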

Dense layers [67] are expressed in Eq. 5:

$${Y}_i = W{y}_t + b$$
(5)

Softmax [64] is used for detecting the class scores from the final layer outcome (Eq. 6):

$${\hat{Y}}_i=\mathrm{softmax}\left({Y}_i\right)$$
(6)

In Eq. 6, Ŷi = exp(oi)/Σj exp(oj), where oi is the relative level of confidence [12] of belonging to class i (0 < Ŷi < 1).

The most likely class may then be found using Eq. 7:

$$\hat{i}(o)=\underset{i}{\operatorname{argmax}}\ {o}_i=\underset{i}{\operatorname{argmax}}\ {\hat{Y}}_i$$
(7)

The cross-entropy loss [71] is expressed as Eq. 8:

$$l\left(Y,\hat{Y}\right)=-\sum_{j}{Y}_j\log {\hat{Y}}_j$$
(8)

Where Y is the actual value and Ŷ is the predicted value. The ultimate objective is to minimize the negative log-likelihood [68], or equivalently, to maximize the accuracy (Eq. 9):

$$L^{\ast}\left(Y,\hat{Y}\right)=\underset{{\hat{Y}}_i}{\arg\min}\sum_{i=1}^{n}H\left[p\left({Y}_i\right),p\left({\hat{Y}}_i\right)\right]$$
(9)

Where H[·] is the entropy of a distribution [28]; for a distribution p, it is calculated as in Eq. 10:

$$H\left[p\right]=\sum_j-p(j)\log\ p(j)$$
(10)

The proposed model (Fig. 2) may be described with the help of steps 1 through 9:

  • Step 1. The input tensor is fed into four varied parallel convolutional branches (Eq. 1).

  • Step 2. Each convolutional layer is followed by pooling and normalization layers.

  • Step 3. Layer 2 is added to layer 4 (Eq. 2).

  • Step 4. All the branches are concatenated (Eq. 3).

  • Step 5. The concatenated output is vectorised with a time-step.

  • Step 6. The flattened output is injected into bidirectional recurrent layers (Eq. 4).

  • Step 7. The recurrent layer is followed by fully connected dense layers (Eq. 5).

  • Step 8. Class scores and the most likely class are computed using Eqs. 6 and 7.

  • Step 9. Loss is measured and minimized by Eqs. 8, 9, and 10.

Fig. 2

Proposed non-sequential recurrent deep neural network model

Unlike a typical sequential model, the combination of branching and re-injecting layers keeps important features alive in the system. In each branch, the initial point-wise convolutional layer determines features that mix information across the channels of the input tensor. Four dissimilar branches form a heterogeneous ensemble that helps surpass the limitations of a typical sequential model. Re-injection ensures that, even if the output of a layer becomes small after activation or down-sampling, it is regenerated from the original layer output. These layers implicitly perform steps such as segmentation and feature selection, which would have to be done manually in a traditional machine learning model. The time-distributed flattening layer vectorises the concatenated output and appends the time-step dimension needed by the following bidirectional recurrent layers. The output of the flattened layer passes through two bi-directional Long Short-Term Memory (LSTM) layers, which can learn from both the preceding and succeeding time-steps. This strengthens the classifier, as the weights and biases are adjusted from both directions. Finally, the fully connected dense layers produce the class scores (Fig. 2).
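A minimal Keras sketch of steps 1 through 9 is given below for illustration; the filter counts, LSTM units, and the reshape that supplies the time-step are assumptions rather than the exact configuration used in the study.

```python
# Hedged sketch of the proposed non-sequential model (steps 1-9); filter
# counts, units, and the time-step reshape are assumptions, not the exact
# published configuration.
from tensorflow.keras import Input, Model, layers, regularizers

def build_nsrme(input_shape=(64, 64, 3), n_classes=4):
    inp = Input(shape=input_shape)

    def branch(x, filters):
        # Steps 1-2: point-wise then spatial convolution, normalization, pooling
        x = layers.Conv2D(filters, 1, activation="relu",
                          kernel_regularizer=regularizers.l2(1e-4))(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_regularizer=regularizers.l2(1e-4))(x)
        x = layers.BatchNormalization()(x)
        return layers.MaxPooling2D()(x)

    b1 = branch(inp, 32)
    b2 = branch(inp, 64)
    b3 = branch(inp, 32)
    b4 = branch(inp, 64)
    b2 = layers.Add()([b2, b4])                  # Step 3: re-injection (Eq. 2)
    x = layers.Concatenate()([b1, b2, b3, b4])   # Step 4: concatenation (Eq. 3)

    # Step 5: treat each spatial row as one time-step for the recurrent layers
    x = layers.Reshape((int(x.shape[1]), -1))(x)

    # Step 6: bidirectional LSTM layers (Eq. 4)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)

    # Steps 7-8: dense layers (Eq. 5) and softmax class scores (Eq. 6)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out)
```

Calling model.summary() on this sketch shows the four branches merging before the recurrent block, mirroring the topology of Fig. 2.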

All the images are resized to 64 × 64 for convenience in processing. After conversion to pixel arrays, the input dataset takes the shape of a rank-4 tensor: (number of samples, image height, image width, number of color channels). From the available clinical data (Fig. 3), the AJCC label corresponding to each patient has been tied to the respective image array. The whole dataset has been compressed and loaded for the experiment (Fig. 4). The class imbalance issue has been resolved by Synthetic Minority Over-sampling (SMOTE) [9]. The input has been fed into the proposed model and run for up to 2000 epochs with an early-stopping callback having a patience value of 200. The experiment has been done with repeated stratified 10-fold cross-validation (Fig. 5): stratified 10-fold splitting is repeated 10 times with different randomization in each repetition, using a random state value of 999. This reduces preprocessing bias and correlation between folds so that the accuracy never gets artificially inflated. Training and validation data have been rescaled for standardization. Each convolutional layer has been regularized by the L2 (Euclidean) norm and followed by batch normalization and MaxPooling or AveragePooling layers for normalizing and down-sampling the spatial features. Default padding and strides are used. The batch size is 128, and Adam is used as the optimizer with a learning rate of 1e-4. All the hyper-parameters have been determined through prolonged experimentation. Finally, the training and validation accuracies are measured, and other evaluation metrics [39], such as the ROC-AUC score, Kappa statistic, and F1-score, are derived from the confusion matrix. Similar experiments are carried out with other sequential models that performed well in the past in a similar domain [40]. All the experiments have been performed with Python 3.6.8 (IPython 7.5.0) [17].
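The protocol above may be sketched roughly as follows; the stand-in data, the flattening of images for SMOTE, the use of sparse integer labels, and the exact metric calls are assumptions for illustration, not the study's own code.

```python
# Hedged sketch of the training/evaluation protocol described above, reusing
# build_nsrme from the earlier model sketch. X/y below are synthetic stand-ins
# for the rank-4 image tensor and integer AJCC labels.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import cohen_kappa_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

X = np.random.rand(200, 64, 64, 3).astype("float32") * 255  # stand-in images
y = np.random.randint(0, 4, size=200)                        # stand-in labels

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=999)
for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # SMOTE needs a 2-D feature matrix: flatten, over-sample, reshape back
    X_flat, y_tr = SMOTE(random_state=999).fit_resample(
        X_tr.reshape(len(X_tr), -1), y_tr)
    X_tr = X_flat.reshape(-1, *X.shape[1:]) / 255.0          # rescale
    X_val, y_val = X[val_idx] / 255.0, y[val_idx]

    model = build_nsrme(input_shape=X.shape[1:], n_classes=len(np.unique(y)))
    model.compile(optimizer=Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=2000,
              batch_size=128,
              callbacks=[EarlyStopping(patience=200, restore_best_weights=True)])

    probs = model.predict(X_val)
    preds = probs.argmax(axis=1)
    print(cohen_kappa_score(y_val, preds),
          f1_score(y_val, preds, average="weighted"),
          roc_auc_score(y_val, probs, multi_class="ovr"))
```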

Fig. 3

Glimpse of clinical data from TCGA-BLCA

Fig. 4

An instance of pixel arrays and their corresponding AJCC class labels extracted from clinical data

Fig. 5

Flow diagram of the overall classification method

6 Discussion

The proposed Non-Sequential Recurrent Model Ensemble (NSRME) has been run on the newly formed dataset alongside other models: a sequential CNN model (Fig. 6) and a sequential CNN model combined with a bi-directional Recurrent Neural Network (CNN + BiRNN) (Fig. 7). The latter two models were quite successful in classifying NSCLC TNM staging and histology subtypes, respectively, using the TCIA Radiogenomics dataset. The best and average results attained by these models are compared and analysed (Tables 11 and 12).

Fig. 6

Sequential CNN model used in the study

Fig. 7

Sequential CNN + BiRNN model used in the study

Table 11 Best evaluation results of different models measured by various metrics
Table 12 Average results (with standard deviations) of different models evaluated by various metrics

The best training and validation accuracy of the proposed model is found in iteration 2 at epoch 123, whereas the best accuracy of CNN + BiRNN is found in iteration 4 at epoch 216 and that of the CNN model in iteration 5 at epoch 267. From Table 11, it may be interpreted that the proposed model's performance is ahead of the other sequential models. A kappa statistic nearing 1 is quite encouraging, as are the high ROC-AUC and F1 scores. The evaluation results (Tables 11 and 12) depict a high True Positive Rate (TPR) and fewer type-I and type-II errors, i.e., fewer False Positives (FPs) and fewer False Negatives (FNs).

From Figs. 8, 9, and 10 it may be observed that the average validation accuracy and loss of the proposed model are better than those of the other sequential models. From Table 12, the average ROC-AUC score of the proposed model is higher than the average ROC-AUC scores of the CNN + BiRNN model and the sequential CNN model. Deviations are also smaller with the proposed non-sequential model (Table 12). These results imply a lower miss rate and fewer false alarms. This has happened because the preprocessing layers of the proposed non-sequential model did not let important features vanish during down-sampling, while the bidirectional LSTM layers memorized important features emanating from both the forward and backward paths. Average memory usage during execution of the non-sequential model also decreased to about 50%, compared with around 80% during sequential model execution. This is because the inception-style layers act as cheaper filters and fewer time-distributed layers are used than in the CNN + BiRNN model. Thus, it may be concluded that the proposed model has performed steadily in classifying heterogeneous classes of tumors.

Fig. 8

Training and validation accuracy and loss of the sequential CNN model

Fig. 9

Training and validation accuracy and loss of the sequential CNN + BiRNN model

Fig. 10

Training and validation accuracy and loss of the proposed non-sequential model

When the results of the newly proposed model are compared with those of recent notable studies (Table 13), it is found that the non-sequential recurrent ensemble of deep neural networks has performed satisfactorily.

Table 13 Comparison of leading studies concerning the parameter Area under the ROC Curve (AUC)

In Table 13, recent prominent studies have been compared with the proposed one using one of the most trustworthy parameters, the Area under the ROC Curve (AUC), which depicts the aggregated performance of a classifier across all possible threshold values. From Table 13, it is evident that the performance of the proposed model has indeed matched the top performers in various genres. Most of the existing studies are based on a single tumor-type and a single imaging modality. They often considered a single database and tried to detect subtypes or grades of a particular cancer type. It has been observed that many of them relied on manually crafted features; thus, many important features were ignored, and performance suffered whenever the number of classes increased. These problems were mitigated by training the proposed model from scratch and by using automatically learned features. With the newly proposed model, the dataset was a mix of eight databases, the imaging modalities were diverse, and the task was more complicated than tumor grading, as the number of target classes was larger. Despite such intricacy, the non-sequential recurrent model ensemble (NSRME) has truly matched the performance of the leading recent studies. This speaks in favour of the promise of the proposed model. The study may be considered a momentous step towards a model that eliminates the need for different models to identify different tumor-types.

7 Conclusion

No other model in the existing literature could classify such a varied mix of malignant tumor imagery with such high accuracy; herein lies the novelty of the study. The scientific contribution of the study is also manifold. Unlike existing models, it helps in determining the overall prognostic group of a tumor irrespective of its type and imaging modality. Once the overall pathological stage is determined, the TNM staging of the respective tumor may also be derived easily. The proposed model may determine histopathological grades and subtypes of different tumors with little customization. In this way, the present study may help medical personnel determine the stage or grade of tumors with greater confidence. The present study could not include many tumor genres for lack of clinical data; in the future, other types of tumors may be brought within the scope of the study. The model may also be extended to diagnose blood cancers such as leukemia, where no solid tumor is formed. Experiments may be carried out with different hyper-parameters and meta-learners to improve the model further.