1 Introduction

There are four kinds of blood cancer [1], classified by cell type, such as acute leukemia (AL) or chronic leukemia (CL), and by source type, such as myeloid (M) or lymphoid (L). Acute lymphoblastic leukemia (ALL) [2] is a severe, fatal leukemia caused by rapidly progressing malignant lymphoid cells. Hematologists recognize ALL using microscopic testing of blood samples through a WBC count [3], which reveals small, spherical, homogeneous blast cells with sparse cytoplasm and nuclei having single nucleoli. However, this microscopic test is an initial manual screening for ALL that often results in errors and changes in the diagnostic process because of the visual homogeneity between normal and ALL cells. Alternatively, ALL can be diagnosed with complex, painful, and invasive procedures such as bone marrow biopsy [4].

According to leukemia statistics [5], 6660 patients were expected to be diagnosed with ALL, and 1560 deaths were estimated from ALL, in 2022. The majority of ALL cases occur in children, yet most ALL deaths occur in adults. As a result, researchers have emphasized that treating ALL is not easy because of the risk of destroying healthy WBCs together with cancerous cells, thereby damaging the patient's immune system [6]. To tackle these issues, the correct diagnosis of leukemia relies on non-invasive methods that analyze ALL microscopic images using computational classifiers to increase the survival rate of affected patients [7].

With the growth of modern diagnostic technology, machine learning approaches are used in conjunction with deep learning approaches, which are less susceptible to errors in classifying ALL images. Automated approaches for leukemia detection have therefore been developed to ease the burden on healthcare professionals of reading microscopic leukemia images and distinguishing normal WBCs from ALL. Using a computer-aided leukemia approach, inexperienced healthcare workers can screen for ALL with a smaller workforce, lower costs, and time savings [7].

Machine learning approaches [8,9,10] are used for ALL prognosis based on image pre-processing and feature extraction. Among these approaches, the most effective common features for ALL classification are color, texture, and statistical features, while the SVM classifier showed superior accuracy in detecting ALL compared with other classifiers. However, machine-learning-based ALL detection is cumbersome, time-consuming, and requires programmer expertise to select appropriate features. Sometimes the selected features describing ALL images are unsuitable for building a preferable classifier model. Also, overfitting is more likely in a classifier trained on a limited dataset, so the result may not be ideal.

Studies [11, 12] have used CNN-based ALL detection approaches and demonstrated that an ensemble model outperforms any single CNN classification model.

The first public database for leukemia is ALL-IDB [13], with two versions used for classifying ALL and normal microscopic images. Putzu et al. [14] demonstrated a machine learning approach for ALL classification with an accuracy of 93% using ALL-IDB1.

The deep learning approach has been extended to employ swarm optimization for ALL differentiation with an accuracy of 90% [15].

Kokeb Dese et al. [16] developed a classification approach for four types of leukemia using SVM on a real microscopic database, with an accuracy of 97.69%.

Another microscopic cell database, C-NMC 2019 [17], provides generic data with unbalanced categories of normal and ALL images.

The ensemble learning method [18] and the ResNeXt model [19] were proposed to classify ALL using the C-NMC 2019 database, with F1-scores of 84% and 87.89%, respectively.

A weighted CNN ensemble [20] was employed to refine the classification of ALL images from the C-NMC 2019 database, reaching an F1-score of 88.6%.

The shortcomings of previous research are summarized as follows:

  1. CNN models need large amounts of data for training, and the available ALL databases are small; applying CNN models to ALL databases is therefore difficult.

  2. Machine learning algorithms achieve low performance on ALL images.

  3. The classes of normal and abnormal blood images in the databases are unbalanced.

  4. Model performance for classifying ALL images varies because of differences in image characteristics, image division, and image pre-processing, as well as the limited number of ALL images; comparative studies between previous and existing models are therefore complex.

  5. Most previous works reported only the F1-score to evaluate the ALL diagnostic model, neglecting other performance metrics.

  6. The combined application of deep learning and machine learning methods in ALL diagnosis has achieved little so far.

  7. Public databases of ALL images are scarcer than those of other cancer images.

  8. The accuracy of existing ALL recognition models is unsatisfactory.

There is still a scientific call for innovative ALL detection algorithms that leverage ensemble learning methods and the feature extraction of machine learning models. This research focuses on solving these shortcomings of previous works on ALL images.

In this aspect, the proposed ensemble model is designed for ALL detection using deep features and MSVM as a classifier on the C-NMC 2019 database. Deep spatial features, or "low-level spatial features," are derived from CNNs, and temporal features, or "high-level temporal features," are derived from BiLSTM [21] and GRU [22], to distinguish between healthy and ALL images accurately.

2 Related work

There have been some trials of automated CNN approaches to handle the classification of ALL and normal cells on the recent C-NMC 2019 database [17], as summarized in Table 1. Notably, the F1-score is commonly used to evaluate the CNN models applied to the imbalanced C-NMC 2019 database.

Table 1 Deep learning models for ALL classification on the C-NMC 2019 database

Recently, Chen et al. [23] applied the Resnet101-9 ensemble model to the C-NMC 2019 database with an accuracy of 85.11% and an F1-score of 88.94% for identifying ALL images.

A weighted ensemble CNN model was developed by Mondal et al. [24] for ALL recognition on the same C-NMC 2019 database, with an F1-score of 89.7% and an accuracy of 88.3%.

Marzahl et al. [25] explored a ResNet18 model for classifying ALL images extracted from the C-NMC 2019 database using normalization and augmentation methods with an F1-score of 87.5%.

Another framework [26] adapted different versions of the ResNeXt model to achieve good performance with an F1-score of 85.7%. The MobileNet-V2 model was applied to the C-NMC 2019 database [27] to achieve an F1-score of 87.0%.

Pan et al. [28] proposed a fine-tuned ResNet trained on the C-NMC 2019 database for normal and ALL image classification.

The highest F1-scores were achieved using the heterogeneity loss function [29] and the NasNetLarge architecture [30] on the C-NMC 2019 database to discern the ALL microscopic images.

Moreover, performance comparisons for ALL recognition on the C-NMC 2019 database are often inapplicable due to variance in image size, image division, evaluation parameters, and image pre-processing, and because only a small percentage of datasets are publicly available. The problem with CNN models is that their good performance depends on a large dataset, yet in reality a large public dataset of ALL images is unavailable.

Therefore, the classification of ALL images within the C-NMC 2019 database remains a major challenge for early-stage diagnosis with better performance.

Considering machine learning approaches on ALL images, the K-nearest neighbor (KNN) algorithm was used to detect normal and blast cells on 108 blood images [31]. Also, the SVM classifier was applied to differentiate normal cells from blast cells on 958 microscopic images [32]. Texture features have been proposed for leukemia classification using SVM with a Gaussian radial basis kernel to achieve good performance [14]. Local binary pattern and geometric texture features have also been suggested for ALL detection using the SVM classifier [33]. For the ALL classification task, the SVM classifier is commonly used to classify blood images into normal and ALL classes [34]. Recently, the MSVM has also been used for ALL classification with 94.6% accuracy [9]. The shortcoming of previous machine learning works on ALL images is their reliance on small datasets, which leads to overfitting in ALL classification models. Additionally, the manual features extracted from ALL images are often inappropriate for categorizing them, and these approaches achieve only limited diagnostic accuracy. Currently, there is a demand for automated features and a large dataset to diagnose ALL accurately.

3 Methodology

3.1 Proposed framework

The proposed ensemble framework for ALL image classification from the C-NMC 2019 database is shown in Fig. 1. This framework consists of four phases: pre-processing; CNN models that extract spatial features from the image data; a GRU-BiLSTM architecture that receives these features for sequence learning; and a classifier. Image pre-processing comprises oversampling and splitting into training and test data. The GRU-BiLSTM architecture consists of a GRU layer with 500 hidden units, two BiLSTM layers with 500 hidden units each, and dropout layers. The dropout layers are applied to relieve overfitting by selecting fewer features. The CNN models are five different networks: ResNet-101, GoogleNet, SqueezeNet, DenseNet-201, and MobileNetV2. Adopting the deep features and test data leads to classification based on MSVM to obtain the final, better performance. For the classification method, two techniques were implemented to classify the pre-processed images: (1) the MSVM classifier, and (2) a fully connected layer (FCL) and softmax layer used as a classifier.

Fig. 1 Proposed ensemble framework for ALL image classification from the C-NMC 2019 database

3.2 Dataset description

The publicly available C-NMC 2019 dataset [17], extracted from The Cancer Imaging Archive (TCIA) and used in the ISBI competition, is designed to identify normal B-lymphoid precursors from ALL cells in 12,528 microscopic images, including 8491 ALL images and 4037 normal images.

This database originates from 101 cases, including 60 infected cases with ALL and 41 normal cases.

This paper uses the training and preliminary testing datasets, but not the final testing dataset. Sample images from the C-NMC 2019 dataset are shown in Fig. 2. Each microscopic image has 450 × 450 pixels, with two imbalanced classes [35].

Fig. 2 Microscopic images from the C-NMC 2019 dataset: A ALL cells and B healthy cells

3.3 Data pre-processing

Remarkably, the microscopic images from the C-NMC 2019 dataset are unbalanced, with more ALL images than normal images in the training set. This unbalanced class distribution biases the classifier toward the ALL class and encourages overfitting.

The original microscopic images were pre-processed using oversampling [36] to remove unbalanced classes in the training dataset before applying them to the network. In this paper, the number of normal-cell images was increased to 8491, equal to the number of ALL images. The resulting 16,982 images were divided using a 9:1 split into training and test sets, respectively: the training dataset contains 7642 images per class, and the testing dataset contains 849 images per class.
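As an illustration, the following is a minimal MATLAB sketch of this pre-processing step, assuming the ALL and normal image paths have already been collected into cell arrays allFiles and normalFiles (these names are illustrative, not from the paper):

```matlab
% Hedged sketch: random oversampling of the minority (normal) class
% followed by a stratified 9:1 split, as described in Sect. 3.3.
nALL = numel(allFiles);                              % 8491 ALL images
idx  = randsample(numel(normalFiles), nALL, true);   % sample with replacement
files  = [allFiles(:); normalFiles(idx)];
labels = categorical([repmat("ALL", nALL, 1); repmat("normal", nALL, 1)]);

cv = cvpartition(labels, 'HoldOut', 0.1);            % stratified 9:1 split
trainFiles = files(training(cv));  trainLabels = labels(training(cv));
testFiles  = files(test(cv));      testLabels  = labels(test(cv));
```

With 8491 images per class, this split yields the stated 7642 training and 849 test images per class.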

3.4 Spatial feature extraction based on CNN, temporal feature extraction based on GRU-BiLSTM architecture, and softmax layer as a classifier

A CNN can learn to acquire low-level spatial features but cannot learn sequential correlations or acquire high-level temporal features. On the other hand, a recurrent neural network (RNN) is a type of neural network specifically designed for sequence learning, that is, the realization of temporal features. RNNs are able to handle time-series data by applying recurrent hidden states, whose activation at each time step in a sequence depends on the activation of the prior time step(s) [37]. Moreover, several studies have discovered that the standard RNN faces the vanishing gradient problem when the data contain long-term correlations [38]. To address this problem, two prominent types of RNNs have been proposed: the LSTM [39] and the GRU [40]. Notably, many studies have revealed that the LSTM is more successful than the GRU. Although the GRU [41] is a revised variant of the LSTM, their working procedures are broadly identical. The GRU is faster than the LSTM because it relies on less memory and fewer training parameters; in contrast, the LSTM is time-consuming, especially for large data. A bidirectional development of the LSTM is the BiLSTM [21]. Therefore, a hybrid technique based on these two favorable networks is introduced to merge their strengths into a single one. The primary objective of this paper is to show that the combined strength of GRU and BiLSTM can be used for ALL image classification. This work extracted spatial features from the ALL training images using CNNs to decrease network training complexity by reducing parameter sizes and input dimensions. A combined CNN-GRU-BiLSTM deep learning strategy is presented to extract spatial-temporal features and learn the dependency between features, and a softmax function is finally used for classification, as shown in Fig. 3. The spatiotemporal features were extracted from the training images using six layers: CNN, a GRU layer with 500 hidden units, a BiLSTM layer with 500 hidden units, a dropout layer of 0.25, a second BiLSTM layer with 500 hidden units, and a second dropout layer of 0.25, as shown in Fig. 1. All the CNN features generated by the five CNN models were used to produce the feature vectors for each model, as shown in Table 2.
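For concreteness, a hedged MATLAB sketch of the spatial-feature step follows; it taps the global average pooling layer of a pre-trained network ('avg_pool' is the layer name in MATLAB's densenet201, the name differs per model, and imdsTrain is an assumed imageDatastore):

```matlab
% Sketch under stated assumptions: extract deep spatial features from
% a pre-trained CNN for later sequence learning and classification.
net = densenet201;                           % one of the five CNN models
inputSize = net.Layers(1).InputSize;         % e.g., 224 x 224 x 3
augTrain  = augmentedImageDatastore(inputSize(1:2), imdsTrain);
spatialFeats = activations(net, augTrain, 'avg_pool', 'OutputAs', 'rows');
% spatialFeats: one row of pooled CNN features per training image
```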

Fig. 3 Proposed GRU-BiLSTM architecture for the deep learning predictive network with a softmax layer

Table 2 Description of utilized features according to five CNN models and their corresponding feature vectors for each image

This study used five pre-trained CNN models as below:

  • ResNet-101 architecture [42] consists of 101 layers, but it suffers from complexity and time cost.

  • GoogleNet architecture [43] consists of 22 layers.

  • SqueezeNet architecture [44] consists of 18 layers and is based on a smaller network with few parameters.

  • DenseNet201 architecture [45] consists of 201 layers. The DenseNet architecture is more powerful than the ResNet architecture, but it needs more memory in the training step.

  • MobileNetV2 architecture [46] focuses on an inverted residual model to filter the effective features.

The GRU [22] algorithm is based on the RNN [47]. In an RNN, gradient problems and high computation time arise because it cannot learn long-term dependencies. An RNN produces a hidden state \(h_t\) and an output \(o_t\), as shown in Eqs. (1) and (2):

$$h_{t} = f\left( {W_{h} h_{t - 1} + W_{{{\text{hx}}}} x_{t} + b_{h} } \right)$$
(1)
$$o_{t} = f\left( {W_{0} h_{t} + b_{0} } \right)$$
(2)

where f is the activation function on all nodes in the RNN, \(h_t\) is the hidden state, t is the time step, W is a weight, and b is a bias. To reduce the computation time of the RNN, the GRU is used to reduce the features extracted from the CNN models by using the update gate (\(z_t\)), reset gate (\(r_t\)), and current memory gate (\(\tilde{h}_{t}\)), as in Eqs. (3), (4), and (5):

$$z_{t} = \sigma \left( {W_{XZ } X_{t } + W_{0Z} O_{t - 1 } + b_{z} } \right)$$
(3)
$$r_{t} = \sigma \left( {W_{Xr } X_{t } + W_{0r} O_{t - 1 } + b_{r} } \right)$$
(4)
$$\tilde{h}_{t} = f\left( {W_{h} X_{t} + W_{hx} \left( {r_{t} \odot h_{t - 1} } \right) + b_{h} } \right)$$
(5)

where \(W_{XZ}\), \(W_{Xr}\), and \(W_{h}\) are weights of the input vector, while \(W_{0Z}\), \(W_{0r}\), and \(W_{hx}\) are weights of the preceding time step, and \(b_z\), \(b_r\), and \(b_h\) are biases. The update gate determines how much prior information must be transferred to the future; its function in the GRU is similar to that of the output gate in the LSTM. The reset gate determines how much prior information should be forgotten; its function in the GRU is similar to that of the input and forget gates in an LSTM. The current memory gate reduces the impact that prior information has on the present information being transferred to the future.
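To make Eqs. (3)-(5) concrete, the following is a minimal MATLAB sketch of a single GRU step. The final blend \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\) is the standard GRU state update, which the text does not list explicitly, and the weight/bias structs are illustrative:

```matlab
% One GRU time step implementing Eqs. (3)-(5) plus the standard state
% update; W.* are input weights, U.* recurrent weights, b.* biases.
function h = gruStep(x, hPrev, W, U, b)
sigmoid = @(a) 1 ./ (1 + exp(-a));
z = sigmoid(W.z * x + U.z * hPrev + b.z);            % update gate, Eq. (3)
r = sigmoid(W.r * x + U.r * hPrev + b.r);            % reset gate, Eq. (4)
hTilde = tanh(W.h * x + U.h * (r .* hPrev) + b.h);   % candidate state, Eq. (5)
h = (1 - z) .* hPrev + z .* hTilde;                  % standard GRU blend
end
```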

The output of the GRU layer is used as input to the BiLSTM [48], which is based on a forward LSTM and a backward LSTM to recognize features in both the forward and backward directions. The LSTM is based on long-term memory using three gates: the input gate (\(i_t\)), forget gate (\(f_t\)), and output gate (\(g_t\)), as in Eqs. (6), (7), and (8):

$$i_{t} = \sigma \left( {W_{Xi } X_{t } + W_{hi} h_{t - 1 } + b_{i} } \right)$$
(6)
$$f_{t} = \sigma \left( {W_{Xf } X_{t } + W_{hf} h_{t - 1 } + b_{f} } \right)$$
(7)
$$g_{t} = \tanh \left( {W_{Xg} X_{t} + W_{hg} h_{t - 1} + b_{g} } \right)$$
(8)

where \(W_{Xi}\), \(W_{Xf}\), and \(W_{Xg}\) are weights of the input vector, while \(W_{hi}\), \(W_{hf}\), and \(W_{hg}\) are weights of the preceding time step, and \(b_i\), \(b_f\), and \(b_g\) are biases.
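A companion sketch for Eqs. (6)-(8) follows; the cell-state update \(c_t\) and the hidden-state output, which the text does not list, are taken from the standard LSTM formulation, where the paper's \(g_t\) plays the role of the candidate cell input:

```matlab
% One LSTM time step: Eqs. (6)-(8) plus the standard cell/hidden-state
% updates; W.* are input weights, U.* recurrent weights, b.* biases.
function [h, c] = lstmStep(x, hPrev, cPrev, W, U, b)
sigmoid = @(a) 1 ./ (1 + exp(-a));
i = sigmoid(W.i * x + U.i * hPrev + b.i);   % input gate, Eq. (6)
f = sigmoid(W.f * x + U.f * hPrev + b.f);   % forget gate, Eq. (7)
g = tanh(W.g * x + U.g * hPrev + b.g);      % candidate input, Eq. (8)
o = sigmoid(W.o * x + U.o * hPrev + b.o);   % output gate (standard LSTM)
c = f .* cPrev + i .* g;                    % cell-state update
h = o .* tanh(c);                           % hidden state
end
```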

Finally, 1000 combined features were generated from each image using the six stacked layers and stored in a feature vector.

The proposed GRU-BiLSTM architecture for the deep learning predictive network includes four blocks, as shown in Fig. 3. The first block provides the deep features from the CNN. The second block consists of six layers producing temporal features based on the GRU-BiLSTM architecture: a sequence input layer, a GRU layer with 500 hidden units, a BiLSTM layer with 500 hidden units, a dropout layer of 0.25, a second BiLSTM layer with 500 hidden units, and a second dropout layer of 0.25. The third block performs classification using an FCL with a softmax layer. The last block is the output layer, which is used to measure the classifier's performance.
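A minimal MATLAB (Deep Learning Toolbox) sketch of the second and third blocks follows, assuming 500 hidden units per recurrent layer as stated above; the input feature length is model-dependent (Table 2), and 1920 is used here as an assumed example corresponding to DenseNet-201's pooled features:

```matlab
featDim    = 1920;   % assumed CNN feature length per image (model-dependent)
numClasses = 2;      % ALL vs. normal
layers = [
    sequenceInputLayer(featDim)
    gruLayer(500, 'OutputMode', 'sequence')      % GRU, 500 hidden units
    bilstmLayer(500, 'OutputMode', 'sequence')   % first BiLSTM
    dropoutLayer(0.25)
    bilstmLayer(500, 'OutputMode', 'last')       % second BiLSTM
    dropoutLayer(0.25)
    fullyConnectedLayer(numClasses)              % FCL (third block)
    softmaxLayer
    classificationLayer];
```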

3.5 Feature extraction based on CNN-BiLSTM architecture and MSVM as a classifier

All 1000-element feature vectors extracted from the CNN-BiLSTM architecture are used as input to the MSVM classifier [49], as shown in Fig. 4. The first and fourth blocks in Fig. 4 are the same as in Fig. 3.

Fig. 4 Proposed GRU-BiLSTM architecture for the deep learning predictive network with the MSVM classifier

In Fig. 4, the second block consists of a sequence input layer, a GRU layer with 500 hidden units, a BiLSTM layer with 500 hidden units, a dropout layer of 0.25, and a second BiLSTM layer with 500 hidden units. The third block is the MSVM classifier.

This classifier is based on the Vapnik–Chervonenkis (VC) dimension of the multiclass binary learning model. In this paper, the MSVM classifier employs a one-against-all approach to classify data points from the \(n\) classes in the dataset, using a kernel function [49]. The MSVM classifier also showed good results on ALL images in previous work [50].
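A hedged MATLAB sketch of this classification step follows, assuming XTrain and XTest hold one 1000-element deep feature vector per row; the Gaussian kernel is an assumption, since the text does not name the kernel:

```matlab
% One-against-all multiclass SVM (MSVM) on the deep feature vectors.
template = templateSVM('KernelFunction', 'gaussian');   % assumed kernel
mdl = fitcecoc(XTrain, trainLabels, ...
    'Learners', template, 'Coding', 'onevsall');        % one-vs-all MSVM
predLabels = predict(mdl, XTest);                       % test-set labels
```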

4 Empirical results

The proposed model was implemented in MATLAB 2020a on a computer with 750 GB RAM and an Intel® Core™ i9 processor. The proposed model was applied to the C-NMC 2019 dataset [17], which contains 8491 ALL images and 4037 healthy images. To alleviate the class imbalance problem in the C-NMC 2019 database, the proposed model initiated the oversampling process to balance normal and ALL cells in the training phase. In this paper, the training dataset had 7642 images for ALL cells and 7642 images for healthy cells, while the testing dataset had 849 images for ALL cells and 849 images for healthy cells.

The main objective is to identify the CNN models compatible with BiLSTM [21] and GRU that provide good performance for ALL image classification.

The results of the proposed model were divided into two groups. Group (1) used five CNN models, two BiLSTM layers, and a GRU layer for feature extraction, with MSVM as the classifier. Group (2) used the same five CNN models, two BiLSTM layers, and GRU layer for feature extraction, with an FCL and softmax layer as the classifier instead of MSVM.

4.1 Performance metrics

The quality of both proposed models in classifying ALL and normal images is measured by five metrics: accuracy, sensitivity, specificity, precision, and F1-score, as presented in Eqs. (9)-(13). Accuracy is the number of ALL and healthy images correctly categorized, divided by the total number of images in the test set. Sensitivity is the number of correctly categorized ALL images divided by all actual ALL images. Specificity is the number of correctly categorized healthy images divided by all actual healthy images. Precision is the accuracy of categorizing an image as ALL. The F1-score is the harmonic mean of sensitivity and precision.

$${\text{Accuracy}} = \frac{{\mathop \sum \nolimits_{i = 1 }^{c} \frac{{{\text{TP}}_{i} + {\text{TN}}_{i} }}{{{\text{TP}}_{i} + {\text{FN}}_{i} + {\text{FP}}_{i} + {\text{TN}}_{i} }} }}{c}$$
(9)
$${\text{Sensitivity }} = \frac{{\mathop \sum \nolimits_{i = 1 }^{c} \frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FN}}_{i} }} }}{c}$$
(10)
$${\text{Specificity}} = \frac{{\mathop \sum \nolimits_{i = 1 }^{c} \frac{{{\text{TN}}_{i} }}{{{\text{TN}}_{i} + {\text{FP}}_{i} }} }}{c}$$
(11)
$${\text{Precision}} = \frac{{\mathop \sum \nolimits_{i = 1}^{c} \frac{{{\text{TP}}_{i} }}{{{\text{TP}}_{i} + {\text{FP}}_{i} }} }}{c}$$
(12)
$${\text{F1-score}} = 2\,\frac{{{\text{Precision}} \times {\text{Sensitivity}}}}{{{\text{Precision}} + {\text{Sensitivity}}}}$$
(13)

where c is the number of classes and i indexes the class; TP indicates the number of correctly categorized ALL images, TN the number of correctly categorized healthy images, FN the number of ALL images wrongly categorized as healthy, and FP the number of healthy images wrongly categorized as ALL.

Also, the confusion matrix is employed to present the findings on the test images for both proposed models.
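The following is a minimal MATLAB sketch of how these metrics can be computed from the test-set confusion matrix, assuming the first category is the positive (ALL) class:

```matlab
% Metrics of Eqs. (9)-(13) for the two-class case, ALL as positive class.
C = confusionmat(testLabels, predLabels);    % rows = true, cols = predicted
TP = C(1,1); FN = C(1,2); FP = C(2,1); TN = C(2,2);
accuracy    = (TP + TN) / sum(C(:));
sensitivity = TP / (TP + FN);
specificity = TN / (TN + FP);
precision   = TP / (TP + FP);
f1score     = 2 * precision * sensitivity / (precision + sensitivity);
```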

4.2 Results of the CNN models, BiLSTM, GRU with MSVM classifier

The first proposed framework extracted the most productive features from the CNN models, BiLSTM, and GRU. The features of the three methods were combined into a single vector, yielding 1000 relevant features per image, which were fed into the MSVM classifier.

The performance of five CNN models, namely ResNet-101, GoogleNet, SqueezeNet, DenseNet-201, and MobileNetV2, for the C-NMC 2019 dataset using deep features from BiLSTM, GRU, and MSVM classifier is discussed. Table 3 summarizes the performance of the CNN models with the MSVM classifier on the C-NMC 2019 dataset using 849 test images for both ALL cells and normal cells.

Table 3 The performance of the CNN models with MSVM classifier on the C-NMC 2019 dataset

The DenseNet-201 model outperformed the other models in evaluating the C-NMC 2019 dataset for ALL detection, achieving 96.29% accuracy, 94.58% sensitivity, 98% specificity, 96.23% F1-score, and 97.93% precision. The MobileNetV2 model obtained results similar to those of the DenseNet-201 model on all the metrics, achieving 96% accuracy, 94.23% sensitivity, 97.76% specificity, 95.92% F1-score, and 97.58% precision.

As for the ResNet-101 model, it obtained 95.76% accuracy, 93.99% sensitivity, 97.53% specificity, 95.68% F1-score, and 97.44% precision. With the MSVM classifier, the SqueezeNet model achieved lower results than the other models on the C-NMC 2019 dataset. Figure 5 displays the confusion matrices of the five CNN models for the classification of ALL images from the C-NMC 2019 dataset.

Fig. 5 Confusion matrices for the proposed models with the MSVM classifier: a MobileNetV2, b ResNet-101, c SqueezeNet, d DenseNet201, and e GoogleNet

The ensemble model with the most accurate findings was developed by combining the deep learning models, BiLSTM [21], and GRU [22] as favorable feature extraction, with MSVM as the classifier.

This proposed ensemble model is feasible for classifying ALL images using a blend of CNN models and machine learning techniques.

4.3 Results of the CNN models, BiLSTM, GRU without MSVM classifier

The second proposed framework extracted the most productive features from the CNN models, BiLSTM, and GRU. The features of the three methods were combined into a single vector, yielding 1000 features per image. The features were then fed into an FCL, the basis of transfer learning, and the FCL output was fed to the softmax activation function. Table 4 summarizes the performance of the CNN models without the MSVM classifier on the C-NMC 2019 dataset using 849 test images for each of the ALL and normal classes.

Table 4 The performance of the CNN models without MSVM classifier on the C-NMC 2019 dataset

The MobileNetV2 model obtained the best results on all the metrics, achieving 92.41% accuracy, 89.75% sensitivity, 95.06% specificity, 92.20% F1-score, and 94.78% precision. The DenseNet201 model obtained results similar to those of the MobileNetV2 model on all the metrics, achieving 92.35% accuracy, 87.51% sensitivity, 97.18% specificity, 91.96% F1-score, and 96.87% precision. With the FCL and softmax classifier, the SqueezeNet model achieved lower results than the other models on the C-NMC 2019 dataset.

The confusion matrix shown in Fig. 6 illustrates the performance of the best model without an MSVM classifier, MobileNetV2, for the classification of ALL images from the C-NMC 2019 dataset.

Fig. 6 Confusion matrix for the proposed model without the MSVM classifier for MobileNetV2

As shown in Fig. 7, training the MobileNetV2 model achieved a high accuracy of 92.41% using the Adam training function, 30 epochs, a learning rate of 0.0001, a mini-batch size of 15, and 28,550 iterations.
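For reference, a hedged MATLAB sketch of the stated training configuration follows; options not mentioned in the text are left at their defaults, and trainSeq/trainLabels are assumed sequence inputs and labels for the layer stack of Sect. 3.4:

```matlab
% Training configuration as reported: Adam, 30 epochs, learning rate
% 1e-4, mini-batch size 15; remaining options assumed default.
opts = trainingOptions('adam', ...
    'MaxEpochs', 30, ...
    'InitialLearnRate', 1e-4, ...
    'MiniBatchSize', 15, ...
    'Plots', 'training-progress');
net = trainNetwork(trainSeq, trainLabels, layers, opts);
```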

Fig. 7 Training progress for the MobileNetV2 model

After testing, the best-performing ALL image classifier is the DenseNet-201 model applied to the C-NMC 2019 dataset within the first proposed framework, which combines CNN, BiLSTM, and GRU [22] feature extraction with MSVM as the classifier.

4.4 Comparison with previous works on the C-NMC 2019 dataset

As mentioned in Table 1, some previous works used the C-NMC 2019 database to test their methods for identifying ALL cells and healthy cells. The best F1-score, 95.2% on the C-NMC 2019 dataset, was reached by Goswami et al. [29]. Previously, Ullah et al. [4] proposed a recent CNN architecture for ALL recognition with an accuracy of 91.1% on the C-NMC 2019 dataset.

Also, aggregation-based deep learning [30] presented an ensemble model of NASNetLarge and VGG19 to classify ALL cells with an accuracy of 96.5%, a sensitivity of 95.9%, a specificity of 96.9%, and an F1-score of 94.6% on the C-NMC 2019 database.

On the other hand, the proposed model proved that the DenseNet-201 model achieved the effective classification of ALL cells with an accuracy of 96.29%, a sensitivity of 94.58%, a specificity of 98%, an F1-score of 96.23%, and a precision of 97.93% on the C-NMC 2019 database.

The limitation of previous works on the C-NMC 2019 database is the variance in model performance for the classification of ALL images. Some researchers reported only the F1-score without accounting for other metrics.

The challenges of using the C-NMC 2019 database are its highly unbalanced training images, low intra-class contrast, and high optical similarity between the two classes [51]. Consequently, previous work on the C-NMC 2019 database rates poorly on ALL images compared with the proposed ensemble model.

A comparison of the proposed model with existing models on the C-NMC 2019 database is provided in Table 5. The previous works used the C-NMC 2019 dataset [17] including the training dataset, preliminary test set, and final test set, whereas the proposed model used only the training dataset and preliminary test set. This study improves on similar studies recently developed in [29, 30] and exceeds the diagnostic accuracy of hematologists.

Table 5 Performance comparisons between the proposed method and other previous work on the C-NMC 2019 database

5 Discussion

This study introduced an innovative method for the classification of ALL and normal microscopic images, providing automated procedures that aid the development of the ensemble model. The availability of public microscopic cell data enables comparable algorithms that help hematologists diagnose ALL cell images [52]. This paper uses the C-NMC 2019 dataset [17] to train and test the proposed classifier for ALL images. After training on 7642 images per class, testing is conducted using 849 images per class.

The empirical results emphasize that the proposed model with an MSVM classifier detects ALL images more efficiently than the proposed model without an MSVM classifier. Regarding the normal/ALL classification in the C-NMC 2019 dataset, the DenseNet-201 model achieved the best accuracy when using BiLSTM and GRU as feature extraction with the MSVM classifier. The F1-score for the DenseNet-201 model is 96.23%, the best performance compared with the previous works shown in Table 1.

Given these findings, no previous work on ALL classification has reached enhancements as significant as the performance of the proposed model.

Also, the proposed framework introduces the new combined features extracted from the CNN, BiLSTM, and GRU, as detailed in Fig. 1. This combination can be utilized to elicit features in many medical image classification tasks, since selecting the most important features supports the highest level of accuracy in a classification task.

Deep learning approaches are becoming popular in medical image diagnostic systems but rely on large training datasets [53], whereas machine learning approaches are limited when training datasets become large. However, CNN models allow features to be extracted from the original image without any preliminary image analysis, eliminating bias during the feature extraction process [54].

Consequently, this study emphasizes applying a large dataset to both CNN models and machine learning algorithms.

So far, there has been no study of a hybrid model for diagnosing ALL images. Therefore, this paper presents a new application of a hybrid model to improve the diagnosis of ALL images.

This proposed model will be a supportive diagnostic tool for ALL image classification on the unbalanced C-NMC 2019 dataset. To ensure credibility, the proposed results should be validated by a hematologist in the future.

6 Conclusion and future works

ALL cell-related leukemia causes organ dysfunction in patients, and it is not easy to interpret ALL microscopic images with manual techniques. This paper has developed an automated ensemble model based on ALL microscopic images to reach superior performance. To build the ensemble model, the output of the deeply embedded features is fed to the MSVM classifier to determine the preferable model. It was experimentally validated that the deep features of CNN, BiLSTM, and GRU are closely related to achieving the best performance in ALL image classification. The proposed ensemble model outperforms the best-performing single model with an F1-score of 96.23% using the MSVM classifier. This model serves as a heuristic approach to deal with the flawed-database dilemma and to choose the right features.

Further studies are desired to verify the accuracy and sensitivity of this proposed model by applying it to different leukemia image databases.