1 Introduction

The prostate is a small, walnut-shaped gland located inside the pelvis and surrounding the urethra. It is responsible for producing the prostatic fluid that, mixed with the sperm produced by the testicles, forms semen [1]. Prostate cancer develops when cells in the prostate begin to grow in an uncontrolled way and invade the surrounding organs and tissues.

According to the World Health Organization, prostate cancer is the second most commonly diagnosed cancer among males and one of the leading causes of cancer death worldwide, with 1,414,259 new cases (7.3% of all new cancer cases) and 375,304 deaths (3.8% of all cancer deaths) globally; these numbers are projected to rise to 2,430,000 new cases and 740,000 deaths by 2040 [2]. In Egypt, prostate cancer is one of the most common cancers among men, with 4,767 new cases (3.5% of all new cases across cancer types) and 2,227 deaths (2.5% of all cancer deaths), and by 2040 the morbidity and mortality are expected to double, reaching 9,610 new cases and 4,980 deaths [3].

These statistics, for both Egypt and the world, show that the morbidity and mortality of this cancer are increasing dramatically, making it one of the fastest-growing malignancies among males. Early detection and diagnosis of prostate cancer are therefore crucial for improving patient care and increasing survival rates [2, 3].

Many prostate cancers grow too slowly to cause any serious problems, so no treatment is needed. In other cases, however, prostate cancer grows quickly and spreads to the surrounding tissues and organs [4]. Adenocarcinoma is the most common type of prostate cancer. Other types include transitional cell carcinomas, neuroendocrine tumors, sarcomas, small cell carcinomas, and squamous cell carcinomas [5].

The symptoms of prostate cancer depend on the stage of the disease, whether early, advanced, or recurrent. They may include (1) bone pain, (2) trouble urinating, (3) blood in the urine, (4) fatigue, (5) painful ejaculation, (6) blood in the semen, (7) jaundice, (8) numbness in the feet or legs, (9) unintentional weight loss, (10) erectile dysfunction, and (11) decreased force of the urine stream [6, 7].

The risk factors of prostate cancer depend on a person’s lifestyle, family history, and age. Factors that increase the chance of developing prostate cancer include (1) obesity, (2) older age (i.e., over 50), (3) family history, (4) ethnicity (i.e., Black men have a higher probability of being diagnosed with prostate cancer), and (5) genetic or DNA changes in cells [8].

There are various methods for diagnosing prostate cancer, but screening tests are the most effective way to detect it at an early stage; they include (1) the prostate-specific antigen (PSA) test, (2) the digital rectal examination (DRE), and (3) biopsy [9]. PSA is a protein found on prostate cells and produced by the gland to keep semen in liquid form. Most PSA is found in semen, but a small quantity also exists in the bloodstream [10]. A high PSA level indicates a high probability of prostate cancer [11].

DRE is one of the important tools for screening prostate cancer by checking for any abnormalities in the prostate area [12]. With this test, any abnormality in the size, texture, or shape of the prostate gland can be noticed. DRE may be performed alongside the PSA test for prostate cancer screening [13].

Biopsy is a medical procedure that involves taking small samples of prostate tissue, based on the gland size, and examining them under a microscope to detect the existence of cancer cells and diagnose prostate cancer [14]. The doctor decides whether to perform a biopsy based on the DRE and PSA results, and may also request imaging tests such as magnetic resonance imaging (MRI) or transrectal ultrasound [15].

There are different treatment options for prostate cancer, including (1) surgery to remove the prostate cancer cells through a prostatectomy procedure; (2) radiotherapy, which includes brachytherapy (where a radiation source is placed inside the patient’s body) and external beam radiation therapy (where the radiation comes from outside the body); (3) active surveillance, used to monitor cancer growth when the cancer is confined to the prostate and has not spread to surrounding areas; (4) watchful waiting, which is similar to active surveillance but monitors the cancer without treatment and focuses on managing symptoms; (5) focal therapy, used when the cancer has not spread to other tissues, which treats only the affected area of the prostate, whereas systemic therapies are used if the cancer has spread outside the prostate; and (6) cryotherapy, which destroys cancer cells through controlled freezing of the prostate gland and is used for patients whose health issues prevent them from undergoing radiotherapy or surgery [16,17,18]. A graphical summary concerning prostate cancer types, diagnosis, treatment, and symptoms is shown in Fig. 1.

Fig. 1

A graphical summary concerning the prostate cancer types, diagnosis, treatment, and symptoms

In recent decades, many prostate cancer detection approaches have been proposed, but they could not detect and diagnose the cancer effectively. Recently, artificial intelligence (AI) has played an important role in the diagnosis and detection of different cancer types, including prostate cancer [19]. Deep Learning (DL), a subfield of AI, is the state-of-the-art choice for classifying medical images from different modalities (e.g., MRI, ultrasound, and computed tomography (CT)) and for extracting features from images to determine whether they contain a tumor [20].

For prostate cancer, researchers have deployed DL to detect the disease by classifying medical images from different modalities and by analyzing biopsy images and screening test results to detect any abnormalities in the prostate [21]. The deployment of DL has had a positive impact on the diagnosis process by decreasing time and cost and helping in the early detection of the disease, leading to improved quality of healthcare for prostate cancer patients [22].

The current study introduces a hybrid framework for precise diagnosis and segmentation of prostate cancer using deep learning techniques. Diagnosis is performed using eight different pretrained CNN models, namely ResNet152, ResNet152V2, MobileNet, MobileNetV2, MobileNetV3Small, MobileNetV3Large, NASNet Mobile, and NASNet Large. The Aquila optimizer is applied to tune the CNN models for better accuracy. Based on the diagnosis results, the segmentation phase is triggered; it is crucial for physicians to determine the size and exact position of the tumor for better treatment. In the proposed framework, segmentation is performed via the U-Net model. Another merit of the proposed model is the use of three different datasets, namely (1) “PANDA: Resized Train Data (512 × 512),” (2) “ISUP Grade-wise Prostate Cancer,” and (3) “Transverse Plane Prostate Dataset.” The use of data from various datasets ensures the robustness of the presented framework.

1.1 Paper contributions

The contributions of the present study are:

  • Developing a hybrid framework for precise diagnosis and segmentation of prostate cancer.

  • In the diagnosis phase, eight different transfer learning CNN models have been applied in order to find the most promising model.

  • Aquila optimizer is applied to tune the hyperparameters of the CNN models to find the best combinations with best performance.

  • In the segmentation phase, U-Net is applied for segmenting prostate cancer into five grades.

  • Comparing the performance of our framework with the state-of-the-art related studies.

1.2 Paper organization

The paper is organized into five sections. After this introduction, the related state-of-the-art studies are presented in Sect. 2. The methodology and the experimental results of the proposed framework are given in Sects. 3 and 4, respectively. The final section gives the conclusion, limitations, and directions for future work.

2 Related studies

Recently, there has been extensive research on prostate cancer detection and diagnosis based on artificial intelligence, involving machine learning and deep learning approaches [23,24,25]. Recent studies have used different approaches, datasets, and tools to facilitate the recognition of prostate cancer from MRI and CT images. The related studies are divided into those focused on (1) traditional machine learning (ML) algorithms, (2) deep learning (DL) approaches, and (3) hybrid approaches combining DL and ML.

2.1 Machine learning-based studies

Zhang et al. [26] proposed a new framework for diagnosing prostate cancer and segmenting lesions in MRI using various techniques such as Support Vector Machine, Multilayer Perceptron, and K-Nearest Neighbors. The obtained results showed an accuracy of 80.97% and a Dice value of 0.79.

Erdem et al. [27] suggested an ML framework involving different techniques and algorithms for detecting prostate cancer, such as Random Forest, Support Vector Machine, Deep Neural Network, Multilayer Perceptron, Logistic Regression, Linear Discriminant Analysis, and Linear Regression. The results showed that the Multilayer Perceptron achieved the best performance, with 97% accuracy, 100% recall, 0.958 AUC, 95% precision, and 97% F1-score.

Nayan et al. [28] applied different ML approaches, such as a fully connected Artificial Neural Network, Random Forest, Support Vector Machine, and Logistic Regression, for predicting prostate cancer progression under active surveillance for 790 patients. The best result was achieved by the Support Vector Machine classifier, with an F1-score of 0.586.

2.2 Deep learning-based studies

Gentile et al. [29] proposed a DL framework for optimized identification of prostate cancer utilizing various PSA molecular forms, such as free PSA, p2PSA, PSA density, and total PSA. The proposed model was evaluated on 437 patients and achieved 86% sensitivity and 89% specificity.

Shrestha et al. [30] suggested a novel approach for segmenting prostate cancer based on DL models and batch normalization on 230 MRI images. They utilized feature extraction optimization to extract multi-level (i.e., low- and high-level) features for prostate localization and shape recognition, improving prostate cancer segmentation and diagnosis. The proposed model achieved 95.3% segmentation accuracy. Khosravi et al. [31] proposed a DL framework for diagnosing prostate cancer by fusing pathology and radiology data and analyzing biopsy and MRI datasets of 400 patients with histological data and suspicion of prostate cancer. The framework distinguishes between cancerous and benign tissue and classifies the patient’s risk as high or low. Their model achieved an Area Under the Curve (AUC) of 0.89.

Wessels et al. [32] applied a convolutional neural network (CNN), a DL approach, for predicting lymph node metastasis from prostate cancer tumor histology. The dataset consisted of stained tumor slides from 218 patients, and they obtained an area under the receiver operating characteristic curve (AUROC) of 0.83 by combining the CNN with lymphovascular invasion. Linkon et al. [33] suggested a DL framework to diagnose prostate cancer and grade the Gleason score of histopathological images from four different datasets by deploying a CNN model and a post-processing technique to improve performance. They achieved 0.8921 precision and 0.8460 F1-score.

Patel et al. [34] performed prostate cancer detection using DL techniques on MRI images of 158 patients. Detection performance was assessed against ultrasound-guided and MRI-targeted biopsies. They yielded 82% accuracy, 86% precision, and 78% recall. Shao et al. [35] proposed a DL pipeline called ProsRegNet for simplifying the registration of prostate histopathology and MRI images. They used 654 pairs of MRI and histopathology slices from 152 prostate cancer patients and obtained Dice coefficient values in the range of 0.96 to 0.98.

Amarsee et al. [36] carried out automated tracking and detection of implanted marker seeds in prostate cancer patients using DL approaches; the marker seeds identify the positioning of the prostate volume during prostate cancer treatment. The CNN model analyzed 1,500 images and achieved 98% accuracy. Yang et al. [37] suggested an approach for the automatic detection of prostate cancer on a multi-parametric MRI (mpMRI) dataset of 780 patients. They used a CNN model and achieved 0.9684 AUC. Kovalev et al. [38] performed computerized prostate cancer diagnosis based on DL techniques, using 10,616 slide histology images with different CNN architectures and feature extraction techniques. The best performance obtained was 92.77% accuracy.

John et al. [39] proposed a prostate cancer prediction approach for 330 mpMRI images from the ProstateX challenge dataset. They deployed two CNN architectures, DenseNet and MobileNet, to analyze prostate screening results including DRE and PSA tests. They obtained 0.931 AUC and 0.93 F1-score. Comelli et al. [40] developed a DL framework involving three approaches for segmenting prostate cancer in MRI images of 85 patients and yielded a 90.89% Dice coefficient. Pinckaers et al. [41] carried out prostate cancer detection in whole-slide images of biopsies, using a dataset of 1,243 slides containing 5,759 prostate biopsies. The proposed CNN model yielded 0.992 AUC.

Salvi et al. [42] proposed a hybrid DL approach called RINGS (i.e., Rapid IdentificatioN of Glandular Structures) for segmenting prostate glands in histopathological images to support the prostate cancer diagnosis process. They utilized 1,500 prostate cancer biopsy images from 150 men. The suggested approach achieved a 90.16% Dice score, 91.24% precision, and 97.23% recall. Korevaar et al. [43] developed a new prostate cancer detection framework for CT scans. The dataset involved 571 CT scans (139 prostate cancer patients confirmed through MRI or transrectal ultrasound-guided biopsy and 432 control cases with unknown prostate cancer status). The results showed 98.8% specificity and 0.88 AUC.

Chahal et al. [44] suggested a novel CNN framework (Xception model) based on the U-Net. The dataset contained 1,689 2D gray-scale MRI images of 50 patients. They achieved an overall Dice value of 97.50%. Sobecki et al. [45] developed a prostate cancer diagnosis framework based on a CNN model and an mpMRI dataset containing 538 images of 344 patients, achieving 0.84 AUC. Balagopal et al. [46] suggested a DL approach for automatically segmenting clinical target volumes, with uncertainties, for prostate cancer radiotherapy on CT images. The proposed CNN model was utilized for the localization of organs at risk and clinical target volumes. They obtained a high Dice Similarity Coefficient of 0.87.

Liu et al. [47] proposed a DL-based computer-aided diagnosis system for ultrasound image-aided prostate cancer diagnosis, comprising S-mask R-CNN and InceptionV3. They utilized a dataset of 1,200 ultrasound images and yielded a Dice coefficient of 0.87 and precision scores for the malignant and benign categories of 0.8 and 0.76, respectively. Abdelmaksoud et al. [44] used a transfer learning (TL) approach on a DWI dataset consisting of 470 diffusion-weighted slices (234 malignant and 236 benign). The proposed approach achieved 91.2% accuracy, 90.1% specificity, and 91.7% sensitivity.

Ambroa et al. [48] suggested a TL approach based on a CNN model for predicting the dose-volume histogram for prostate cancer radiotherapy. They carried out a CNN model to predict the bladder and rectum dose-volume histograms of prostate patients, using a database of 2D CT scan images of 144 patients, and achieved 87.5% accuracy and 100% precision. Hao et al. [49] performed prostate cancer detection utilizing various data augmentation strategies on a diffusion-weighted magnetic resonance imaging (DWI) dataset containing 10,128 2D slices of 414 patients, using different CNN models. They obtained 0.85 AUC.

Kudo et al. [50] deployed a CNN approach to diagnose prostate cancer in 32 prostate cancer biopsy images and 2,594 fragments, obtaining an overall accuracy of 98.3%. Mehta et al. [51] utilized clinical features and mpMRI images on the PICTURE and ProstateX datasets; the achieved AUCs were 0.79 and 0.86, respectively. Hoar et al. [52] performed semantic segmentation of prostate cancer on mpMRI images by combining TL and test-time augmentation to improve the CNN model’s ability to distinguish cancer from non-cancer. They used mpMRI data of 154 subjects and yielded 0.93 AUROC. Cipollari et al. [53] suggested a CNN model to perform automated prostate cancer classification on 316 mpMRI images of 312 men, and the best performances among sequences were 100% and 96.62% accuracy.

Pellicer-Valero et al. [54] developed a DL approach for automated segmentation, Gleason grade estimation, diagnosis, and detection of prostate cancer in 490 mpMRI images of 75 patients collected from two datasets (the ProstateX dataset and the Valencia Oncology Institute Foundation dataset). They obtained an overall Dice Similarity Coefficient between 0.894 and 0.941. Saunders et al. [55] suggested a DL framework for segmenting prostate cancer using the U-Net model. They applied TL and DA techniques to the three MRI scan datasets used, to increase the data and improve model performance. The proposed model obtained a 0.9 Dice Similarity Coefficient.

Han et al. [56] developed DL models to detect bone metastasis on whole-body bone scans of prostate cancer patients. They utilized 9,113 bone scans (i.e., 2,991 positive and 6,142 negative) of 5,342 prostate cancer patients and two CNN architectures, Global Local Unified Emphasis and Whole Body-based. The obtained results were 90% accuracy (0.946 AUC) and 88.9% accuracy (0.944 AUC), respectively.

2.3 Hybrid approaches of deep learning and machine learning-based studies

Iqbal et al. [57] proposed a prostate cancer detection approach based on DL models such as Long Short-Term Memory and ResNet, together with traditional techniques, applied to 230 MRI scans of patients with various descriptions and categories. They also used different non-DL models such as K-Nearest Neighbors (cosine), Naïve Bayes, RUSBoost tree, Support Vector Machine, and Decision Tree. The results showed that the ResNet model obtained the best performance among the DL approaches, with 100% accuracy and 1.0 AUC.

Salama et al. [58] performed prostate cancer detection based on a deep convolutional neural network and a Support Vector Machine. They used a DWI dataset and deployed various techniques involving TL and data augmentation (DA). After DA, 1,765 images were available for training the model to classify whether prostate cancer exists. The proposed work obtained 98.79% accuracy, 98.43% sensitivity, 0.9592 F1-score, 0.9891 AUC, and 97.99% precision.

Table 1 gives a summary of the discussed related studies in 2021.

Table 1 Related studies summarization

3 Methodology

As noticed from the related studies, the common disadvantages of the presented studies are either low accuracy or lack of data. Therefore, we suggest a hybrid framework for accurate diagnosis and segmentation of prostate cancer using deep learning. To ensure the variety and robustness of the proposed model, three different datasets are used, namely (1) “PANDA: Resized Train Data (512 × 512),” (2) “ISUP Grade-wise Prostate Cancer,” and (3) “Transverse Plane Prostate Dataset.” Data preprocessing is applied to unify the different datasets on one single platform. The proposed framework is divided into two stages: a classification stage and a segmentation stage. In the first stage, eight different deep learning algorithms, via a transfer learning approach, are used to distinguish prostate cancer patients from normal ones. The Aquila optimizer is used to fine-tune the hyperparameters of the various models. The second stage begins once the patient is diagnosed with prostate cancer. This phase is important to help doctors identify the infected regions so that the size and shape of the tumor can be correctly recognized.

The current section presents the suggested approach for prostate cancer classification and segmentation. It is divided into different phases that are summarized in Fig. 2. In short, it starts by acquiring the datasets. The current study uses three datasets, one for segmentation and two for classification. After that, it applies different pre-processing techniques such as data augmentation, resizing, and scaling. Different transfer learning CNN models are utilized in the image classification and hyperparameters optimization phase. The Aquila optimizer is used to optimize the hyperparameters in that phase. Different performance metrics are utilized to judge the system performance. The U-Net segmentation model is utilized to determine and grade the locations of the tumors.

Fig. 2

The suggested approach for prostate cancer classification and segmentation

3.1 Data acquisition

The prostate cancer classification and segmentation in the current study are performed by utilizing three datasets obtained from the Kaggle online platform. The used datasets are (1) “PANDA: Resized Train Data (512 × 512)”, (2) “ISUP Grade-wise Prostate Cancer”, and (3) “Transverse Plane Prostate Dataset.”

3.1.1 The “PANDA: resized train data (512 × 512)” dataset

The Prostate cANcer graDe Assessment (PANDA) dataset is the largest public whole-slide image collection, with about 11,000 digitized slide images of H&E-stained biopsies. The dataset is provided by two centers (i.e., the Karolinska Institute and Radboud University Medical Center) and is utilized to perform the segmentation process. In the two centers, the prostate glands are labeled as stroma, malignant epithelium, benign epithelium, and non-tissue [61]. The “PANDA: Resized Train Data (512 × 512)” dataset, a resized version of PANDA, is used in the current study [62].

3.1.2 The “ISUP grade-wise prostate cancer” dataset

The “ISUP Grade-wise Prostate Cancer” dataset consists of 10,616 prostate images, each graded on a severity scale from zero to five indicating whether the cancer is significant or non-significant [63, 64].

3.1.3 The “transverse plane prostate dataset” dataset

The “Transverse Plane Prostate Dataset” contains a total of 1,528 prostate MRI images from 64 patients in the transverse plane. The dataset is partitioned into two categories (i.e., significant and non-significant) [64, 65].

3.2 Data preprocessing

Data preprocessing is an essential step to prepare the dataset for the next phase and to make the classification and segmentation processes more effective. It involves various techniques such as data augmentation, resizing, and scaling [66].

3.2.1 Data augmentation

Data Augmentation (DA) is a process of expanding a dataset into a larger, more diverse one [67]. DA techniques improve the model performance. Different augmentation techniques can be applied, including (1) zooming in (or out), (2) rotation, (3) horizontal (or vertical) flipping, (4) vertical (or horizontal) shifting, (5) vertical (or horizontal) shearing, (6) changing the image brightness, and (7) cropping [68,69,70]. Table 2 shows the ranges used in the current study.

Table 2 The used data augmentation techniques and their ranges
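To make the transforms concrete, the sketch below applies a few of them (horizontal flip, 90° rotation, brightness scaling) to a NumPy image array. This is a framework-agnostic illustration with hypothetical function names; in practice such transforms are usually delegated to a DL library's augmentation utilities, and the probability and ranges here are illustrative, not the values from Table 2.

```python
import numpy as np

def flip_horizontal(img: np.ndarray) -> np.ndarray:
    """Mirror an (H, W, C) image along its width dimension."""
    return img[:, ::-1, :]

def rotate_90(img: np.ndarray, k: int = 1) -> np.ndarray:
    """Rotate the image by k * 90 degrees in the spatial plane."""
    return np.rot90(img, k=k, axes=(0, 1))

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities, clipping to the valid [0, 255] range."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly compose the transforms to yield one augmented sample."""
    if rng.random() < 0.5:
        img = flip_horizontal(img)
    img = rotate_90(img, k=int(rng.integers(0, 4)))
    return adjust_brightness(img, factor=rng.uniform(0.8, 1.2))
```

Calling `augment` repeatedly on the same source image yields distinct training samples, which is how DA enlarges the effective dataset.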

3.2.2 Data resizing

The images in the current study are resized to a uniform size; smaller images also make the training process quicker. For the segmentation dataset, the used size is \((128, 128, 3)\), while for the classification datasets, the used size is \((100, 100, 3)\).
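As a minimal sketch of how such resizing works, the function below performs nearest-neighbor resampling in plain NumPy; production pipelines would normally use an image library's interpolation routines instead.

```python
import numpy as np

def resize_nearest(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an (H, W, C) image with nearest-neighbor sampling."""
    in_h, in_w = img.shape[:2]
    # Map each output pixel back to its nearest source pixel.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols[None, :], :]
```

For example, `resize_nearest(img, 128, 128)` produces the segmentation input size and `resize_nearest(img, 100, 100)` the classification input size.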

3.2.3 Data scaling

Data scaling is one of the pre-processing techniques that normalize the attribute values to be within a known scale (or range) to improve the overall performance. There are different scaling techniques but the used ones in the current study are (1) min–max scaling, (2) normalization, (3) standardization, and (4) max-absolute scaling [71, 72].


Min–Max Scaling transforms all the features into the range between 0 and 1 (or between − 1 and + 1 if negative values exist in the data) [73]. Equation 1 shows how to calculate it, where \(\text{out}\) is the output, \(\text{in}\) is the input, \(\text{in}_{\max}\) is the maximum input value, and \(\text{in}_{\min}\) is the minimum input value.

$$\text{out} = \frac{\text{in} - \text{in}_{\min}}{\text{in}_{\max} - \text{in}_{\min}}$$
(1)

Normalization squeezes the dataset into the range between 0 and 1 by dividing by the maximum value. It is an efficient approach in the classification process, especially with datasets involving negative values [74]. Equation 2 shows how to calculate it.

$$\text{out} = \frac{\text{in}}{\text{in}_{\max}}$$
(2)

Standardization standardizes each feature value within the distribution by removing the mean (i.e., mean equal to zero) and scaling the values into unit variance [70]. Equation 3 shows how to calculate it, where \(\mu\) is the mean value and \(\sigma\) is the standard deviation.

$$\text{out} = \frac{\text{in} - \mu}{\sigma}$$
(3)

Max-Absolute Scaling is similar to the min–max scaler, but instead of scaling into the range between 0 and 1, the values are normalized into the range between \(-1\) and \(+1\) by dividing by the maximum absolute value [75]. Equation 4 shows how to calculate it.

$$\text{out} = \frac{\text{in}}{\left| \text{in}_{\max} \right|}$$
(4)
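Equations 1–4 translate directly into NumPy; a minimal sketch, assuming each scaler is applied per feature with statistics computed from the training data:

```python
import numpy as np

def min_max_scale(x: np.ndarray) -> np.ndarray:
    """Eq. 1: map values into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def normalize(x: np.ndarray) -> np.ndarray:
    """Eq. 2: divide by the maximum value."""
    return x / x.max()

def standardize(x: np.ndarray) -> np.ndarray:
    """Eq. 3: zero mean, unit variance."""
    return (x - x.mean()) / x.std()

def max_abs_scale(x: np.ndarray) -> np.ndarray:
    """Eq. 4: map values into [-1, 1] via the maximum absolute value."""
    return x / np.abs(x).max()
```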

3.3 Image segmentation

Image segmentation is an essential process in the computer vision field for extracting more detailed information from images. A key requirement for segmentation tasks is the use of masks: a mask is essentially a binary image of zero and nonzero values from which the desired segmentation output can be obtained. The segmentation objective is to assign to every pixel a label representing what it depicts [76]. Deep learning approaches have recently been utilized to segment medical images using a variety of architectures including U-Net, U-Net++, Swin-Unet, Attention U-Net, and V-Net [77].

U-Net is utilized in the current study. It is a convolutional network architecture for fast and precise image segmentation. It evolved from the sliding-window approach, in which each pixel’s class label is predicted as a distinct unit from a local region (i.e., patch) surrounding it [78]. The goal of U-Net is to capture both context and localization characteristics. It is built on the principle of using consecutive contracting layers, followed by upsampling operators, to obtain higher-resolution outputs for the input images.

With U-Net, computations may be completed in a short amount of time using a modern Graphics Processing Unit (GPU). The encoder and decoder are the two primary components of the U-Net design. The encoder is made up of several convolutional layers, each followed by a pooling operation, and is used to extract the image’s features. To allow for localization, the decoder employs transposed convolutions together with skip connections to the corresponding encoder layers [79].

The U-Net model is applied for each masked layer on the “PANDA: Resized Train Data (512 × 512)” dataset; hence, there are five segmenters, one for each layer. Table 3 shows the hyperparameters used in the U-Net training process.

Table 3 The used U-Net hyperparameters in its training process
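The encoder–decoder idea can be sketched compactly in Keras. The block below is a two-level toy U-Net, not the actual trained model: the depth, filter counts, and single-sigmoid-channel output (one binary mask per tissue layer, over the (128, 128, 3) inputs) are illustrative assumptions; the real hyperparameters are those in Table 3.

```python
from tensorflow.keras import layers, Model

def build_unet(input_shape=(128, 128, 3), base_filters=16):
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolutions followed by pooling extract contextual features.
    c1 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck.
    b = layers.Conv2D(base_filters * 4, 3, padding="same", activation="relu")(p2)

    # Decoder: transposed convolutions upsample; skip connections
    # (concatenate) restore localization detail from the encoder.
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    c3 = layers.Conv2D(base_filters * 2, 3, padding="same", activation="relu")(
        layers.concatenate([u2, c2]))
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(c3)
    c4 = layers.Conv2D(base_filters, 3, padding="same", activation="relu")(
        layers.concatenate([u1, c1]))

    # One sigmoid channel: a binary mask for a single tissue layer.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```

Five such segmenters, one per mask layer, would each be trained on their own binary masks.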

3.4 Learning and classification using transfer learning

Deep learning (DL) is a rapidly expanding machine learning (ML) discipline with applications in image recognition, image generation, self-driving cars, and the medical sciences [80]. The architecture of DL is made up of several linked layers of weighted components that are comparable to neurons [81]. The main common characteristic of DL methods is their focus on feature learning and automatically learning representations of the data [82]. Feature extraction is a technique utilized for capturing and extracting relevant attributes and information from images for use in further processing. It is deployed on rich datasets involving medical imaging modalities such as CT, MRI, and ultrasound [83].

3.4.1 Convolutional neural network (CNN)

To extract DL features from the image datasets, we utilize a CNN as an automatic feature extractor. The CNN is one of the most powerful tools for DL tasks: it takes an input image, captures features from it using kernels (digital filters), classifies high-dimensional data, and reduces it to lower dimensions without losing important information [85,86,87].

A CNN’s architecture is composed of multiple layers, starting with an input layer that holds the pixel values of the input image. The input image is passed through a convolutional layer, which extracts multiple levels of information using kernels and filters with predefined widths and heights; this layer determines which output neurons are linked to a particular part of the input data [88]. By sliding the filters over the input, multiple convolution operations are conducted to extract distinct feature levels from the image and stack them [89].

The pooling layer downsamples the input and decreases the number of parameters, reducing training time and overfitting without sacrificing critical information. After the pooling procedure, the fully connected layer is a flattened feed-forward layer that facilitates the classification process: nonlinear combinations of the features output by the convolutional layers are learned after the downsampling and feature extraction procedures [90]. The fully connected layer connects all neurons in the previous and subsequent layers and applies a nonlinear activation function to produce predictions and classify the input data into multiple classes [91].

The batch normalization layer is one of the main layers in the CNN architecture; it improves the model’s performance and speeds up training by (1) allowing a wider range of learning rates and (2) re-parametrizing the optimization problem, resulting in a more stable, smoother, and faster training process [92]. The activation layer applies a function to produce each node’s output, such as the Sigmoid, Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU), Leaky ReLU, Exponential Linear Unit (ELU), Scaled Exponential Linear Unit (SeLU), or SoftMax function [93].
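The sliding-filter convolution and max-pooling operations described above can be sketched directly in NumPy (single channel, stride 1, no padding, purely for illustration; real CNN layers vectorize this across many filters and channels):

```python
import numpy as np

def conv2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image and sum elementwise products."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(img: np.ndarray, size: int = 2) -> np.ndarray:
    """Downsample by taking the maximum over non-overlapping windows."""
    h, w = img.shape[0] // size, img.shape[1] // size
    return img[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))
```

Stacking `conv2d` (with learned kernels), a nonlinearity, and `max_pool` repeatedly is exactly the feature-extraction pipeline the text describes.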

3.4.2 Transfer learning (TL)

Transfer learning is an artificial intelligence (AI) approach in which a model pre-trained for one task is reused for another; it may be summed up as the transfer of knowledge [80]. The basic idea behind reusing pre-trained models is to obtain a strong starting point for a new task that lacks abundant labeled training data, rather than beginning from scratch and producing such labeled data, which is highly expensive [94, 95]. Several CNN models pre-trained on the ImageNet image database [96] exist; the ones utilized in this study are ResNet152, ResNet152V2, MobileNet, MobileNetV2, MobileNetV3Small, MobileNetV3Large, NASNet Mobile, and NASNet Large.
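The transfer-learning pattern can be sketched in Keras: load a pre-trained backbone (MobileNetV2 is used here as one representative of the eight models), freeze its weights, and attach a new classification head for the (100, 100, 3) inputs. The head layers and the two-class output are illustrative assumptions, not necessarily the configuration used in the study.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNetV2

def build_transfer_model(input_shape=(100, 100, 3), num_classes=2,
                         weights="imagenet"):
    # Pre-trained backbone without its original ImageNet classifier head.
    base = MobileNetV2(include_top=False, weights=weights,
                       input_shape=input_shape)
    base.trainable = False  # freeze: reuse the learned features as-is

    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(0.2)(x)  # regularize the new head
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(base.input, outputs)
```

Only the small new head is trained, which is why transfer learning works well when labeled data are scarce.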

3.4.3 Parameters optimization

Parameter optimization is the method of adjusting the weights of the CNN model to increase the probability of achieving accurate results and to reduce losses [97]. There are many types of optimizers; the ones used in the current study are (1) Stochastic Gradient Descent (SGD), (2) Stochastic Gradient Descent-Nesterov (SGD-Nesterov), (3) Adaptive Gradient (AdaGrad), (4) Adaptive Delta (AdaDelta), (5) Adaptive Moment Estimation (Adam), (6) Adaptive Maximum (AdaMax), (7) Root Mean Square Propagation (RMSprop), (8) Nesterov-accelerated Adaptive Moment Estimation (Nadam), (9) Root Mean Square Propagation-Centered (RMSprop-Centered), (10) Follow the Regularized Leader (FTRL), and (11) Adaptive Method Setup Gradient (AMSGrad) [98,99,100].
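The first two optimizers in the list can be sketched with a minimal update loop (a toy one-dimensional quadratic objective; the learning rate and momentum values are illustrative, not the study's tuned hyperparameters):

```python
def sgd_minimize(grad, x0, lr=0.1, momentum=0.0, nesterov=False, steps=100):
    """Minimal SGD update loop: momentum=0 gives plain SGD, and
    nesterov=True evaluates the gradient at the look-ahead point."""
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x + momentum * v) if nesterov else grad(x)
        v = momentum * v - lr * g  # velocity update
        x = x + v                  # parameter update
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2(x - 3); minimum at x = 3.
grad = lambda x: 2.0 * (x - 3.0)
x_sgd = sgd_minimize(grad, x0=0.0)
x_nag = sgd_minimize(grad, x0=0.0, momentum=0.9, nesterov=True)
```

Both variants drive the parameter toward the minimizer; adaptive methods such as Adam and RMSprop extend this loop with per-parameter learning-rate scaling.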

3.4.4 Learning hyperparameters

The loss function is crucial in evaluating a candidate solution: it compares the model’s output with the desired output and quantifies the model’s errors [101]. As a result, we can determine how good the model is and adjust its parameters to improve performance and reduce the overall loss [102]. It acts as a penalty for failing to achieve the target output: if the model’s predicted value differs significantly from the desired value, the function returns a large loss value, and a smaller one otherwise [103]. The losses utilized in the current study are (1) Categorical Hinge [104], (2) Poisson [105], (3) Squared Hinge [106], (4) Categorical Crossentropy [107], (5) Hinge [108], and (6) KLDivergence [109].
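The penalty behavior described above can be demonstrated with two of the listed losses (a hand-rolled NumPy sketch on a hypothetical one-hot example, not the Keras implementations):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Large loss when the predicted probability of the true class is
    small, approaching 0 as that probability approaches 1."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

def categorical_hinge(y_true, y_pred):
    """Penalize when the true-class score fails to beat the best
    wrong-class score by a margin of 1."""
    pos = np.sum(y_true * y_pred)
    neg = np.max((1.0 - y_true) * y_pred)
    return max(0.0, neg - pos + 1.0)

y_true = np.array([0.0, 1.0, 0.0])   # one-hot target: class 1
good = np.array([0.05, 0.90, 0.05])  # confident, correct prediction
bad = np.array([0.80, 0.10, 0.10])   # confident, wrong prediction
```

For both losses the wrong prediction is penalized far more heavily than the correct one, which is exactly the signal gradient descent uses to adjust the weights.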

The batch size is the number of data records used to train the model in each iteration; it affects model generalization, the parameter values, and the convergence of the loss function, and is critical to making the learning process faster and more stable [99, 110]. Dropout is a regularization strategy that can be applied to any or all of the architecture’s hidden layers during training. It is critical in preventing and correcting the overfitting problem and, by setting the output of randomly selected neurons to 0, it boosts generalization efficiency across all data [111]. The hyperparameters to be optimized by the Aquila optimizer are summarized in Table 4.
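The dropout mechanism can be sketched as inverted dropout in NumPy (the rate, layer shape, and seed are illustrative, not the study's tuned values):

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each neuron's output with probability
    `rate` and scale survivors by 1/(1-rate) so the expected
    activation is unchanged, letting inference skip the layer."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones((4, 8))              # a toy hidden-layer output
dropped = dropout(acts, rate=0.5, rng=rng)
```

Because a different random subset of neurons is silenced on every batch, no single neuron can be relied upon, which is what curbs overfitting.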

Table 4 The hyperparameters to be optimized by the Aquila optimizer

3.5 Meta-heuristic optimization using Aquila Optimizer (AO)

Meta-heuristic optimization is a popular choice for modeling and solving complex problems that are difficult to tackle with standard methods. The term “meta” refers to a higher level of performance that outperforms simple heuristics [112,113,114,115,116,117,118]. It employs a trade-off between global exploration and local search [119, 120]. Diversification and intensification are crucial aspects of meta-heuristic algorithms: diversification generates a variety of alternatives for exploring the search space, whereas intensification concentrates the search in a particular area by utilizing accumulated information to find a good solution there.

Aquila Optimizer (AO) is a cutting-edge meta-heuristic optimization technique. The AO algorithm’s optimization process is divided into four steps: (1) choosing the search space by vertical stooping (Equation 5), (2) exploring the different search spaces by contour flight with a short glide attack (Equation 6), (3) swooping by grabbing the prey and walking (Equation 7), and (4) exploiting through a converged search space by low flight with a descent attack (Equation 8). In these equations, \(N\) is the population size, \(X\left( {t + 1} \right)\) is the solution of the next iteration, \(t\) is the iteration number, \({\text{rand}}\) is a random number in the range \(\left[ {0,1} \right]\), \(T\) is the total number of iterations, \({\text{XR}}\left( t \right)\) is a random solution in the current iteration \(t\), \(D\) is the dimension space size, \(X_{{{\text{best}}}} \left( t \right)\) is the best solution in the current iteration \(t\), \({\text{Levy}}\left( D \right)\) is the Levy flight distribution function, \(U\) equals 0.00565, \(r1\) is a value in the range \(\left[ {1,20} \right]\), \(D1\) is a value in the range \(\left[ {1,D} \right]\), \(\alpha\) and \(\sigma\) are equal to 0.1, \({\text{UB}}\) is the upper bound, \({\text{LB}}\) is the lower bound, and \(QF\) is the quality function. The fixed values are taken from the original AO paper.

The AO optimization process begins by producing a random initial collection of candidate solutions known as the population. Equation 9 shows how the AO population is created, where \({\text{UB}}\) and \({\text{LB}}\) are the upper and lower bounds, respectively, and \({\text{rand}}\) is a random vector between 0 and 1. The AO search strategies then examine the positions of the optimum solution or near-optimal ones through repeated trajectories [121]. During the optimization phase, every solution updates its location based on the best solution. A series of experiments verifies the optimizer’s capacity to identify the optimum solution for various optimization tasks. The flowchart of AO is shown in Fig. 3.

$$X_{t + 1} = X_{{{\text{best}}}} \left( t \right) \times \left( {1 - \frac{t}{T}} \right) + \left( {\frac{{\mathop \sum \nolimits_{i = 1}^{N} X\left( t \right)}}{N} - X_{{{\text{best}}}} \left( t \right) \times {\text{rand}}} \right)$$
(5)
$$X_{t + 1} = X_{{{\text{best}}}} \left( t \right) \times {\text{Levy}}\left( D \right) + XR\left( t \right) + \left( {r1 + U \times D1} \right) \times {\text{cos}}\left( { - \omega \times D1 + 1.5 \times \pi } \right) - {\text{sin}}\left( { - \omega \times D1 + 1.5 \times \pi } \right) \times {\text{rand}}$$
(6)
$$X_{t + 1} = \left( {X_{{{\text{best}}}} \left( t \right) - \frac{{\mathop \sum \nolimits_{i = 1}^{N} X\left( t \right)}}{N}} \right) \times \alpha - {\text{rand}} + \left( {\left( {\text{UB - LB}} \right) \times {\text{rand}} + LB} \right) \times \sigma$$
(7)
$$X_{t + 1} = QF \times X_{{{\text{best}}}} \left( t \right) - X\left( t \right) \times {\text{rand}} \times \left( {2 \times {\text{rand}} - 1} \right) - {\text{Levy}}\left( D \right) \times 2 \times \left( {1 - \frac{t}{T}} \right) + {\text{rand}} \times \left( {2 \times {\text{rand}} - 1} \right)$$
(8)
$${\text{population}} = {\text{rand}} \times \left( {{\text{UB}} - {\text{LB}}} \right) + {\text{LB}}$$
(9)
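Equations 9 and 5 can be sketched in NumPy as follows (a toy illustration with hypothetical bounds, a sphere fitness function, and an arbitrary seed; the remaining update rules follow the same element-wise pattern):

```python
import numpy as np

rng = np.random.default_rng(42)
N, D, T = 20, 5, 100   # population size, dimensions, total iterations
LB, UB = -10.0, 10.0   # hypothetical search-space bounds

# Equation 9: random initial population within [LB, UB]
population = rng.random((N, D)) * (UB - LB) + LB

def expanded_exploration(population, x_best, t, T, rng):
    """Equation 5 (vertical stoop): soar high around the best solution,
    pulled toward the population mean, with exploration shrinking as
    t approaches T."""
    x_mean = population.mean(axis=0)
    rand = rng.random(population.shape[1])
    return x_best * (1.0 - t / T) + (x_mean - x_best * rand)

# Toy fitness: the sphere function, so the best solution has min norm.
fitness = np.sum(population ** 2, axis=1)
x_best = population[np.argmin(fitness)]
x_new = expanded_exploration(population, x_best, t=1, T=T, rng=rng)
```

Each candidate would then be re-evaluated and the best solution updated, cycling through the four search strategies until iteration \(T\).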
Fig. 3

The flowchart of AO

3.6 Performance metrics

There are various ways to evaluate how the classifier works, including Accuracy, Precision, Recall, Dice Coefficient, Specificity, and Cosine Similarity. A region in the image is positive (or negative) depending on the type of data, and the model’s decision can be true (i.e., correct) or false (i.e., incorrect), giving four possible outcomes: True Positive (TP), where the model detects a malignant (i.e., cancerous) tumor as malignant; False Positive (FP), where it detects a benign (i.e., non-cancerous) tumor as malignant; True Negative (TN), where it detects a benign tumor as benign; and False Negative (FN), where it detects a malignant tumor as benign. Table 5 shows the different performance metrics utilized in the current study.
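The listed metrics follow directly from the four confusion counts; a minimal sketch with hypothetical counts (not the study's reported values):

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the common metrics directly from the four
    confusion-matrix counts described above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)             # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    dice = 2 * tp / (2 * tp + fp + fn)  # equals F1 for binary counts
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "dice": dice}

metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

With these toy counts, for instance, accuracy is (90 + 80) / 200 = 0.85 while recall is only 90 / 110, showing why a single metric is not enough for medical data.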

Table 5 The different utilized performance metrics

3.7 Overall framework combination and pseudocode

The current subsection presents the framework combination and its pseudocode for grading and locating prostate cancer. The pseudocode is presented in Algorithm 1.

Algorithm 1

Overall framework pseudocode

4 Experiments and discussions

The experiments are divided into two categories: (1) experiments related to segmentation and (2) experiments related to optimization, learning, and classification. The scripting language is Python, and the packages used are TensorFlow, Keras, NumPy, OpenCV, and Matplotlib. The scripts are run on Google Colab with its GPU (i.e., Intel(R) Xeon(R) CPU @ 2.00 GHz, Tesla T4 16 GB GPU, CUDA v.11.2, and 12 GB RAM).

4.1 The “PANDA: resized train data (512 × 512)” dataset experiments

The segmentation performance metrics of the U-Net model are reported in Table 6, which shows the segmentation scores for each grade. The model achieves an average segmentation accuracy of 98.46% on the “PANDA: Resized Train Data (512 × 512)” dataset, with an average loss of 0.0368, AUC of 0.9778, IoU of 0.9865, and Dice of 0.9873.
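For segmentation masks, IoU and Dice measure the overlap between the predicted and ground-truth regions; a minimal NumPy sketch on toy 3 × 3 masks (illustrative values, not the PANDA data):

```python
import numpy as np

def iou_and_dice(pred_mask, true_mask):
    """Overlap metrics between two binary segmentation masks:
    IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|)."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return inter / union, 2 * inter / (pred.sum() + true.sum())

pred = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])  # hypothetical predicted tumor region
true = np.array([[1, 1, 0],
                 [0, 0, 0],
                 [0, 0, 0]])  # hypothetical ground-truth region
iou, dice = iou_and_dice(pred, true)  # intersection 2, union 3
```

Dice weights the intersection twice, so it is always at least as large as IoU for the same pair of masks, which matches the 0.9873 versus 0.9865 ordering reported above.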

Table 6 The segmentation performance metrics using the U-Net model

4.2 The “ISUP grade-wise prostate cancer” dataset experiments

Table 7 presents the finest combinations for each CNN structure using the “ISUP Grade-wise Prostate Cancer” dataset; these combinations have been optimized using AO. From this table, it can be noticed that the KLDivergence loss is preferred by four models, while the standardization scaling technique and the SGD optimizer are each recommended by three models. Table 8 presents the corresponding performance metrics for these finest combinations after the learning and optimization processes for each CNN model. The best TP, TN, FP, and FN are 9,400, 52,883, 177, and 1,212, respectively. The best reported Accuracy, F1, Precision, Recall, Sensitivity, Specificity, AUC, IoU, Dice, Precision, and Cosine Similarity are 88.91%, 88.87%, 89.22%, 88.58%, 88.58%, 99.67%, 97.27%, 91.00%, 91.70%, 89.63%, and 89.80%, respectively, using the MobileNet pretrained model.

Table 7 The best solutions combinations concerning each model using the “ISUP Grade-wise Prostate Cancer” dataset
Table 8 The performance metrics concerning each model using the “ISUP Grade-wise Prostate Cancer” dataset

4.3 The “transverse plane prostate dataset” dataset experiments

Table 9 presents the finest combinations for each CNN structure using the “Transverse Plane Prostate Dataset” dataset; these combinations have been optimized using AO. From this table, it can be noticed that the Squared Hinge and Poisson losses are each preferred by three models, the standardization scaling technique is recommended by five models, and the SGD-Nesterov optimizer is recommended by four models. Table 10 gives the corresponding performance metrics for the finest combinations after optimizing the CNN models. The best TP, TN, FP, and FN are 1,519, 1,519, 0, and 0, respectively. The best reported Accuracy, F1, Precision, Recall, Sensitivity, Specificity, AUC, IoU, Dice, Precision, and Cosine Similarity are 100%, 100%, 100%, 100%, 100%, 100%, 100%, 99.96%, 99.97%, 100%, and 100%, respectively, using the MobileNet pretrained model.

Table 9 The best solutions combinations concerning each model using the “Transverse Plane Prostate Dataset” dataset
Table 10 The performance metrics concerning each model using the “Transverse Plane Prostate Dataset” dataset

4.4 Graphical summarizations

Figure 4 is a graphical representation of the classification results obtained for the “ISUP Grade-wise Prostate Cancer” dataset; it shows that MobileNet gives the best performance according to the different metrics. Figure 5 is a graphical representation of the classification results obtained for the “Transverse Plane Prostate Dataset” dataset; it shows that MobileNet and ResNet152 are the best models.

Fig. 4

Summarization of the learning and optimization experiments related to the “ISUP Grade-wise Prostate Cancer” dataset

Fig. 5

Summarization of the learning and optimization experiments related to the “Transverse Plane Prostate Dataset” dataset

4.5 Related studies comparisons

Table 11 shows a comparison between the suggested approach and related studies.

Table 11 Comparison between the suggested approach and related studies

5 Conclusions, limitations, and future work

Prostate cancer is a common type of cancer worldwide, and its mortality and morbidity have increased dramatically in the last few years. Early and accurate diagnosis of prostate cancer is therefore very important. In this study, we propose a hybrid framework for early and accurate classification and segmentation of prostate cancer using deep learning. The framework consists of two stages, namely the classification stage and the segmentation stage. In the classification stage, eight different CNN architectures applied via transfer learning were first fine-tuned using the Aquila optimizer and then used to distinguish patients with prostate cancer from normal ones. Once a patient is diagnosed with prostate cancer, the patient's images are passed to the segmentation stage to identify the regions of interest. This stage determines the shape, size, and volume of the tumor to help physicians apply the correct treatment. The framework is trained on three different datasets to ensure that it generalizes to different types of data. It achieved classification accuracies of 88.91% on the “ISUP Grade-wise Prostate Cancer” dataset and 100% on the “Transverse Plane Prostate Dataset” dataset, and an average segmentation accuracy of 98.46% on the “PANDA: Resized Train Data (512 × 512)” dataset.

Although the results of the study are promising, there are still some limitations.

5.1 Limitations

For example, only one dataset is used for segmentation. The choice of only eight transfer learning models from all the available models is another limitation. Finally, the use of U-Net alone for segmentation is among the limitations. Despite these limitations, the results of the current study are encouraging.

As future work, different CNN architectures can be applied within the proposed framework, and other meta-heuristic optimizers, such as the sparrow search algorithm, can be employed. Finally, the proposed framework can be applied to the diagnosis of other tumor types, such as brain tumors.