Introduction

Lung and colon cancers represent two of the most significant challenges in the field of oncology due to their high incidence and mortality rates globally. Effective and early diagnosis of these cancers is paramount for improving patient outcomes and survival rates. Histopathological examination of tissue samples, performed via microscopy, remains the gold standard for diagnosing these malignancies. However, the interpretation of histopathological images is highly dependent on the expertise of the pathologist and can be subject to variability. This underscores the need for more robust, accurate, and reproducible diagnostic methods that can augment human expertise.

In recent years, the field of medical imaging has been revolutionized by the advent of deep learning technologies, which have shown promise in enhancing the analysis of medical images, including histopathological images of cancerous tissues. Deep learning, a subset of machine learning, leverages neural networks with multiple layers to analyze various levels of abstraction in data. In medical applications, these technologies have been employed to identify, classify, and predict disease patterns with remarkable success.

Traditionally, the diagnosis of lung and colon cancers involves the histological examination of tissue sections under a microscope [1]. Pathologists review these sections to identify morphological features indicative of malignant transformations. The accuracy of these diagnoses heavily relies on the individual pathologist’s skill and experience, which can lead to significant inter-observer variability. Moreover, the process can be time-consuming and is often limited by the volume of cases that a pathologist can examine within a given timeframe. Figure 1 depicts the histopathological images of lung and colon cancer.

Fig. 1 Histopathological images of lung and colon cancer

While automated systems have been developed to assist pathologists by quantifying histopathological features, these systems often rely on conventional machine learning techniques that require manual feature extraction. This process can be labor-intensive and may not capture the nuanced features of the cancerous tissues effectively. Moreover, many existing automated systems do not generalize well across different datasets or medical centers due to variations in staining techniques, image capture conditions, and intrinsic biological heterogeneity.

Deep learning models, particularly convolutional neural networks (CNNs), have emerged as a powerful tool for medical image analysis, capable of automating feature extraction and providing robust generalizations across diverse datasets [2]. These models learn to identify disease markers directly from the images, reducing the reliance on manual feature labeling and potentially decreasing the variability introduced by human interpretation.

The Xception architecture, which extends the principles of Inception by replacing standard convolutions with depthwise separable convolutions, offers a high degree of model adaptability and learning capacity. On the other hand, MobileNet, designed for mobile and resource-constrained environments, utilizes depthwise separable convolutions to create lightweight, efficient models. By combining these two architectures, the approach harnesses both the depth and efficiency of these models, allowing for a sophisticated analysis of histopathological images that is both accurate and computationally feasible.

The proposed system integrates these two powerful architectures in a novel ensemble approach where both networks are first pre-trained on the ImageNet dataset to learn a wide range of image features. These pre-trained models are then fine-tuned on a curated dataset of lung and colon histopathological images, ensuring that the model specializes in features that are most predictive of cancerous conditions. The outputs of these networks are concatenated and fed into a series of dense layers, culminating in a classification layer that distinguishes between the different cancerous and non-cancerous tissue types.

An integral component of the methodology is the use of Gradient-weighted Class Activation Mapping (Grad-CAM), which provides visual explanations for the decisions made by the model. This technique generates heatmaps that highlight the areas within the images most influential to the model’s predictions, thereby offering insights into what the model is "seeing" when it makes a diagnosis. This transparency is crucial for clinical acceptance of AI tools, as it provides pathologists with understandable and interpretable evidence of the model’s diagnostic pathways.

Motivation

The motivation for this research stems from the critical need to enhance the accuracy and consistency of cancer diagnostics. Despite advances in imaging and computational tools, the potential of deep learning in medical imaging, particularly in improving the analysis and interpretation of histopathological images, has not been fully exploited. Existing automated systems often struggle with overfitting, limited generalizability across different datasets, and the inability to capture subtle yet crucial features in images. There is a pressing need for a robust model that not only improves diagnostic accuracy but also offers insights into its decision-making process, thereby aiding pathologists and enhancing trust in automated systems.

The objectives of this paper are to:

  • Concatenate the outputs of Xception and MobileNet and feed them into dense layers, leading to a classification layer that distinguishes between cancerous and non-cancerous tissue types, enhancing diagnostic accuracy.

  • Utilize Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps, offering visual explanations for the model's decisions and providing transparent insights into its diagnostic pathways, crucial for clinical acceptance.

  • Contribute to advancing medical diagnostics by providing a scalable solution applicable to various cancers and complex medical imaging tasks, fostering more automated, precise, and reliable diagnostic processes to meet the demands of modern healthcare systems.

The paper is structured as follows: Related Work reviews deep learning in cancer diagnosis. Methodology details the model, dataset, training, and Grad-CAM. Results and Discussion presents performance metrics and Grad-CAM findings. Conclusion and Future Work summarizes findings, applications, and future research.

Related work

The application of artificial intelligence in medical imaging, particularly using deep learning techniques, has marked a significant progression in the diagnosis and characterization of various cancers. The utilization of convolutional neural networks (CNNs) has been extensively studied due to their ability to automatically detect intricate patterns in imaging data that are often indiscernible to human eyes.

Before the widespread adoption of deep learning, several traditional machine learning methods were employed in medical image analysis. Techniques such as Support Vector Machines (SVM), Random Forests, and k-Nearest Neighbors (kNN) were commonly used. These models often required meticulous feature engineering and were limited by the handcrafted nature of feature extraction, which reduced their efficacy in complex image classification tasks.

The shift from traditional models to CNNs introduced a significant breakthrough in medical diagnostics. CNNs, such as AlexNet, VGG, and Inception, have been applied to classify and predict various forms of cancers with promising results. These models eliminate the need for manual feature extraction by learning image representations directly from the data, leading to improved accuracy and efficiency in medical image analysis.

For lung and colon cancer, several studies have highlighted the use of specific architectures such as MobileNet and Xception. MobileNet is a model designed for efficient performance in mobile and embedded vision applications, which has been adapted for medical imaging to handle resource constraints without compromising diagnostic accuracy. The Xception model employs depthwise separable convolutions and has demonstrated superior performance on tasks requiring high-level feature extraction, such as distinguishing between different types of lung and colon tissue samples [3].

Recent studies have explored the integration of multiple deep learning models to leverage the strengths of various architectures. For instance, the combination of MobileNet and Xception has been examined for its potential to enhance feature representation and reduce overfitting, providing a more robust analysis of complex medical images. This approach aligns with the ongoing trend of ensemble models in deep learning, where multiple networks are used to improve predictive performance and reliability.

The importance of model interpretability in medical AI has also been increasingly recognized. Techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) allow clinicians to visualize which areas of an image influence the network’s predictions. This transparency is crucial for clinical acceptance, as it aids in the verification of AI-driven diagnostics by highlighting the decision-making process of the model.

The integration of deep learning technologies in medical imaging, especially for lung and colon cancer diagnosis, continues to evolve. This paper builds on these developments by proposing an innovative model that combines the strengths of MobileNet and Xception, enhanced by interpretative mechanisms such as Grad-CAM, to set new standards in accuracy and reliability in histopathological image analysis.

To provide a structured comparison of these studies, Table 1 summarizes their contributions, methodologies, and key findings.

Table 1 Literature review

This table illustrates the diversity and progression in methodologies from pure image classification models to complex architectures integrating features like attention mechanisms and hybrid model configurations. Each study contributes uniquely towards refining the diagnostic processes for lung and colon cancers, demonstrating the ongoing advancements and potential of deep learning in medical imaging. This literature review forms the backdrop against which the current research is positioned, aiming to leverage and expand upon these recent innovations.

Methodology

This research investigates the application of convolutional neural networks (CNNs) in the classification of histopathological images of lung and colon cancer. The primary objective is to compare the performance of two well-known architectures, MobileNet and Xception, integrated into a single model framework to enhance classification accuracy. This section details the dataset, preprocessing steps, model architecture, training process, and evaluation metrics employed in the study. Figure 2 showcases the workflow of the proposed model.

Fig. 2 Workflow of the proposed model

Dataset description

The Lung and Colon Cancer Histopathological Image Dataset (LC25000), developed by a team of researchers, provides a robust resource for machine learning research in medical diagnostics. This dataset contains 25,000 high-quality, de-identified, and HIPAA-compliant histopathological images divided into five classes: Colon Adenocarcinoma, Benign Colonic Tissue, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, and Benign Lung Tissue, with each class contributing 5,000 images. These images are integral for developing algorithms capable of distinguishing between malignant and benign tissue samples in both colon and lung tissues [17]. Validated by expert pathologists and structured for machine learning applications, each image adheres to standardized formatting and resolution specifications to ensure consistency and reliability in training and testing AI models, thereby advancing the field of medical image analysis and the early detection of cancer. Table 2 provides an overview of the dataset distribution across the different classes, illustrating the balance or imbalance within the dataset used for training the model. Figure 3 depicts sample input images from the LC25000 dataset.

Table 2 Dataset distribution
Fig. 3 Sample input images from the dataset

Data preprocessing

Data preprocessing is a crucial step in preparing raw images for effective model training. The first step involves resizing all images to 224 × 224 pixels to match the input size requirements of the pre-trained MobileNet and Xception models, ensuring consistency across the dataset. Next, pixel values are normalized to the range [-1, 1], which facilitates faster convergence during training by stabilizing the learning rate and improving numerical stability. To further enhance the robustness of the model against overfitting and to ensure generalization across different imaging conditions, several image augmentation techniques were implemented. Specifically:

  • Random rotations up to 20 degrees to simulate variations in sample positioning.

  • Width and height shifts up to 10% to mimic variations in the scanning field.

  • Horizontal flipping to represent mirror variations in image orientations.

  • Zoom augmentation up to 10% to simulate variations in the zoom level of imaging devices.

These augmentations help in making the model robust to variations that might be encountered in real-world diagnostic settings. Table 3 summarizes the image augmentation techniques used to enhance the training dataset:

Table 3 Summary of image augmentation techniques
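As an illustration, the sketch below shows one way such an augmentation pipeline could be configured with Keras' ImageDataGenerator, using the parameter values listed above; the directory path and class-folder layout are assumptions for illustration only.

```python
import tensorflow as tf

# Augmentation settings mirroring Table 3: rotations up to 20 degrees,
# width/height shifts up to 10%, horizontal flips, and zoom up to 10%.
# Pixel values are also rescaled to [-1, 1] as described in the preprocessing step.
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1,
    preprocessing_function=lambda x: x / 127.5 - 1.0,  # scale [0, 255] to [-1, 1]
)

# Hypothetical directory layout: one sub-folder per class (e.g. lung_aca, lung_n, ...).
train_gen = train_datagen.flow_from_directory(
    "data/train",                 # assumed path
    target_size=(224, 224),       # matches the MobileNet/Xception input size
    batch_size=32,
    class_mode="categorical",
)
```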

Equation 1 provides the mathematical representation of resizing an image \(I\) to width \(w\) and height \(h\), ensuring a uniform input size for the model, and Eq. 2 describes the normalization process, where pixel values \(P\) in the image are scaled to the range [-1, 1].

$$\text{Resized Image}={\text{resize}}\left(I,\left(w,h\right)\right)$$
(1)
$$P^{'}=\frac{P-\text{min}\left(P\right)}{\text{max}\left(P\right)-\text{min}\left(P\right)}\times 2-1$$
(2)
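The following minimal sketch applies Eqs. 1 and 2 directly to a single image tensor, assuming TensorFlow; the function name is illustrative.

```python
import tensorflow as tf

def preprocess_image(image: tf.Tensor, w: int = 224, h: int = 224) -> tf.Tensor:
    """Resize an image to (w, h) per Eq. 1 and scale its pixels to [-1, 1] per Eq. 2."""
    resized = tf.image.resize(image, (h, w))          # Eq. 1: resize(I, (w, h))
    p = tf.cast(resized, tf.float32)
    p_min = tf.reduce_min(p)
    p_max = tf.reduce_max(p)
    return (p - p_min) / (p_max - p_min) * 2.0 - 1.0  # Eq. 2: min-max scaling to [-1, 1]
```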

Finally, the dataset is divided into training (80%), validation (10%), and test (10%) sets. This split ensures that the model is trained on a substantial portion of the data, validated on a separate subset to tune hyperparameters and monitor performance, and tested on unseen data to evaluate its generalization ability. This comprehensive preprocessing pipeline is essential for building a robust and effective image classification model. Table 4 details how the dataset was divided into training, validation, and testing sets. Equation 3 expresses the splitting of a dataset \(D\) into training and testing subsets according to a ratio \(r\).

Table 4 Data splitting
$$\text{Train Set},\text{Test Set}={\text{split}}\left(D,r\right)$$
(3)
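One possible realization of the 80/10/10 partition applies Eq. 3 twice with scikit-learn's train_test_split; the stratification and random seed below are assumptions.

```python
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed: int = 42):
    """Split (X, y) into 80% train, 10% validation, 10% test (Eq. 3 applied twice)."""
    # First hold out 20% of the data, then divide that portion equally into
    # validation and test sets, giving an overall 80/10/10 partition.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```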

Model architecture

The model architecture described leverages the strengths of two powerful convolutional neural networks (CNNs), MobileNet and Xception, which are widely used in image classification tasks. By combining these models, the approach aims to enhance the feature extraction capabilities, thus improving classification performance for histopathological image analysis.

Depthwise separable convolutions are a refined type of convolution that significantly reduces the computational cost and number of parameters compared to traditional convolutions. This technique decomposes the standard convolution operation into two layers: depthwise and pointwise convolutions. In depthwise convolution, a single filter is applied per input channel, mapping each input channel to an output channel independently, thereby not mixing inputs across channels. This is followed by pointwise convolution, which uses 1 × 1 convolutions to combine the outputs of the depthwise convolution across channels. This method enhances the model's efficiency by reducing the computational load and the model size while maintaining effective feature extraction. Equation 4 gives the standard convolution operation that MobileNet and Xception factorize into these depthwise and pointwise steps, which is the source of their computational efficiency.

$${Y}_{k,l,n}={\sum }_{i,j,m}{K}_{i,j,m,n}\cdot {X}_{k+i,l+j,m}$$
(4)
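To make the factorization concrete, the short Keras sketch below contrasts a standard convolution with its depthwise-plus-pointwise counterpart; the filter counts are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))

# Standard convolution (Eq. 4): every filter mixes all input channels at once.
standard = layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

# Depthwise separable alternative: one spatial filter per input channel (depthwise),
# followed by 1x1 pointwise convolutions that recombine the channels.
depthwise = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
pointwise = layers.Conv2D(64, kernel_size=1, padding="same")(depthwise)

# Keras also packages both steps in a single layer:
separable = layers.SeparableConv2D(64, kernel_size=3, padding="same")(inputs)
```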

MobileNet leverages depthwise separable convolutions to offer an architecture optimized for mobile and embedded applications. By simplifying the convolution process into depthwise and pointwise steps, MobileNet dramatically reduces computational requirements, making it suitable for devices with limited processing power. It introduces tunable parameters such as width multiplier and resolution multiplier, which allow for flexible architecture scaling to balance between speed and accuracy. MobileNet's efficiency makes it particularly useful for real-time applications on mobile devices, including face recognition and augmented reality.

Xception, or "Extreme Inception," advances the design of the Inception models by employing depthwise separable convolutions across its architecture, replacing traditional convolutions. This adjustment not only reduces the model's parameter count but also enhances its capability to process high-resolution images effectively. Xception's architecture facilitates the independent learning of spatial features and channel-wise features by separating the convolution operations, which improves the model's ability to discern more complex patterns in the data. Due to its efficient handling of model parameters and depthwise learning approach, Xception is well-suited for tasks that involve large-scale learning from extensive, high-dimensional data sets, such as detailed image classification tasks.

Both models are initialized with weights from training on the ImageNet dataset, a large dataset featuring over a million images across 1000 categories [18]. Using pre-trained weights allows the model to start with a robust understanding of visual features that are applicable across a wide range of image classification tasks.

In the initial training phase, the convolutional bases of both MobileNet and Xception are frozen, meaning that the weights in these layers do not change during the first few training epochs. Freezing is critical because it prevents the well-learned features from being distorted while the latter parts of the network are being trained. This is especially beneficial when the new dataset (histopathological images in this case) has far fewer examples than the original ImageNet dataset. Equation 5 is the modified convolution equation incorporating strides \(s\) to reduce the dimensionality of the feature maps.

$${F}_{xy}={\sum }_{a=1}^{A}{\sum }_{b=1}^{B}{I}_{sx+a,sy+b}\cdot {K}_{a,b}$$
(5)
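A minimal sketch of this transfer-learning setup, assuming the Keras applications API; the input size and layer freezing follow the description above.

```python
import tensorflow as tf

# Both backbones are instantiated with ImageNet weights and without their original
# classification heads, then frozen for the initial training phase.
mobilenet_base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
xception_base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)

for base in (mobilenet_base, xception_base):
    base.trainable = False  # keep the pre-trained convolutional features intact
```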

After feature extraction through both the MobileNet and Xception networks, the outputs are concatenated. MobileNet captures more granular details due to its lightweight architecture, whereas Xception, leveraging a deeper and more complex structure, might extract higher-level features. Concatenating these features ensures that the final feature set is both comprehensive and robust, capturing a broad spectrum of image characteristics at different levels of abstraction. Equation 6 mathematically represents the concatenation of features from MobileNet and Xception.

$$C={\text{Concat}}\left(A,B\right)$$
(6)

The concatenation layer merges the output tensors of both networks. If both feature extractors output feature maps of the same dimensions, these can be concatenated along the depth axis to form a single, unified feature map. This concatenated feature map then serves as the input to the subsequent layers of the network. Equation 7 shows how the concatenated features are linearly combined to form the input to the next layer.

$${O}_{i}={\sum }_{j\in S}{W}_{i,j}\cdot {C}_{j}+{b}_{i}$$
(7)

The classification head is the part of the model that makes the final prediction. The first step in the classification head is to flatten the 3D output of the concatenated feature maps into a 1D vector. This transformation prepares the data for processing in the fully connected dense layers. Following flattening, the feature vector is passed through a series of dense layers. Each dense layer consists of a set number of neurons, each with ReLU (Rectified Linear Unit) activation. ReLU is chosen for its ability to introduce non-linearity into the network, helping to learn complex patterns in the data [19]. The number of neurons and layers can be tuned based on the complexity of the task and the computational resources available. Equation 8 defines the non-linear activation function used in the network, which is essential for learning complex patterns.

$$f\left(x\right)=\text{max}\left(0,x\right)$$
(8)

In the architecture of our proposed model, the flatten layer plays a pivotal role by transforming the complex, multi-dimensional feature maps produced by the convolutional layers into a flat, one-dimensional array. This transformation is essential as it bridges the convolutional base of the network, which is adept at spatial feature extraction, with the dense layers that perform classification. The flatten layer ensures that the spatial relationships within the feature maps are linearized, allowing the dense layers to effectively learn from the entirety of the extracted features for accurate cancer classification.

Following the concatenation of outputs from MobileNet and Xception, our model includes several dense layers designed to effectively harness and interpret the rich feature sets provided by these two powerful networks. The sequence begins with a Flatten layer, which transitions the multidimensional feature maps into a one-dimensional array. This array is then processed through three dense layers: a first dense layer with 1024 neurons utilizing ReLU activation to learn complex, high-level features from the concatenated inputs; a second dense layer with 512 neurons, also with ReLU activation, to further refine these features; and a third dense layer with 256 neurons, continuing the pattern of ReLU activation, to prepare the refined features for final classification. The culmination of this dense layer sequence is an output layer that employs softmax activation to produce a probabilistic distribution across the defined classes, enabling precise cancer classification based on histopathological images. This structured approach ensures robust learning and contributes significantly to the high accuracy and reliability of our model in medical diagnostics. Equation 9 describes the output layer's activation function, which normalizes the output to a probability distribution over the predicted classes.

$$\upsigma {\left(z\right)}_{j}=\frac{{e}^{{z}_{j}}}{{\sum }_{k=1}^{K}{e}^{{z}_{k}}}\text{ for }j=1,\dots ,K$$
(9)
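Building on the frozen backbones from the previous sketch, the following is one plausible Keras assembly of the concatenation and classification head described in this section (Eqs. 6-9); the layer sizes follow the text, while the variable names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Reuses mobilenet_base and xception_base from the previous sketch (frozen, ImageNet weights).
inputs = tf.keras.Input(shape=(224, 224, 3))

# Both backbones see the same input; with 224 x 224 inputs their feature maps share
# a 7 x 7 spatial size, so they can be concatenated along the channel axis (Eq. 6).
feat_a = mobilenet_base(inputs)
feat_b = xception_base(inputs)
merged = layers.Concatenate()([feat_a, feat_b])

# Classification head: Flatten, then 1024 -> 512 -> 256 ReLU dense layers (Eqs. 7 and 8),
# ending in a softmax output over the five tissue classes (Eq. 9).
x = layers.Flatten()(merged)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
```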

Algorithm I expresses the steps included in the proposed model.

Algorithm I: Proposed model algorithm

The architecture is designed for deep feature extraction followed by classification, leveraging the strengths of both MobileNet and Xception models for a potentially robust performance in image classification tasks.

Grad-CAM implementation

Gradient-weighted Class Activation Mapping (Grad-CAM) is a technique designed to enhance the interpretability of convolutional neural networks (CNNs), especially those applied in vision-related tasks. Grad-CAM aids in visualizing which regions of an input image are significant for the model's predictions, making it particularly valuable in fields like medical image analysis where understanding the model's decision-making process is crucial. The process involves several steps. First, a forward pass processes the image through the CNN, producing feature maps. A target convolutional layer is then selected, providing a spatial map that highlights image areas crucial for the class prediction. Next, the gradients of the class score with respect to these feature maps are computed and global-average-pooled to derive importance weights for each channel. Finally, a weighted combination of the feature maps produces a coarse heatmap, to which a ReLU function is applied so that only features positively influencing the class are retained, enhancing the interpretative utility of the CNN. Figure 4 showcases the images after applying Grad-CAM.

Fig. 4 Grad-CAM images
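The sketch below outlines these Grad-CAM steps with tf.GradientTape; the layer and variable names are placeholders, and for the dual-backbone model the target layer would be taken from one of the backbone sub-models.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a normalized Grad-CAM heatmap for one preprocessed image of shape (H, W, 3)."""
    # Model that maps the input image to (target conv feature maps, predictions).
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )

    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        class_score = preds[:, class_index]

    # Gradients of the class score w.r.t. the feature maps, pooled into per-channel weights.
    grads = tape.gradient(class_score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted combination of the feature maps, then ReLU and normalization to [0, 1].
    heatmap = tf.reduce_sum(conv_maps[0] * weights, axis=-1)
    heatmap = tf.nn.relu(heatmap)
    heatmap /= tf.reduce_max(heatmap) + tf.keras.backend.epsilon()
    return heatmap.numpy()
```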

This architecture, combining two powerful pre-trained models and a robust classification head, is designed to effectively classify complex medical images, providing both high accuracy and efficient computation.

Training procedure

Training a deep learning model effectively requires careful consideration of various components, including the choice of optimizer, loss function, learning rate adjustments, and techniques to prevent overfitting [20, 21]. Our model uses the Adamax optimizer, known for handling sparse gradients effectively. The initial learning rate is set at 0.001, with dynamic adjustments made using a custom Learning Rate Adjustment (LRA) callback based on training performance. We use a batch size of 32 for training to balance computational efficiency and performance, with test batch size adjusted to fit memory constraints. The loss function is categorical crossentropy, ideal for our multi-class classification task. An early stopping mechanism monitors validation loss with a patience of 10 epochs to prevent overfitting. These settings ensure robust performance across different data distributions, facilitating efficient model convergence and high accuracy, which is crucial for medical image analysis. Equation 10 details the update rule for Adamax, emphasizing its role in adjusting the model weights.

$${\uptheta }_{t+1}={\uptheta }_{t}-\frac{\upeta }{\sqrt{\widehat{{v}_{t}}}+\upepsilon }\widehat{{m}_{t}}$$
(10)

Categorical crossentropy is selected as the loss function, which is appropriate for multi-class classification tasks, as given in Eq. 11:

$$\mathcal{L}=-{\sum }_{i=1}^{C}{t}_{i}\text{log}\left({p}_{i}\right)$$
(11)

This loss function penalizes incorrect classifications more heavily, ensuring that the model learns to output probabilities close to the actual class labels.

To enhance the training process, a custom Learning Rate Adjustment (LRA) callback has been developed. This callback utilizes a dynamic mechanism to adjust the learning rate based on ongoing training performance, as outlined in Eq. 12:

$${\upeta }_{\text{new}}=\upeta \cdot \text{decay}{\_}\text{factor}$$
(12)

This method ensures that the learning rate is fine-tuned in response to the model’s progress, promoting more effective learning and convergence.

The LRA callback actively monitors both training accuracy and validation loss during the training phase. If the training accuracy fails to meet a set threshold, or if the validation loss does not show improvement over a defined number of epochs (specified by the patience parameter), the learning rate is adjusted downward. This patience mechanism is crucial as it prevents premature adjustments of the learning rate, allowing sufficient time for the model to explore the solution space thoroughly and avoid local minima.

When the conditions for adjustment are met, the learning rate is decreased by a predetermined factor, such as 0.5. This reduction helps in subtly fine-tuning the model’s weights, which is essential for achieving optimal performance without causing significant disruptions that could negatively impact the training dynamics.

Additionally, the callback includes an optional 'dwell' parameter, which offers a safety net by reverting the model to the best weights observed before the plateau or decline in performance.
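A simplified sketch of such a learning-rate adjustment callback using the Keras callback API is shown below; the decay factor, patience, and 'dwell' behaviour follow the description above, while the exact thresholds are illustrative.

```python
import tensorflow as tf

class LearningRateAdjustment(tf.keras.callbacks.Callback):
    """Reduce the learning rate when validation loss plateaus; optionally 'dwell'
    by reverting the model to the best weights seen so far."""

    def __init__(self, factor=0.5, patience=3, dwell=True):
        super().__init__()
        self.factor = factor      # reduction factor from Eq. 12 (e.g. 0.5)
        self.patience = patience  # epochs to wait before adjusting (illustrative value)
        self.dwell = dwell
        self.best_loss = float("inf")
        self.best_weights = None
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        val_loss = (logs or {}).get("val_loss")
        if val_loss is None:
            return
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_weights = self.model.get_weights()
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                old_lr = float(self.model.optimizer.learning_rate.numpy())
                self.model.optimizer.learning_rate.assign(old_lr * self.factor)  # Eq. 12
                if self.dwell and self.best_weights is not None:
                    self.model.set_weights(self.best_weights)  # revert to best weights
                self.wait = 0
```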

Early stopping is employed to prevent overfitting and ensure the model generalizes well to unseen data. It monitors the validation loss; if the validation loss does not improve for a set number of consecutive epochs (the patience), training is halted. Equation 13 provides the criterion for stopping the training process early to prevent overfitting.

$$\text{If }\Delta {\text{loss}}<\text{threshold for }n\text{ epochs, stop training}$$
(13)

The choice of the number of epochs and batch size is critical to ensure efficient training without overfitting. The model is trained for a maximum of 10 epochs. This is a preliminary setting to evaluate the model's convergence. A batch size of 32 is chosen for training. This batch size provides a balance between computational efficiency and model performance. It allows the optimizer to update weights more frequently within each epoch, which can lead to faster convergence. The test batch size is chosen such that the entire test set can be divided into batches evenly, optimizing memory usage and processing time [22].

The decision to train our model for up to 10 epochs was strategically made based on the model’s performance during preliminary trials and the incorporation of an early stopping mechanism. This mechanism is designed to halt training once the validation loss ceases to decrease over a series of epochs, effectively preventing overtraining and ensuring model robustness. Our experimental results demonstrated that further increases in the number of training epochs did not substantially enhance model performance, indicating that the optimal learning capacity was achieved within the first 10 epochs. Additionally, this approach aligns with our goal to maintain computational efficiency and reduce the environmental impact of extensive training sessions, without sacrificing the model's performance and accuracy in diagnosing lung and colon cancers.
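For completeness, a hedged sketch of how these training settings could be wired together, reusing the model, callback, and generators from the earlier sketches; the validation generator is assumed to be built like the training generator but without augmentation.

```python
import tensorflow as tf

# model: dual-backbone network from the architecture sketch;
# train_gen / val_gen: training and validation generators, already batched with size 32.
model.compile(
    optimizer=tf.keras.optimizers.Adamax(learning_rate=0.001),  # Eq. 10
    loss="categorical_crossentropy",                            # Eq. 11
    metrics=["accuracy"],
)

callbacks = [
    LearningRateAdjustment(factor=0.5, patience=3),             # custom LRA sketch, Eq. 12
    tf.keras.callbacks.EarlyStopping(                           # early stopping, Eq. 13
        monitor="val_loss", patience=10, restore_best_weights=True
    ),
]

history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=10,        # maximum of 10 epochs, as stated in the text
    callbacks=callbacks,
)
```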

Statistical methods

Evaluating the performance of a classification model, especially in medical imaging, requires a comprehensive set of metrics to ensure robustness, reliability, and generalizability. This study employs several evaluation metrics to assess the performance of the model in classifying histopathological images. Each metric provides unique insights into the model's strengths and weaknesses.

Accuracy is the proportion of correctly classified instances among the total instances evaluated. It is one of the most straightforward metrics for evaluating classification models. Accuracy gives a general sense of how well the model performs across all classes and is computed using the formula in Eq. 14. However, it may not be sufficient alone, especially in cases of class imbalance where the number of instances per class varies significantly.

$${\text{Accuracy}}=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
(14)

Precision, also known as positive predictive value, measures the accuracy of the positive predictions made by the classifier. High precision indicates that the model has a low false positive rate. In medical imaging, this is crucial to minimize the misdiagnosis of non-cancerous images as cancerous. Equation 15 gives the formula for calculating precision.

$${\text{Precision}}=\frac{TP}{TP+FP}$$
(15)

Recall, also known as sensitivity or true positive rate, measures the ability of the classifier to identify all relevant instances (true positives) and is calculated using the formula in Eq. 16. High recall indicates that the model successfully identifies most of the positive instances. In medical applications, high recall is essential to ensure that most cancerous images are correctly identified.

$${\text{Recall}}=\frac{TP}{TP+FN}$$
(16)

The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. It is particularly useful when the class distribution is imbalanced. The F1-score ranges from 0 to 1, with 1 being the best possible score. It offers a more comprehensive evaluation than accuracy alone by considering both false positives and false negatives. The F1-score is calculated using the formula in Eq. 17.

$$F1=2\cdot \frac{{\text{Precision}}\cdot {\text{Recall}}}{{\text{Precision}}+{\text{Recall}}}$$
(17)

The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a performance measurement for classification problems at various threshold settings and can be calculated using the formula in Eq. 18. The ROC curve is a plot of the true positive rate (recall) against the false positive rate. The AUC value ranges from 0 to 1; a higher AUC indicates better model performance. An AUC of 0.5 suggests no discriminative ability, whereas an AUC close to 1 indicates excellent model performance. The AUC-ROC curve provides a visual representation of the trade-off between true positive and false positive rates, helping to assess the model's discriminatory power.

$$AUC={\int }_{0}^{1}TPR\left(t\right)\hspace{0.17em}dFPR\left(t\right)$$
(18)

A confusion matrix is a table used to describe the performance of a classification model. It shows the number of true positive, true negative, false positive, and false negative predictions. The confusion matrix provides a detailed breakdown of the model's performance across all classes, highlighting potential biases towards particular classes. It allows the identification of specific classes where the model may struggle, enabling targeted improvements.

The use of multiple evaluation metrics ensures a comprehensive assessment of the model's performance. Accuracy provides a general overview, while precision and recall give insights into the handling of false positives and false negatives. The F1-score offers a balanced measure, the AUC-ROC curve evaluates discriminative ability, and the confusion matrix provides a detailed breakdown of performance across all classes. This multifaceted evaluation approach is crucial in medical imaging applications, where the implications of misclassification are significant.
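A brief sketch of how these metrics could be computed with scikit-learn once the model has produced class probabilities on the test set; the array names are placeholders.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def evaluate_predictions(y_true: np.ndarray, y_prob: np.ndarray, class_names):
    """y_true: integer labels; y_prob: softmax outputs of shape (n_samples, n_classes)."""
    y_pred = np.argmax(y_prob, axis=1)

    # Accuracy plus per-class precision, recall, and F1 (Eqs. 14-17).
    print(classification_report(y_true, y_pred, target_names=class_names, digits=4))

    # One-vs-rest AUC-ROC averaged over classes (Eq. 18).
    print("AUC-ROC:", roc_auc_score(y_true, y_prob, multi_class="ovr"))

    # Detailed breakdown of true/false positives and negatives per class.
    print(confusion_matrix(y_true, y_pred))
```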

This methodology section provides a detailed description of the processes involved in training a deep learning model for the classification of histopathological images. By employing advanced CNN architectures and a robust training procedure, the study aims to achieve high accuracy and reliability, which are crucial for medical image analysis.

Experiment and analysis

The experimental evaluation was structured to assess the effectiveness of the integrated MobileNet and Xception model in classifying histopathological images of lung and colon cancer. The LC25000 dataset was utilized, involving an extensive series of experiments to validate the model's performance under various conditions.

Training details

The model was trained for up to 10 epochs, with training potentially stopping earlier based on the early stopping criteria, which monitored validation loss. The early stopping mechanism ensured that the model did not continue to train once performance ceased to improve, thus preventing overfitting. The batch size for training was set to 32, which is a common choice that balances computational efficiency and model performance. For the test set, an appropriate batch size was calculated to ensure that the entire test dataset was processed without leaving out any samples, optimizing memory usage and processing time. Figure 5 depicts the graphical representation of training and validation loss and accuracy.

Fig. 5 Graphical representation of training and validation

The optimizer used was Adamax, a variant of the Adam optimizer known for its effectiveness in handling sparse gradients, which are common in high-dimensional data such as images. The initial learning rate was set to 0.002 and was dynamically adjusted according to a custom Learning Rate Adjustment (LRA) callback. This callback monitored both training accuracy and validation loss, making necessary adjustments to the learning rate to optimize the training process.

The loss function employed was categorical crossentropy, which is appropriate for multi-class classification tasks. This function computes the loss by comparing the predicted probabilities to the actual class labels, penalizing incorrect predictions more heavily. This ensures that the model learns to output probabilities close to the actual class labels, thus improving classification accuracy. Figure 6 showcases the learning curve of the model.

Fig. 6 Learning curve

The model achieved an overall accuracy of 99.44% on the test set, demonstrating exceptional performance across various classes. Precision ranged from 0.9559 for Lung Squamous Cell Carcinoma (SCC) to 1.0000 for both Colon Adenocarcinoma (ACA) and benign Lung tissue (Lung N). Recall values were also high, with the lowest being 0.9508 for Lung Adenocarcinoma (ACA), indicating the model's strong ability to correctly identify true positive cases. The F1-scores were consistently high across all classes, reflecting a balanced performance between precision and recall. These results underscore the model's robustness and effectiveness in accurately classifying histopathological images of lung and colon cancers. Figure 7 shows the heatmap of the classification report for the proposed model.

Fig. 7 Classification report

To gain a deeper understanding of the model's performance, a detailed error analysis was conducted, scrutinizing misclassifications to identify patterns and potential causes. Key findings from this analysis include inter-class confusion, where most misclassifications occurred between visually similar classes such as Lung Squamous Cell Carcinoma (SCC) and Lung Adenocarcinoma (ACA), suggesting a need for the model to learn more discriminative features specific to each subtype. Additionally, errors were more frequent in images with lower resolution or poor staining quality, indicating the importance of consistent image quality to enhance model reliability. Despite the dataset being balanced, intrinsic variations in texture and pattern complexity within classes might have contributed to biased error rates towards more complex patterns, commonly seen in benign versus malignant distinctions. Furthermore, Grad-CAM visualizations of misclassified cases revealed that incorrect predictions often focused on non-relevant regions of the images, indicating potential improvements needed in the model's region of interest detection capabilities. Figure 8 showcases correctly classified instances, whereas Fig. 9 shows misclassified instances.

Fig. 8 Correctly classified instances

Fig. 9 Misclassified instances

The AUC-ROC curves for each class demonstrated excellent discriminative performance, with values close to 1. This indicates that the model has a high true positive rate while maintaining a low false positive rate across all classes. Figure 10 shows the ROC curve of the model.

Fig. 10 ROC curve

The confusion matrix provided a detailed breakdown of performance, highlighting a few confusions between Lung SCC and Lung ACA and underscoring areas for potential improvement in future model iterations. Figure 11 depicts the confusion matrix of the proposed model.

Fig. 11 Confusion matrix

To contextualize the performance of the integrated MobileNet and Xception model, we compared it with existing methodologies in histopathological image classification. Table 5 compares the proposed model with existing models and their techniques.

Table 5 Comparative study

The integrated MobileNet and Xception model demonstrates a high degree of accuracy and robustness in classifying histopathological images of lung and colon cancers. The architecture effectively captures and utilizes complex features from the images, significantly outperforming traditional models. Despite the advancements, the proposed model, like many deep learning approaches, faces potential limitations that could impact its broad applicability in clinical settings. A significant limitation is the model’s reliance on the specific dataset used for training. While the LC25000 dataset provides a robust platform for developing and testing the model, its unique characteristics, such as image acquisition techniques, staining protocols, and the distribution of cancer types, may not be representative of broader clinical environments. This could potentially limit the model's performance when applied to data from different hospitals or labs with varying imaging standards.

Furthermore, there is a risk of inherent biases within the dataset, such as overrepresentation or underrepresentation of certain demographic groups or cancer stages, which could skew the model's predictions. Such biases might result in reduced accuracy when the model is deployed in diverse real-world scenarios, where the distribution of cases might differ significantly from the training dataset.

Conclusion

The integration of Xception and MobileNet architectures, along with the application of Gradient-weighted Class Activation Mapping (Grad-CAM), has significantly advanced the classification of histopathological images for lung and colon cancers. This study has demonstrated the capability of these combined architectures to enhance feature extraction, improve generalizability, and reduce overfitting, achieving an accuracy rate of 99.44% on a balanced test set. The precision and recall metrics indicate the model's exceptional performance in identifying specific cancerous and non-cancerous tissue types, with several categories achieving perfect scores.

The utilization of Grad-CAM has further augmented the interpretability of the model, offering a transformative approach to understanding deep learning decisions in medical diagnostics. By generating heatmaps that highlight influential regions used by the CNN for making predictions, Grad-CAM enables clinicians to visually verify these automated insights. This visualization facilitates a deeper understanding and trust in the model's functionality, crucial for integrating AI tools into clinical workflows. Moreover, such detailed visual explanations help in educational settings, where medical professionals can observe how advanced models discern subtle nuances in histopathological images that may be overlooked in standard examinations. Enhancing the transparency of the model's decision-making process through Grad-CAM visualizations not only bolsters diagnostic confidence but also advances the discussion regarding AI's role in augmenting diagnostic accuracy and reliability in clinical settings. This step forward is pivotal for the adoption of AI in routine clinical practices, ensuring that AI-supported diagnostics are both interpretable and verifiable by expert clinicians.

The findings from this research suggest that the sophisticated blending of neural network architectures offers a promising pathway to refining diagnostic processes in pathology. The architecture of the combined MobileNet and Xception models, enhanced with Grad-CAM for interpretability, provides a robust foundation that can potentially be adapted for a wide range of diagnostic tasks beyond lung and colon cancers. This adaptability is crucial for extending the model’s application to other types of cancer, such as breast, skin, or prostate cancers, where histopathological analysis plays a pivotal role in diagnosis.

Moreover, the modular nature of the proposed model framework allows for flexibility in tuning and retraining for different medical imaging tasks, such as MRI analysis, CT scans, or ultrasound image interpretation. By retraining the model on specific datasets corresponding to these tasks, or by integrating other specialized neural network layers tailored for different imaging modalities, the model can be made suitable for a broad spectrum of diagnostic applications.

While the current model demonstrates high accuracy and reliability in lung and colon cancer classification, its potential for generalization to other medical imaging tasks holds promise. With further development and validation, this approach could significantly contribute to the advancement of automated, precise, and reliable diagnostic processes across various domains of medical imaging. Future research could explore the application of this combined model framework to other types of cancer and more complex diagnostic tasks, potentially extending its utility to broader medical imaging applications. The pursuit of these advancements could pave the way for more automated, accurate, and reliable diagnostic processes, ultimately enhancing patient outcomes in oncology.