1 Introduction

Cancer is a leading cause of death globally, with nearly 10 million deaths in 2020 [38]. Around 70 to 80% of cancers arise in epithelial tissue cells and are called carcinomas. Squamous Cell Carcinoma (SCC) is the most common type of carcinoma in the head and neck region, skin, lungs, esophagus and cervix. Histopathological analysis of tissue biopsy remains the gold-standard approach for cancer diagnosis. Various clinicopathological features such as keratinization, keratin pearl formation and cell arrangement patterns are used for the initial screening of SCC [3, 5]. A keratin pearl is a form of keratinized structure that is frequently visible in low-grade SCC and rarely observed in high-grade SCC [3]. This feature is most clearly noticeable in low-magnification images (25x, 40x). Figure 1 depicts the formation of a keratin pearl in an SCC image captured at 40x magnification. In practice, pathologists examine various histopathological features in SCC samples at low and high magnifications (25x, 40x, 100x, 200x and 400x) for diagnosis. However, manual evaluation is time-consuming and suffers from inter- and intra-observer variability. Hence, there is an urgent need for automated quantification of keratinized area and keratin pearl detection from histopathological images. Early detection could undoubtedly improve the survival rate and enable better treatment. In this regard, a computer-assisted diagnostic system could help clinicians obtain faster and more objective diagnostic reports.

Fig. 1 Keratin pearl structure captured at 40x magnification

Several studies have applied traditional image processing and deep learning approaches to SCC detection and classification [4, 6, 7, 10, 13, 22, 28–30, 32, 33, 43, 45–47]. Although keratin pearl detection is an initial screening criterion in SCC diagnosis, to date only minimal attempts to detect keratin pearls have been reported in the literature [3, 5]. This study investigates keratin pearl detection using a single-stage object detection model rather than pixel-based deep networks.

We chose an object detection model because annotation with bounding boxes is much simpler and faster than pixel-based annotation. Moreover, for diagnostic decision-making it is often sufficient to identify certain histopathological features, such as keratin pearls, nuclei and mitotic figures, rather than to segment them pixel by pixel. This study applies object detection techniques to identify keratin pearls in histopathological images, aiming to assist pathologists in the detection of SCC. We modified the default RetinaNet architecture to detect small keratin pearls and to improve detection confidence. The primary contributions of this study are listed below.

  1. A novel object detection method for the detection of keratin pearls

  2. An attention module on top of the RetinaNet backbone network to support the detection of small-scale keratin pearls

The structure of the paper is outlined as follows. Section 2 presents a short summary of recent articles on SCC detection and classification. In Section 3, we discuss the materials and methods utilized in this study. Section 4 presents the results of the proposed method and its comparison with state-of-the-art models. In Section 5, the paper concludes with a summary of key findings.

2 Related work

In the literature, various approaches have been reported for the detection and classification of SCC arising in different organs [10, 22, 28, 30, 32, 43, 47]. Additionally, carcinoma features such as increased mitotic division [4] and irregular nuclear shape and size [13, 29, 33, 46] have been detected using machine learning algorithms and deep learning approaches [3, 5, 7, 10, 22, 28, 30, 32, 43, 45, 47]. However, only a couple of studies have addressed keratinization and keratin pearl detection, using the Chan-Vese model [3] and a Random Forest classifier with Gabor features [5].

Deep learning-based methods have been used to extract nuclei from SCC histopathological images [6, 7, 45]. Semantic segmentation models such as FCN, U-Net, SegNet, PSPNet and DeepLab have been exploited for microscopic feature segmentation in histopathological images [12, 16, 21, 31]. As ground truth, these models require pixel-based annotations, which are time-consuming to produce. Pixel-based segmentation may also be unnecessary when certain microscopic features, such as the presence of a keratin pearl, are sufficient to characterize early-stage SCC. In that case, object detection models can detect such features with a much simpler annotation format: drawing bounding boxes around objects saves considerable time. Most studies have applied object detection models such as Faster R-CNN, Mask R-CNN, YOLO, Single Shot MultiBox Detector (SSD) and RetinaNet to natural scene datasets, ImageNet, COCO and other object detection benchmarks. Recently, these models have been used in the medical field on different imaging modalities, such as endoscopy images, CT images, X-rays and histopathological images, to identify cells, nuclei, mitotic division, cancer lesions and malaria parasites, and for organ segmentation [1, 11, 17–19, 27, 35, 36, 41, 44].

Mask R-CNN combines the Faster R-CNN and FCN architectures. It provides object bounding boxes, classes and binary masks indicating the object's pixels within each bounding box. Jung et al. [18] and Sebai et al. [35] investigated Mask R-CNN for nuclei and mitotic cell segmentation, respectively, on histopathological images. However, while Mask R-CNN achieves good performance on natural images, its performance on medical image datasets is worse [36]. Hoorali et al. [14] introduced an automatic method to diagnose disease from microscopic images using an attention-based Mask R-CNN. Khan et al. [19] proposed a modified Faster R-CNN with dilated convolutions in the backbone to improve the detection of mitotic nuclei in histopathological images; they achieved an average precision of 50.31%. Further, the authors in [2] presented an automated firearms detection system for cargo X-ray images; in their work, RetinaNet outperformed the two-stage R-CNN model in detection performance while matching the speed of one-stage detectors such as YOLO and SSD. A few authors have also performed comparative studies between object detection models such as YOLOv3, YOLOv5, SSD and RetinaNet [8, 16, 20, 39]; their findings are presented in Table 1.

Table 1 also summarizes different techniques used to extract microscopic features from histopathological images. Most studies have utilized segmentation models for this purpose, but there is a growing trend toward Faster R-CNN and Mask R-CNN for feature extraction. This sparked our interest in exploring object detection models for analyzing histopathological images. The aim of object detection is to identify the precise location of objects within images and recognize their respective categories. In this study, we use the RetinaNet architecture to detect one of the significant SCC features, the keratin pearl, in histopathological images. Further, we revisit RetinaNet to address the specific challenge of detecting small keratin pearls and to increase the confidence of detected objects.

Table 1 Summary of literature

3 Materials and methods

Section 3.1 covers the experiment details followed by a summary of the RetinaNet model in Section 3.2. In Section 3.3, we provide a detailed architecture of the proposed methodology.

3.1 Experimental details

This section provides a detailed explanation of the experimental setup. Section 3.1.1 presents information on the image acquisition method and the total number of images used in the study. Section 3.1.2 describes the system requirements of the proposed architecture.

3.1.1 Image acquisition

The proposed keratin pearl detection workflow uses 101 images captured with a DP21 camera attached to an Olympus CX31 microscope. The images are collected at 40x magnification (objective lens: 4x; eyepiece: 10x) with a resolution of 1600x1200 pixels. The slides were collected from the pathology department of Kasturba Medical College (KMC), Manipal, India, and prepared with the Hematoxylin and Eosin (H&E) staining technique. The images are annotated using the LabelImg software [40] to mark bounding boxes around the keratin pearls. A total of 725 keratin pearl instances are present in the training and validation sets. We further increased the dataset size by applying data augmentation. The dataset is split into 90% for training and 10% for testing. We also detected keratin pearls in 100x magnified images to test the model's robustness.
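LabelImg stores each image's bounding boxes in Pascal VOC XML format. Below is a minimal sketch, assuming a hypothetical file path and a class label of `keratin_pearl` (both assumptions, not details from this study), of how such annotations can be read for training:

```python
# A minimal sketch of reading LabelImg Pascal VOC XML annotations into
# [xmin, ymin, xmax, ymax] boxes; the file path and class label are
# hypothetical placeholders.
import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    """Return a list of bounding boxes for the 'keratin_pearl' class."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        if obj.findtext("name") == "keratin_pearl":  # assumed class label
            bb = obj.find("bndbox")
            boxes.append([int(bb.findtext(t))
                          for t in ("xmin", "ymin", "xmax", "ymax")])
    return boxes

print(load_boxes("annotations/slide_001.xml"))  # hypothetical path
```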

3.1.2 System requirements

The experiments are performed with the CUDA 10.1 toolkit and cuDNN 7 library on a 64-bit Ubuntu 16.04 operating system, on a DELL PowerEdge R740 rack server with an Intel(R) Xeon(R) Gold 6136 CPU @3.00GHz, 128 GB RAM, 8TB+4TB RAID hard disk, an NVIDIA Tesla P-100 GPU (16 GB) and Python 3.7 in a JupyterHub multi-user environment. The object detection models are implemented using open-source libraries such as OpenCV, cuDNN and Keras with TensorFlow as the backend.

3.2 RetinaNet

This study presents keratin pearl detection using the RetinaNet model and compares the results with state-of-the-art object detection models and a traditional approach. Further, we modified the RetinaNet model by incorporating an attention mechanism. Because the default anchor scales and ratios are not effective for our dataset, we calculated anchor ratios and scales from the object sizes in the training set. We also replaced standard convolution with depth-wise separable convolution in the last layers of the RetinaNet model to significantly reduce the number of parameters.

As RetinaNet feature maps become deeper, the object's edge definition becomes blurred and the corresponding regression becomes weak. Moreover, deep feature maps have low resolution, which makes it challenging to detect small objects. Hence, we introduce an attention module that models the interdependence between the channels of the convolutional features, boosting performance at little extra computational cost.

This section describes the default RetinaNet architecture; the proposed model workflow is elaborated in Section 3.3.

RetinaNet [24] is a one-stage object detector developed to address the class imbalance that limits single-stage detectors such as YOLO and SSD. Its two crucial building blocks are the feature pyramid architecture and the focal loss. The model comprises two main components: (1) a backbone network that includes a feature extractor and a Feature Pyramid Network (FPN) [23], and (2) a subnetwork block that includes a classification network with the focal loss function [24] and a box regression network with the smooth \(L_1\) loss function. By default, RetinaNet uses ResNet-50 as the feature extractor, as shown in Fig. 2. It uses the c2, c3, c4 and c5 blocks of ResNet-50 for feature extraction; additionally, convolutional operations applied to the c5 block introduce two further blocks, c6 and c7. These blocks are incorporated into the FPN to compute feature maps at various scales. The FPN generates a multi-scale feature pyramid using a top-down pathway with lateral connections, which facilitates the detection of objects at different scales. We implemented the default RetinaNet model for keratin pearl detection, and we further propose a RetinaNet model with an attention module to address the specific challenge of detecting small keratin pearls and to improve the Intersection over Union (IoU) of detected objects.
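The following is a minimal Keras sketch of this top-down pathway with lateral connections; the 256-channel width and the exact layer arrangement follow the common RetinaNet convention from [23, 24] and are not taken from this paper:

```python
# A minimal sketch of the FPN top-down pathway with lateral connections.
# Assumes each backbone block halves the spatial resolution of the previous
# one, so a 2x upsampling aligns adjacent levels.
import tensorflow as tf
from tensorflow.keras import layers

def fpn(c3, c4, c5, channels=256):
    """Build P3-P7 feature maps from backbone blocks C3-C5."""
    p5 = layers.Conv2D(channels, 1)(c5)                        # lateral 1x1
    p4 = layers.Conv2D(channels, 1)(c4) + layers.UpSampling2D()(p5)
    p3 = layers.Conv2D(channels, 1)(c3) + layers.UpSampling2D()(p4)
    # 3x3 convolutions smooth the merged maps
    p3, p4, p5 = [layers.Conv2D(channels, 3, padding="same")(p)
                  for p in (p3, p4, p5)]
    # P6 and P7 come from strided convolutions on top of C5
    p6 = layers.Conv2D(channels, 3, strides=2, padding="same")(c5)
    p7 = layers.Conv2D(channels, 3, strides=2, padding="same")(
        layers.ReLU()(p6))
    return p3, p4, p5, p6, p7
```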

3.3 Proposed architecture

In the proposed architecture, the detector is a modified RetinaNet model, depicted in Fig. 3. The proposed model incorporates an attention module on top of the backbone network, discussed in Section 3.3.1. To calculate anchor scales and ratios, we utilize the differential evolution search algorithm [37]. In addition, we use depth-wise separable convolution, which requires fewer multiplications than standard convolution and reduces the model parameters, as explained in Section 3.3.2. The loss functions are discussed in Section 3.3.3, and the optimization of the RetinaNet hyperparameters in Section 3.3.4. Overall, this section provides a detailed explanation of the proposed RetinaNet model design, including the various steps followed to create it.

Fig. 2 RetinaNet architecture with ResNet-50 backbone

Fig. 3 The proposed model architecture

3.3.1 Attention module

Our proposed study involves a channel-based attention module that identifies the inter-dependencies among feature channels. The Squeeze-and-Excitation Networks [15] served as a reference for the proposed module. The module extracts a degree of importance for each feature channel by reducing the channel's feature map to a single numerical value. We perform pooling operations to reduce the spatial dimensions of the feature maps; this study uses both global max pooling and global average pooling. Max pooling preserves the most activating pixels, while average pooling produces a smoothed average of all pixels. The pooling results are then fed into a fully connected multi-layer perceptron bottleneck structure to learn adaptive scaling weights for the channels. Figure 4 shows the proposed attention module architecture. It comprises six steps, described below; a minimal code sketch follows the list.

Fig. 4 Attention module

  • Step 1: Generate the feature maps by employing the backbone network.

  • Step 2: Perform the global max pooling and global average pooling operations on the feature maps (also known as feature channels) of the \(C_3\), \(C_4\), \(C_5\), \(C_6\) or \(C_7\) block.

  • Step 3: Add the results obtained from the pooling operations as shown in Equation (1).

    $$\begin{aligned} z_k = F_a + F_b \end{aligned}$$
    (1)

    where \(z_k\) is a scalar value that we call the feature descriptor of the \(k^{th}\) channel. \(F_a\) and \(F_b\) are defined in Equations (2) and (3).

    $$\begin{aligned} F_a= \frac{1}{WH}\sum _{i=1}^{W} \sum _{j=1}^{H} c^{[block]}_k(i,j) \end{aligned}$$
    (2)
    $$\begin{aligned} F_b= \max _{1\le i\le W,\; 1\le j\le H} c^{[block]}_k(i,j) \end{aligned}$$
    (3)

    In Equations (2) and (3) we perform the global average pooling and global max pooling operations, respectively. W and H represent the width and height of the feature map, and \(c^{[block]}_k\) is the \(k^{th}\) feature map (channel) of the specified backbone block, namely \(C_3\), \(C_4\), \(C_5\), \(C_6\) or \(C_7\). Steps 2 and 3 are applied to each channel of the feature maps separately. The output tensor is of size \((1\times 1 \times C)\), essentially a vector of length C in which each feature map is reduced to a single value.

  • Step 4: Calculate adaptive scaling weights for each channel using a fully connected multi-layer perceptron bottleneck structure. In Fig. 4, the first dense layer acts as a dimensionality reduction layer with a hyperparameter r as the reduction factor, set to 32 in this experiment; this hyperparameter is thoroughly described in [15]. A Swish activation function is then applied. We use Swish instead of ReLU because ReLU zeroes out negative values, which may still help capture the pattern underlying the image; Swish ensures that negative values are not zeroed out [34].

  • Step 5: Another dense layer restores the channel dimension to size \(1\times 1 \times C\). This is followed by a sigmoid activation function, which provides each channel with a smooth gating function and scales the values to the range 0 to 1 [15].

  • Step 6: The final output is obtained by performing element-wise multiplication between the feature maps in the input tensor and their respective learned weights from the previous step.
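The following is a minimal Keras sketch of Steps 1-6, assuming the reduction factor r = 32 stated in Step 4; the layer choices are ours, not a verbatim reproduction of the implementation:

```python
# A minimal sketch of the channel-attention module described in Steps 1-6.
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(feature_map, r=32):
    """Scale each channel of `feature_map` by a learned weight in [0, 1]."""
    c = feature_map.shape[-1]
    # Steps 2-3: squeeze each HxW map into a length-C descriptor z_k by
    # adding the global average- and max-pooled values
    z = layers.GlobalAveragePooling2D()(feature_map) + \
        layers.GlobalMaxPooling2D()(feature_map)
    # Step 4: bottleneck MLP with reduction factor r and Swish activation
    s = layers.Dense(c // r, activation="swish")(z)
    # Step 5: restore the channel dimension and gate with a sigmoid
    s = layers.Dense(c, activation="sigmoid")(s)
    # Step 6: element-wise rescaling of the original feature maps
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([feature_map, s])
```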

The attention module is integrated between the feature extractor and the FPN. We consider ResNet-152 as the backbone; with default parameters it achieved an mAP of 88.45%, better than all other backbone networks considered in this study. The attention module on top of the backbone network is shown in Fig. 5. The blocks of the backbone network are connected to the FPN as \(C_3, C_4, C_5, C_6\) and \(C_7\), generating the feature pyramid layers \(P_3, P_4, P_5, P_6\) and \(P_7\); the \(C_2\) block is not connected to the FPN because of its large memory consumption. The FPN propagates rich features to lower stages via the top-down pathway. We apply the proposed attention module on top of ResNet-152, between the \(C_5\) and \(P_5\) feature maps. The \(P_6\) feature maps are also computed from the \(C_5\) block using the attention module's output. In this way, the global contexts extracted by the attention module enter at \(P_5\) and are propagated to \(P_6\) and \(P_7\). The attention module could also be added at \(C_3\) and \(C_4\); however, we observed during the experiments that the best performance is achieved when it is incorporated between \(C_5\) and \(P_5\).

Fig. 5 RetinaNet with the attention module

3.3.2 Depth-wise separable convolution

Depth-wise separable convolution factorizes the standard convolution by separating spatial filtering from feature generation; it therefore comprises separate depth-wise and point-wise layers. Unlike standard convolution, which operates across all C channels simultaneously, the depth-wise operation convolves each channel individually, as shown in Fig. 6, producing an output of size \(H\times W\times C\) for C channels with \(N1\times M1\times 1\) kernels. The point-wise convolution then applies a 1x1 convolution across the C channels; with K such filters, the output size becomes \(H\times W\times K\). In this study, the last layers of the FPN (the C6 and C7 blocks) in the modified RetinaNet model are replaced with depth-wise separable convolution. This improves convolution efficiency and significantly reduces the model's training parameters [25].
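As a sanity check on the parameter savings, the following sketch compares the parameter counts of a standard convolution and its depth-wise separable counterpart for an illustrative 256-channel feature map (the sizes are examples, not the model's actual dimensions):

```python
# Compare parameter counts of standard vs. depth-wise separable convolution.
import tensorflow as tf
from tensorflow.keras import layers, Input, Model

x = Input(shape=(32, 32, 256))
std = Model(x, layers.Conv2D(256, 3, padding="same")(x))
sep = Model(x, layers.SeparableConv2D(256, 3, padding="same")(x))
print(std.count_params())  # 3*3*256*256 + 256           = 590,080
print(sep.count_params())  # 3*3*256 + 256*256 + 256     =  68,096
```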

Fig. 6 Depth-wise separable convolution operation

3.3.3 Loss functions

Focal loss

One-stage detectors such as SSD and YOLO suffer from class imbalance problems. Focal loss is an extension of cross-entropy loss for classification: during training, it emphasizes the foreground classes by down-weighting the loss contribution of the background classes [24]. It is expressed in (4).

$$\begin{aligned} FL(p_t) = -\alpha _t (1-p_t)^\gamma \log (p_t) \end{aligned}$$
(4)

where \(\gamma \) is a focusing parameter that increases the emphasis on foreground classes and \(\alpha \) is a per-class weighting factor. When \(\gamma = 0\), focal loss reduces to the cross-entropy loss function. In this study, we utilized the default values of 0.25 for \(\alpha \) and 2 for \(\gamma \), which yielded satisfactory results.
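A minimal sketch of Equation (4) for binary classification, using the \(\alpha = 0.25\) and \(\gamma = 2\) defaults mentioned above, might look as follows:

```python
# A minimal sketch of the focal loss FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t)
# for binary labels; y_pred holds predicted probabilities in (0, 1).
import tensorflow as tf

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
    return -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
```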

Smooth L1-loss

Smooth L1-loss [9], a combination of the L1 and L2 loss functions, is adopted for box regression. Equation (5) defines the smooth L1-loss, where y denotes the difference between the ground truth and the predicted value.

$$\begin{aligned} Smooth_{L1}={\left\{ \begin{array}{ll} 0.5y^2, & \text {if } |y|<1\\ |y|-0.5, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)
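Equation (5) translates directly into code; the following sketch applies it element-wise to the regression residuals:

```python
# A minimal sketch of the smooth L1 loss in Equation (5), where y is the
# difference between ground-truth and predicted box coordinates.
import tensorflow as tf

def smooth_l1(y):
    abs_y = tf.abs(y)
    return tf.where(abs_y < 1.0, 0.5 * tf.square(y), abs_y - 0.5)
```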

3.3.4 RetinaNet default parameter optimization

Instead of training the model with default parameters, we examined different hyperparameters to improve the performance of the proposed model. We set the initial learning rate to \(10^{-4}\) with a lower bound of \(10^{-9}\) using the ReduceLROnPlateau callback. On-the-fly data augmentation techniques were used. The number of epochs was set to 50 because RetinaNet converges quickly, so a small number of epochs usually suffices. By default, the model resizes images so that the shortest side is 800 px; if the longest side exceeds 1333 px, the image is resized to 1333 px. We selected an IoU threshold of 0.5 because an IoU of 0.5 is normally considered a good prediction.
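A sketch of this learning-rate schedule in Keras is given below; the monitored quantity, reduction factor and patience are assumptions, and `model` and `train_generator` are placeholders for the detector and data pipeline:

```python
# A sketch of the training schedule described above: initial LR 1e-4 is set
# in the optimizer, and ReduceLROnPlateau lower-bounds it at 1e-9.
from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_callback = ReduceLROnPlateau(monitor="loss", factor=0.1,
                                patience=2, min_lr=1e-9)
# model.fit(train_generator, epochs=50, callbacks=[lr_callback])
```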

Anchor boxes Anchor boxes were first introduced in Faster R-CNN and later adopted by YOLO, SSD and RetinaNet. Because it is difficult to generate region proposals for objects of varied shapes, the anchor configuration plays a crucial role in object detection. RetinaNet uses default anchor boxes with fixed ratios and scales, which may not be effective for all types of objects: by default it uses anchors of sizes 32, 64, 128, 256 and 512 with different aspect ratios and scales at the pyramid levels P3 to P7 [23], for a total of 9 anchors per level. We considered two algorithms, k-means clustering and the differential evolution search algorithm [37], to obtain three ratios and three scales of anchor boxes for our dataset. The ratios and scales generated by the differential evolution search resulted in a better mAP than those from k-means clustering. Our goal was to find the anchor setting that maximizes the overlap between the ground-truth bounding boxes and the anchor boxes. We obtained optimal scales of 1.093, 1.382 and 1.73 and ratios of 0.662, 1.0 and 1.509, and used these values to train the detector.
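The following sketch illustrates the objective behind such a search: given candidate scales and ratios, it scores the best achievable IoU between the generated anchors and a set of ground-truth box sizes. The box sizes, bounds and anchor parameterization (ratio read as width/height) are illustrative assumptions, not the exact setup of [37]:

```python
# A minimal sketch of an anchor-overlap objective optimized with
# differential evolution; box sizes are illustrative.
import numpy as np
from scipy.optimize import differential_evolution

gt_wh = np.array([[90, 70], [140, 150], [60, 95]])  # ground-truth (w, h)

def neg_mean_best_iou(params, base_size=128):
    scales, ratios = params[:3], params[3:]
    # enumerate anchor widths/heights at one pyramid level
    anchors = np.array([[base_size * s * np.sqrt(r),
                         base_size * s / np.sqrt(r)]
                        for s in scales for r in ratios])
    # IoU of center-aligned boxes: overlap of widths times heights
    inter = (np.minimum(gt_wh[:, None, 0], anchors[None, :, 0]) *
             np.minimum(gt_wh[:, None, 1], anchors[None, :, 1]))
    union = gt_wh.prod(1)[:, None] + anchors.prod(1)[None, :] - inter
    return -np.mean((inter / union).max(axis=1))  # maximize best IoU

bounds = [(0.5, 2.0)] * 3 + [(0.3, 3.0)] * 3  # 3 scales, 3 ratios
result = differential_evolution(neg_mean_best_iou, bounds, seed=0)
print(result.x)  # optimized scales and ratios
```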

4 Results and discussion

In this section, we present the outcomes of our experiments, as well as some limitations of our system and potential solutions to overcome them. We describe the performance metrics employed in this study in Section 4.1. Experimental results are shown and comparative analysis with state-of-the-art methods is provided in Section 4.2. Finally, we discuss the overall findings of our study and its limitations in Section 4.3.

4.1 Performance metrics

Standardized evaluation metrics should be used to measure the overall performance of an object detection model; the article [26] provides a detailed discussion of such metrics. We established an IoU threshold of 0.5 for the Precision (PR), Recall (RE), Miss Rate (MR), False Positives Per Image (FPPI) and mAP metrics. The model is tested on a dataset containing 87 keratin pearl instances, and the evaluation is based on the numbers of True Positive (TP), False Positive (FP) and False Negative (FN) detections. A TP is a correctly detected keratin pearl with the correct label, an FP is a detected object that is not a keratin pearl, and an FN is a keratin pearl that the model failed to detect. As the dataset does not contain negative instances, there is no true negative class to consider. PR, RE and F1-score are crucial metrics in the medical field: PR measures the ratio of correct predictions to all detections, RE measures the ratio of correct predictions to all possible detections, and the F1-score reflects the balance between PR and RE. Lower values of MR and FPPI indicate better performance. The TP, FP and FN values of our proposed model and the state-of-the-art models are illustrated in Fig. 7.
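A minimal sketch of these metric computations from the TP/FP/FN counts is given below; the counts shown are illustrative, not our results:

```python
# Detection metrics from TP/FP/FN counts at a fixed IoU threshold.
def detection_metrics(tp, fp, fn, num_images):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    miss_rate = fn / (tp + fn)   # MR = 1 - recall
    fppi = fp / num_images       # false positives per image
    return precision, recall, f1, miss_rate, fppi

print(detection_metrics(tp=80, fp=5, fn=7, num_images=10))  # illustrative
```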

Fig. 7 Truth and predicted value representation of test set

Fig. 8 Keratin pearl detection using default RetinaNet and proposed model

4.2 Results

In this experiment, we validated our proposed model and conducted a comparative analysis with default object detection networks, namely YOLOv3, SSD, YOLOv5-S, YOLOv5-M, YOLOv5-L, YOLOX and RetinaNet, as described in Section 4.2.1. Further, we compared the results with traditional image processing approach proposed by Das et al. [3], which is discussed in Section 4.2.2.

Fig. 9 Keratin pearl detection on 100x magnification images using proposed model

Fig. 10 mAP curves of the default model without augmentation, default model with augmentation and proposed model

4.2.1 Qualitative analysis

Figure 8 shows keratin pearl detection on test images with the default RetinaNet and the proposed model, respectively. The yellow circles in Fig. 8 mark keratin pearls detected by the proposed model but missed by the default RetinaNet. The proposed model detected small keratin pearls and also improved the confidence of the detected objects. We also tested on normal tissue images containing no keratin pearls; the model did not detect any other object as a keratin pearl, so the images shown in Fig. 13 (d7) and (e7) have no bounding boxes. Although the RetinaNet model is trained on 40x images, it is also able to detect keratin pearls in images captured at 100x magnification, as shown in Fig. 9. When the same area is visualised at higher magnification, the keratin pearl appears larger; this supports our claim that a model trained at low magnification can also detect keratin pearls at higher magnification. We used the TensorBoard visualization toolkit to evaluate model performance; the mAP and loss curves are depicted in Figs. 10 and 11. The mAP curve demonstrates that the proposed model attains an mAP score exceeding 90% by 35 epochs.

We performed a comparative analysis of the proposed model against state-of-the-art models and present the results in Figs. 12 and 13. The compared models are YOLOv3, YOLOv5 (small (S), medium (M), medium with layers 1 to 15 frozen (M frozen) and large (L)) and YOLOX. Our findings show that, unlike the other models, YOLOv3 did not perform well in detecting keratin pearls. YOLOv5 outperformed YOLOv3 by detecting more keratin pearls; however, the proposed model detected more keratin pearls than YOLOv5. YOLOX detected almost all keratin pearls, similar to our proposed model, but it also detected eosinophilic cytoplasm and blood vessels as keratin pearls, which is a drawback. The main issue with the existing models is that they tend to predict blood vessels as keratin pearls in normal images, which is not observed in our proposed model. In Figs. 12 and 13, missed keratin pearl detections are marked in black, false positive detections in yellow boxes and improper predictions in green boxes.

Data augmentation applies a series of image transformations to the original images, creating variations that differ from the originals. We applied an on-the-fly data augmentation technique in which random transformations, such as rotation, translation, shearing, scaling and flipping, as well as variations in brightness, contrast, hue and saturation, are applied to the input images to produce augmented data. This technique allows the model to learn from a larger and more diverse set of training images, which improves its performance and generalization ability and helps avoid overfitting. The proposed model with augmentation improves the mAP value compared with the default RetinaNet model with augmentation, as depicted in Fig. 10.
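A minimal sketch of such an on-the-fly pipeline is given below, using the albumentations library as one possible implementation; the parameter ranges are assumptions and cover only a subset of the transformations listed above:

```python
# An on-the-fly augmentation pipeline that transforms images and their
# bounding boxes together; each call yields a new random variant.
import albumentations as A

augment = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1,
                           rotate_limit=0, p=0.5),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.5),
        A.HueSaturationValue(p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
# out = augment(image=img, bboxes=boxes,
#               labels=["keratin_pearl"] * len(boxes))
```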

Fig. 11 Loss curves of the default model and proposed model

Fig. 12 Keratin pearl detection using state-of-the-art models

Fig. 13 Keratin pearl detection using state-of-the-art models and proposed method

Finally, we compared our results with the traditional segmentation method proposed by Das et al. [3]; the segmentation result is shown in Fig. 14. We chose this method for comparison because very few articles have been reported on keratin pearl detection. Figure 14(b) shows the ground-truth bounding boxes in black; the red regions show the segmented output and the arrows indicate false positive regions. Comparing the results visually, the approach of Das et al. fails to detect all keratin pearls, small ones in particular, and it also picks up some regions that are not keratin pearls. In contrast, the proposed model detected keratin pearls more accurately and with good confidence scores, as shown in Fig. 8. Moreover, for SCC diagnosis it is essential to detect keratin pearls rather than to segment them: an image is classified as well-differentiated SCC if many keratin pearls and malignant cells are present, and failing to detect all keratin pearls in an image may lead to incorrect grading of SCC. Hence, our detection approach would be more feasible than traditional approaches even for SCC grading.

Fig. 14 Results obtained using Chan-Vese segmentation method

4.2.2 Quantitative analysis

Table 2 reports results for RetinaNet with different backbone networks at various IoU thresholds, with their respective mAP values. Among these networks, ResNet-152 achieved the highest mAP. To enhance the model's performance, an attention module was introduced and the hyperparameters were fine-tuned. As a result, the proposed model's mAP improved by approximately 4% and it could detect more keratin pearls with higher confidence scores per image, as shown in Fig. 13. In the original RetinaNet model, the backbone network comprises six blocks: four ResNet blocks plus two more blocks obtained by applying standard convolution operations to the C5 block, the last block of ResNet. In this study, we replaced the standard convolution operations in the last two blocks, C6 and C7, with depth-wise separable convolution in order to minimize the number of model parameters. Table 3 shows the outcomes after using depth-wise separable convolution in the C6 and C7 blocks; this change neither degraded the detection performance nor increased the computational cost.

We conducted a comparative analysis of the proposed model with and without data augmentation and found that the proposed model with augmentation outperformed the default RetinaNet model. When the RetinaNet model is trained with augmentation, it achieves an mAP of 88.45 ± 0.06%, whereas the proposed model, which incorporates depth-wise separable convolution and augmentation, achieves an mAP of 92.63 ± 0.05%, an improvement of around 4% over the default model. The results of this experiment are presented in Table 3.

Table 2 RetinaNet model with different backbones
Table 3 Default RetinaNet and proposed model results

We considered various YOLO family models with a transfer learning approach, including YOLOv3, YOLOv5-S, YOLOv5-M, YOLOv5-M with the first 15 layers frozen, YOLOv5-L and YOLOX. We used the default settings with 50 epochs and applied on-the-fly data augmentation. Among these, YOLOv5-L achieved a slightly better mAP than the proposed model; however, the proposed model reached an mAP above 90% by 35 epochs using early stopping. The training time for YOLOv5, YOLOX and RetinaNet is around half an hour, while YOLOv3 took several hours to train on our dataset. We also implemented the Single Shot MultiBox Detector (SSD) model, but the results were not satisfactory. Table 4 presents the mAP of the various models at IoU 0.5 with their default feature extractors.

Table 4 Various models and its mAP values

Further, we evaluated the models on a test dataset containing 87 keratin pearl instances; the resulting performance metrics are tabulated in Table 5. Compared with YOLOv5 and YOLOX, YOLOv3 performed poorly, with fewer TP and more FN cases. YOLOv5-M reported a higher PR than YOLOX, but the proposed model achieved a much better PR than all other models. The MR of the proposed model and YOLOX are nearly identical, but the proposed model's FPPI is lower than that of the other models. Although YOLOX achieved slightly better recall than the proposed model with fewer FN cases, the F1-score, the harmonic mean of precision and recall, is higher for the proposed model than for all compared models.

We also compared the inference speed of each model for keratin pearl detection. Among the models tested, YOLOX had the fastest inference speed, outperforming all other models in this regard. However, for detecting keratin pearls on static images, the performance metrics listed in Table 5 are more significant than speed. Giving these metrics priority over speed, we found that the proposed technique for identifying keratin pearls outperformed the state-of-the-art methods.

Table 5 Comparison of performance with various object detection models

We implemented depth-wise separable convolution instead of standard convolution in the feature extraction layers. Although there is a small difference in mAP, the detection accuracy was not affected. We also list the number of parameters used in each experiment: from Table 3, it is clear that the model parameters are reduced by 6.2M after using the depth-wise separable convolution network.

We also developed two additional models with varied subnetwork filter sizes to show the importance of filter size in detection. Table 6 presents the results of these three models. With a small kernel size, the model learns more complex and non-linear features than with a larger kernel size; hence, in this study the 3x3 kernel provided better results than the 5x5 and 7x7 kernels.

Table 6 RetinaNet model results with varied filter sizes

4.3 Discussion

Our primary objective is to introduce an object detection technique for microscopic feature detection, since detecting features is sometimes more important than segmenting them: pixel-based annotation, which segmentation requires, is a time-consuming and tedious task, whereas the bounding boxes used by object detection are quicker and easier to annotate. We proposed an object detection model to detect one of the crucial microscopic features of SCC, namely keratin pearls. The proposed model is evaluated using metrics such as MR, FPPI, PR, RE, F1-score and inference time, and its results are compared with existing state-of-the-art object detection models implemented on the same dataset. The results indicate that the proposed model can effectively detect keratin pearls. The model was also tested on negative images containing no keratin pearls, and it did not detect any object; in contrast, all the state-of-the-art methods detected blood vessels as keratin pearls, which are false positive cases.

We conducted additional experiments to test the robustness of our model on 100x magnification images and compared the results with other methods, as depicted in Fig. 15. One limitation of this study is that our model detected some blood vessels in the 100x magnification images; however, the number of blood vessels detected by our model is relatively low, whereas all the other methods detected almost all blood vessels as keratin pearls. This ambiguity in the results could be reduced by incorporating negative cases during training.

Fig. 15 False positive detection on 100x magnification images

The proposed model is trained on images captured with a specific microscope and camera setup, the Olympus CX31 microscope with an attached DP21 camera, referred to as M1. During testing, however, images from two different microscopy setups are used: M1 and the Olympus BX51 microscope with an attached DP80 camera, referred to as M2. The study revealed higher rates of FP and FN in images captured using M2 than in those from M1. This could be due to differences in resolution settings and lighting conditions between the two microscopes, which affect image quality and clarity.

In the future, the mean average precision (mAP) of the model could be enhanced by adding more images to the training set, captured with different microscopes and magnification levels, to increase the model's robustness. Further investigation could also determine how different microscopes and magnification levels impact the performance of the model. Moreover, while we have addressed a significant characteristic of SCC, identifying other microscopic features is also essential for grading SCC into different classes.

Further, this study attempted to improve the results by changing the filter size, but no noticeable improvement was observed. In the future, we plan to conduct additional experiments by modifying other parameters and loss functions.

5 Conclusion

SCC is a prevalent cancer type that can develop in various organs of the human body. Histopathology is commonly used to diagnose cancer, and previous studies have focused on classifying SCC against other types of carcinoma. In contrast, this study utilized an object detection model to identify one of the critical features of SCC. Our objective is to demonstrate the effectiveness of object detection methods in the medical field. Prior research has typically employed traditional machine learning or pixel-based CNN models to identify histopathological features of carcinoma; we instead introduced an object detection model to detect keratin pearls, a hallmark of SCC diagnosis, in histopathological images. We initially used a RetinaNet model to detect this feature and further improved its performance by adding an attention module. Compared with state-of-the-art methods, our model achieved a nearly 4% improvement over the default RetinaNet model, with minimal false positive instances.