1 Introduction

Transformers have contributed significantly to the development of speech recognition and Natural Language Processing (NLP) since their introduction. Although Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks served sequence modeling and NLP well, transformers have recently outperformed these earlier sequence modeling methods. Not confined to sequence-oriented problems, vision transformers have also begun to play a role in computer vision, where they are now used mainly for image classification [1], object detection [2], and semantic segmentation [3]. The power of transformers lies in the multi-head self-attention mechanism, which, following NLP practice, is applied to patches: the image is divided into a sequence of fixed-size patches. Compared with Convolutional Neural Networks (CNNs) [4], vision transformers are better at capturing long-range dependencies, and their extracted features carry richer semantic information. Although transformers are considered a strong alternative to CNNs, several aspects still need improvement. In T2T-ViT [5], the embedded patch is decomposed by iteratively aggregating neighboring tokens to enhance local detail. Transformer in Transformer (TNT) refines representations at a finer grain inside each patch. The Pyramid Vision Transformer (PVT) architecture is divided into four stages and includes a feature pyramid to support dense prediction.

Convolutional Neural Networks (CNNs) are highly advantageous for image and spatial data analysis: they automatically extract hierarchical features through shared weights and convolutions, providing translation invariance and reducing manual feature engineering. Multi-layer Perceptrons (MLPs), on the other hand, offer flexibility for tasks involving structured data because they can capture intricate non-linear relationships, making them suitable for regression and classification, particularly on tabular data, though they may require more data and careful feature engineering to match CNNs.

All of these techniques use the same fixed-size patch embedding, under the assumption that splitting an image into fixed patches suits a wide range of images. Despite its convenience, fixed-size patching has two major drawbacks: (1) loss of the image's local structure and (2) loss of its semantic structure. Regarding the former, a fixed-size patch, say 16 × 16, cannot capture whole objects of varying sizes within an image. Regarding the latter, an image containing several instances of the same object at different angles and dimensions is not represented efficiently by fixed-size patches, causing a loss of semantic information. Moreover, fixed-size patches cannot capture objects at different scales within an image, which degrades model performance.

To address these problems, we propose a novel patching method called intelligent patching, hereafter Intel-Patch, which splits an image into unfixed, semantics-based patches so that object semantic information is preserved. Patches of varying size and aspect ratio, chosen according to object size, ensure that semantic content is retained. To perform this intelligent patching, the input feature map is used to predict object scales and offsets. The proposed Intel-Patch module has fewer parameters (self-attention, model dimension, number of heads, etc.) and fewer computations than PVT [6], yielding a lightweight and fast solution. Intel-Patch is designed as a complete module that can be attached to, detached from, and used to augment any vision transformer. A transformer augmented with the Intel-Patch module is called an Intelligent Patch-based Pyramid Vision Transformer (IntelPVT). In this work we enhance the Pyramid Vision Transformer (PVT) by integrating Intel-Patch. PVT currently performs well in terms of speed and accuracy for pixel-level detection and segmentation, but it still lacks an intelligent image-patching mechanism; IntelPVT fills this gap. Because it splits an image according to its contextual features, IntelPVT can overcome these limitations of PVT.

This paper is structured as follows: Section 1 provides a brief introduction to the topic under research. Section 2 is the literature review, where previous relevant research is discussed. Section 3 illustrates our proposed methodology; algorithms and pseudo code are also given in this section. Section 4 provides the experimental results of the proposed algorithm on various datasets. Section 5 concludes the paper with a discussion of the limitations of our method and future work.

2 Literature review

Transformers, prominent in NLP, have made their mark in vision as well, supplanting CNNs in several tasks. Carion and Massa introduced the Detection Transformer (DETR), which treats object detection as a set prediction problem and removes manual steps such as non-maximum suppression and anchor generation. DETR leverages global context and employs a set-based loss with an encoder-decoder structure. Notably simple, it matches Faster R-CNN's COCO performance, and its design extends uniformly to panoptic segmentation [2]. Qi and Dai [4] introduced geometry-transforming methods to surpass CNNs' fixed structures: deformable convolution and adaptable Region of Interest (ROI) pooling are novel modules that enhance transformation modeling by adding offsets to spatial sampling positions, learned without extra supervision. These units replace their standard counterparts in CNNs seamlessly and train through ordinary backpropagation. Evaluated extensively on complex tasks such as object detection and semantic segmentation, the approach demonstrates notable success.

Li et al. [5] introduced a novel neural architecture that uses attention mechanisms to encode input data into valuable features. The input is divided into small patches, enabling representation calculation and comparison. Managing the intricate details, color, and contrast of natural images to isolate objects of diverse sizes and locations requires focusing on local patches. The proposed design, Transformer in Transformer (TNT), subdivides large patches ("visual sentences") into smaller visual words, enhancing the representation by aggregating word- and sentence-level features. Experiments across datasets show TNT's effectiveness, achieving 81.5% top-1 accuracy on ImageNet, surpassing the state of the art by approximately 1.7%.

Wang and Xie [6] introduced the Pyramid Vision Transformer (PVT), a powerful and flexible backbone for dense prediction compared with recent Transformer-based classifiers. Unlike ViT, PVT uses fine-grained image patches and a progressive shrinking pyramid, maintaining high-resolution feature maps while keeping computation manageable. It combines the benefits of Transformers and CNNs and performs well across vision tasks. Rigorous experiments highlighted its superiority, improving accuracy in object detection, semantic segmentation, and instance segmentation. Notably, PVT with RetinaNet outperformed RetinaNet + Residual Network (ResNet50) on the Common Objects in Context (COCO) dataset, scoring 40.4 AP (Average Precision) versus 36.3 AP. PVT thus emerges as a promising alternative for pixel-level prediction, accelerating research endeavors.

Convolutional networks have proven versatile in diverse computer vision tasks, leading to substantial advancements since their mainstream adoption around 2014. Model depth directly affects task quality and computational cost, making effective use of computation key; optimized convolutions and robust regularization further enhance results. On the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 validation set, the reported top-1 error was 21.2% and the top-5 error 5.8%. Despite modest computational expense (5 billion multiply-adds and under 25 million parameters), an ensemble with multi-crop evaluation achieved an impressive 3.5% error rate on the validation set (3.7% on the test set), along with a 17.3% top-1 error rate on the validation set [7].

Chu, Tian, and colleagues [8] introduced the Conditional Positional Encoding (CPE) scheme for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of the tokens, CPE generates encodings dynamically, conditioned on the local context of the input tokens. This enables CPE to adapt to input sequences longer than those seen during training, improving generalization. In image classification, CPE preserves the desired translation invariance, leading to improved classifier accuracy. CPE is integrated seamlessly via a simple Positional Encoding Generator (PEG), yielding the Conditional Position encoding Vision Transformer (CPVT). Interestingly, the attention maps of CPVT and of learned positional encodings are visually similar. Comparative results show that conditional positional encoding achieves superior ImageNet classification performance compared with traditional vision transformers.

D’Ascoli et al. [9] explored combining the sample efficiency of convolutional architectures with the flexibility of self-attention for image processing. They introduced Gated Positional Self-Attention (GPSA), which fuses a convolutional inductive bias with the adaptability of positional self-attention. In image classification, self-attention enabled Vision Transformers (ViTs) to outperform CNNs, highlighting the importance of large external datasets and pre-training. Separately, Ghiasi et al. [10] advanced image classification and object detection through automated augmentation strategies. By using a single distortion magnitude for all operations, they achieved efficient augmentation without a separate proxy task, reducing computational cost while still outperforming prior policies on multiple datasets. EfficientNet-B7 and EfficientNet-B8 achieved significant improvements in ImageNet accuracy, and their approach boosted object detection as well.

In species differentiation, challenges such as precise localization, fine-grained feature learning, and class distinction exist. RA-CNN (Recurrent Attention Convolutional Neural Network) integrates region detection and fine-grained learning so that each reinforces the other. Its Attention Proposal sub-Networks (APN) enable iterative refinement of region attention. RA-CNN excels without bounding-box annotations, boosting fine-grained task accuracy by 3.3% (Birds), 3.7% (Stanford Dogs), and 3.8% (Stanford Cars) [11]. Meanwhile, Gao, Zheng, and Wang introduced SMCA (Spatially Modulated Co-Attention) to accelerate DETR convergence. By enhancing the co-attention mechanism, SMCA significantly speeds up training of the DETR framework while maintaining accuracy (45.6 mAP at 108 epochs). DETR, a transformer-based model for object detection, benefits from SMCA's improved decoder [12].

In [13], He et al. introduced deep residual learning, which eases the training of very deep neural networks and yields deep nets with significantly fewer errors; on ImageNet, a 152-layer residual net achieved only 3.6% error. In [14], Hu, Shen, and Sun introduced the Squeeze-and-Excitation (SE) block for improved feature learning; SE blocks integrate effectively into existing architectures, drastically reducing errors and advancing classification accuracy by 30%. In [15], Lin, Goyal, and Girshick proposed Focal Loss to alleviate the foreground-background imbalance in dense object detectors; RetinaNet, equipped with Focal Loss, surpassed two-stage detectors' accuracy while matching one-stage detectors' speed.

Liu, Lin, and Cao [16] introduced the Swin Transformer, a versatile computer vision backbone that addresses the differences between text and images. Its shifted-window hierarchical structure combines local windowing with cross-window connections, enabling efficient self-attention computation while accommodating varying image sizes. Swin Transformer excels in tasks such as image classification (87.3% top-1 accuracy on ImageNet-1K) and object detection (58 box AP and 50 mask AP on COCO test-dev, 53.5 mIoU on ADE20K val). Ren, He, and colleagues [17] highlighted the significance of region proposals in advanced object detection networks, introducing the Region Proposal Network (RPN). The RPN shares convolutional features with the detection network, enabling nearly cost-free region proposals that guide the unified Faster R-CNN network, contributing to wins in the ILSVRC and COCO competitions.

Srinivas and Lin [18] introduced BoTNet (Bottleneck Transformer Network), a potent self-attention-based backbone architecture for diverse computer vision tasks. By replacing the spatial convolutions in a ResNet's last three blocks with global self-attention, BoTNet significantly improves instance segmentation and object recognition while maintaining computational efficiency, demonstrating performance comparable to ResNet and achieving 44.4% Mask AP and 49.7% Box AP on COCO instance segmentation. Zheng and Lu [19] proposed SETR (Segmentation Transformer), a sequence-to-sequence Transformer for semantic segmentation. SETR encodes images as sequences of patches with a transformer encoder and a simple decoder, achieving state-of-the-art results on the ADE20K (50.28% mIoU, Mean Intersection over Union) and Pascal Context (55.83% mIoU) datasets. Further work within the same scope can be found in [20,21,22,23,24].

3 Proposed methodology

Vision transformers, which have already achieved benchmark milestones in computer vision and object recognition, face a serious problem: splitting an image into fixed-size patches discards semantic object information and degrades results. To counter this, we propose a novel patch splitting and embedding technique, intelligent patch-based splitting, which splits the input image into semantics-based patches instead of fixed-size ones. This section first gives an overview of a generic vision transformer and then describes the working mechanism of IntelPVT.

3.1 The architecture of a generic vision transformer

The architecture of a typical Vision Transformer consists of three modules: a patch embedding module, a multi-head self-attention module, and a feed-forward multi-layer perceptron (MLP).

The first module, the patch embedding module, takes an input image and transforms it into a sequence of tokens. These tokens are then fed to the multi-head self-attention module and the multi-layer perceptron to obtain the final recognition. Because our work introduces an intelligent patching mechanism, we focus on the patch embedding module of the ViT.

3.2 The mechanism of generic patch embedding module

Because a sequence-based (NLP) Transformer accepts only sequential inputs, the patch embedding module was introduced to split the input image into multiple parts, forming a sequence of patches that can then be forwarded to the Transformer. All notations used in this paper are listed in Table 1.

Table 1 Summary of notations

The patch embedding module typically performs the following functions:

  1. Division of the image into fixed-size patches at fixed positions

  2. Embedding of the patches with a linear layer

Suppose we have an input image I with feature map \(I \in \mathbb{R}^{H \times W \times C}\); for an image with equal height H and width W, this becomes \(I \in \mathbb{R}^{H \times H \times C}\).

The image \(I \in \mathbb{R}^{H \times W \times C}\) is split into fixed-size patches of size

$$w^{2} = w \times w, \quad \text{where } w = \left[ \frac{W}{N} \right]$$

If z denotes a fixed patch, the resulting sequence is written as \(z^{(i)}\), where \(z^{(i)}\) denotes the ith patch of the sequence.

Each fixed-size patch \(w^{2} = w \times w\) covers a square area of the input image. For the sequence of N patches \(z^{(i)}\), the center of the ith patch is \(\left(x_{ct}^{(i)},\; y_{ct}^{(i)}\right)\).

Because the size of the ith patch \(z^{(i)}\) is fixed at \(w^{2} = w \times w\), its corners can be calculated as: Left-Top corner \(\left(x_{ct}^{(i)} - \frac{w}{2},\; y_{ct}^{(i)} - \frac{w}{2}\right)\) and Right-Bottom corner \(\left(x_{ct}^{(i)} + \frac{w}{2},\; y_{ct}^{(i)} + \frac{w}{2}\right)\).
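As a concrete illustration (the numbers here are ours, chosen only for clarity), take \(w = 16\) and a patch centered at \(\left(x_{ct}^{(i)},\; y_{ct}^{(i)}\right) = (24, 24)\):

$$\text{Left-Top} = \left(24 - \tfrac{16}{2},\; 24 - \tfrac{16}{2}\right) = (16, 16), \qquad \text{Right-Bottom} = \left(24 + \tfrac{16}{2},\; 24 + \tfrac{16}{2}\right) = (32, 32)$$

A \(224 \times 224\) image split into such patches yields \(14 \times 14 = 196\) patches in the sequence.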

Moreover, every pixel \(\vec{p}^{\,(i,j)}\) inside the region \(w^{2} = w \times w\) can be expressed by its coordinates \(\left(p_{x}^{(i,j)},\; p_{y}^{(i,j)}\right)\), where \(\vec{p}^{\,(i,j)} \in \mathbb{Z}^{2}\).

For every pixel \(\vec{p}^{\,(i,j)}\) (the jth pixel of the ith patch), its feature vector is denoted by \(\widetilde{f}^{(i,j)}\).

Similarly, collecting the feature vectors over all pixel coordinates \(\left(p_{x}^{(i,j)},\; p_{y}^{(i,j)}\right)\) of the ith patch gives the feature set \(\left\{\widetilde{f}^{(i,j)}\right\}_{1 \le j \le N}\).

After extraction, the feature set \(\left\{\widetilde{f}^{(i,j)}\right\}_{1 \le j \le N}\) is flattened and fed to a linear layer to obtain the patch embedding:

$$z^{(i)} = W_{patch} \ast \mathrm{concat}\left\{ \widetilde{f}^{(i,j)} \right\}_{1 \le j \le N}, \quad 1 \le i \le N$$
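To make this pipeline concrete, the following is a minimal PyTorch sketch of the generic fixed-size patch embedding described above; it is not the exact implementation used in our experiments, and the class name, the default patch size of 16, and the embedding dimension of 768 are illustrative choices.

```python
import torch
import torch.nn as nn

class FixedPatchEmbedding(nn.Module):
    """Generic fixed-size patch embedding: split I in R^{H x W x C} into N patches
    of size w x w, flatten each patch, and project it with a shared linear layer."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        self.w = patch_size
        # W_patch in the text: one linear projection shared by every flattened patch
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W) with H and W divisible by the patch size w
        # unfold extracts all non-overlapping w x w patches and flattens each one:
        # result shape (B, C*w*w, N) with N = (H/w) * (W/w)
        patches = nn.functional.unfold(image, kernel_size=self.w, stride=self.w)
        patches = patches.transpose(1, 2)          # (B, N, C*w*w): one row per patch z^(i)
        return self.proj(patches)                  # (B, N, embed_dim) token sequence

# Example: a 224 x 224 RGB image with w = 16 yields N = 14 * 14 = 196 tokens
tokens = FixedPatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

An equivalent embedding is often written as a strided convolution with kernel size and stride equal to w; the unfold-plus-linear form above is used here only because it mirrors the equation.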

The self-attention module of the ViT aggregates the sequence-wise, patch-relative information by means of the assigned tokens. It takes three feature inputs, the query \(Q_h\), key \(K_h\), and value \(V_h\), such that \(Q_h, K_h, V_h \in \mathbb{R}^{N \times d}\).

The attention map captures the similarity between patches by multiplying \(Q_h\) with \(K_h\):

$$Attn_{h} = \mathrm{softmax}\left( \frac{Q_{h} K_{h}^{\top}}{\sqrt{d}} \right)$$

Finally, the per-head outputs are weighted by \(V_h\), concatenated over the heads, and projected:

$$Z = W_{h} \ast \mathrm{concat}\left\{ Attn_{h} V_{h} \right\}_{1 \le h \le H}$$
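For completeness, the two equations above can be realized by the following PyTorch sketch of multi-head self-attention; the head count and dimensions are illustrative defaults rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Self-attention over the N patch tokens:
    Attn_h = softmax(Q_h K_h^T / sqrt(d)),  Z = W_h * concat_h(Attn_h V_h)."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h, self.d = num_heads, embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # produces Q_h, K_h, V_h for all heads
        self.out = nn.Linear(embed_dim, embed_dim)       # W_h in the text

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N, _ = tokens.shape
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        # reshape to (B, heads, N, d) so each head attends independently
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)  # (B, h, N, N)
        z = (attn @ v).transpose(1, 2).reshape(B, N, self.h * self.d)          # concat over heads
        return self.out(z)
```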

Figure 1 shows an input image taken at random from the MS Coco dataset used to evaluate IntelPVT. Figure 2 shows a random object taken from the input image to illustrate the expected response of IntelPVT on any object. Figure 3 shows the random object first split into equal-sized patches, as happens in generic ViTs. Figure 4 shows the input image split into a series of patches following the generic ViT mechanism, and Figure 5 shows all the resulting patches aligned in a single row to form an ordered sequence.

Fig. 1

Input Image

Fig. 2

Splitting of patches

Fig. 3

Splitting of patches

Fig. 4

Splitting of patches

Fig. 5

Series of splitting images

3.3 Architecture of IntelPVT

The Intelligent Patch-based Pyramid Vision Transformer (IntelPVT) predicts the size and position of each patch through the Intel-Patch module plugged into a standard ViT, and therefore handles object semantics more efficiently than current ViTs. In contrast to vanilla fixed patches, whose sizes are rigid, invariable, and offer no flexibility with respect to object semantics, we introduce the intelligent patching method described below.

3.4 Prediction of Intel-based patch offset

Instead of relying on fixed patch dimensions, we first predict the scale and location of a rectangular intelligent patch from the input image. For this prediction, an offset (δx, δy) is computed relative to the original patch center (x, y), allowing the center to shift by (δx, δy). From the new, offset center, the aspect ratio of the patch is predicted as (sh × sw).

A new rectangular region is then determined by its coordinates (x1, y1) and (x2, y2), the top-left and bottom-right corners of the Intel-Patch, respectively. To predict these parameters, we augment the network with a module that predicts the center offset and aspect ratio of the new patch from the input feature map. The offset (δx, δy) is a trainable parameter initialized to zero, so that training starts from the vanilla fixed patch.
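One plausible way to realize this prediction step is a small convolutional head over the feature map, sketched below in PyTorch; the layer shape, the scaling of the raw outputs, and the class name IntelOffsetPredictor are our illustrative assumptions, with the only detail taken from the text being the zero initialization of the offset parameters.

```python
import torch
import torch.nn as nn

class IntelOffsetPredictor(nn.Module):
    """Predict, for every patch, a center offset (dx, dy) and two scale factors
    (kept in (x, y) order to pair with the offsets) that turn a fixed w x w patch
    into a variable-size, semantics-aware patch."""

    def __init__(self, in_channels: int, patch_size: int = 16):
        super().__init__()
        self.w = patch_size
        # one prediction per w x w cell of the feature map: (dx, dy, two scale terms)
        self.head = nn.Conv2d(in_channels, 4, kernel_size=patch_size, stride=patch_size)
        nn.init.zeros_(self.head.weight)   # start from zero offsets ...
        nn.init.zeros_(self.head.bias)     # ... so training begins at the vanilla fixed patch

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W); output: (B, N, 2) offsets and (B, N, 2) scales, N = (H/w)*(W/w)
        pred = self.head(feat).flatten(2).transpose(1, 2)
        dxdy = pred[..., :2] * self.w      # offsets measured in pixels
        scale = 1.0 + pred[..., 2:]        # scale factors around 1 (1 = original size)
        return dxdy, scale
```

With this parameterization, the new patch has center (x + δx, y + δy) and half-extents of scale · w/2 per side, from which the corners (x1, y1) and (x2, y2) follow directly.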

3.5 Extraction of features from the predicted regions

Once the intelligent rectangular regions have been predicted, we extract features from each predicted patch. Because the patches have different scales and aspect ratios, extracting their features is challenging. We address this by placing a uniformly distributed grid of (k × k) sample points inside each predicted region. After the sample points are assigned, the sampled patches are flattened through a linear layer for embedding and then reassembled, so that the objects of interest can be detected efficiently and more intelligently.
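A possible sketch of this sampling step is given below, using bilinear interpolation (torch.nn.functional.grid_sample) to read a uniform k × k lattice of points inside each predicted rectangle; the function name and the choice of bilinear sampling are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def sample_patch_features(feat: torch.Tensor, boxes: torch.Tensor, k: int = 4) -> torch.Tensor:
    """feat: (B, C, H, W) feature map; boxes: (B, N, 4) as (x1, y1, x2, y2) in pixels.
    Returns (B, N, C*k*k): k x k uniformly spaced samples per predicted patch, flattened."""
    B, C, H, W = feat.shape
    N = boxes.shape[1]
    # uniform k x k lattice in [0, 1]^2, mapped into each rectangle
    t = torch.linspace(0.0, 1.0, k, device=feat.device)
    gy, gx = torch.meshgrid(t, t, indexing="ij")                      # (k, k)
    x1, y1, x2, y2 = boxes.unbind(-1)                                 # each (B, N)
    px = x1[..., None, None] + gx * (x2 - x1)[..., None, None]        # (B, N, k, k)
    py = y1[..., None, None] + gy * (y2 - y1)[..., None, None]
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack((2 * px / (W - 1) - 1, 2 * py / (H - 1) - 1), dim=-1)
    grid = grid.view(B, N * k, k, 2)                                  # (B, N*k, k, 2)
    samples = F.grid_sample(feat, grid, align_corners=True)           # (B, C, N*k, k)
    return samples.reshape(B, C, N, k * k).permute(0, 2, 1, 3).reshape(B, N, C * k * k)
```

Bilinear sampling keeps the operation differentiable with respect to the predicted offsets and scales, which is what allows them to be learned end to end.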

3.6 The proposed architecture with Intel-Patched module

We use the following figures to illustrate our proposed architecture with the Intel-Patch module.

In Figure 6, an object in the input image is targeted for detection. First, the standard ViT equal-sized patch is computed: the center coordinates (xci, yci) are predicted, followed by the boundary corners (xci ± w, yci ± w). The IntelPVT offsets (dx, dy) are then predicted and applied to form the intelligence-based patches. These patches are fed into the regular ViT flattening layer to finally detect the object.

Fig. 6

Proposed architecture with Intel-Patched module

Figure 7 shows the starting point of IntelPVT: the equal-sized splitting of patches, whose (x, y) coordinates are calculated and taken as input by IntelPVT. Figure 8 shows IntelPVT setting the new Intel offsets (dx, dy) according to the intelligent patching mechanism, and Figure 9 shows the patches being split with the uneven aspect ratios defined by IntelPVT instead of equal sizes.

Fig. 7

Splitting Patches

Fig. 8

New Intel Offsets

Fig. 9

Splitting image based on Intel Offset

Figure 10 shows an example of the final output image, produced when a random image is given to the trained IntelPVT.

Fig. 10

Final Image

3.7 Pseudo code for IntelPVT

The pseudo code of the proposed method is in Algorithm 1.

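As a complement to Algorithm 1, the following hedged Python sketch shows how the steps of Sections 3.4 and 3.5 could be composed into a single Intel-Patch embedding module; it reuses the IntelOffsetPredictor and sample_patch_features sketches given earlier, and all remaining names and design details are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class IntelPatchEmbedding(nn.Module):
    """Sketch of the Intel-Patch pipeline: fixed grid of patch centers -> predicted
    offsets and scales -> variable-size rectangles -> k x k sampled features -> embedding."""

    def __init__(self, in_channels: int = 3, patch_size: int = 16,
                 embed_dim: int = 768, k: int = 4):
        super().__init__()
        self.w, self.k = patch_size, k
        # Section 3.4 sketch; here applied directly to the image for simplicity,
        # though a shallow feature map could be used instead
        self.offset_head = IntelOffsetPredictor(in_channels, patch_size)
        self.proj = nn.Linear(in_channels * k * k, embed_dim)   # embedding layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        B, _, H, W = image.shape
        # 1. fixed grid of patch centers (x_ct, y_ct), one per w x w cell
        ys = torch.arange(self.w // 2, H, self.w, device=image.device, dtype=image.dtype)
        xs = torch.arange(self.w // 2, W, self.w, device=image.device, dtype=image.dtype)
        cy, cx = torch.meshgrid(ys, xs, indexing="ij")
        centers = torch.stack((cx, cy), dim=-1).view(1, -1, 2).expand(B, -1, -1)  # (B, N, 2)
        # 2. predicted offsets (dx, dy) and scales applied to each center
        dxdy, scale = self.offset_head(image)
        centers = centers + dxdy
        half = 0.5 * self.w * scale                                       # half-extents per patch
        boxes = torch.cat((centers - half, centers + half), dim=-1)       # (x1, y1, x2, y2)
        # 3. sample a uniform k x k grid inside every predicted patch and embed it
        feats = sample_patch_features(image, boxes, self.k)               # (B, N, C*k*k)
        return self.proj(feats)                                           # token sequence for the ViT
```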

3.8 Assumptions

The following assumptions were made during the simulation phase:

  1. The images used in the experiments are of high quality and do not contain any noise or distortion.

  2. The objects in the images are well-defined and do not overlap with each other.

  3. The proposed method can identify the important semantic content in the images and extract patches that capture this content.

  4. The proposed method can learn the relationships between the patches and use this information to improve the performance of object detection and classification.

4 Results and discussion

4.1 Performance of IntelPVT with other techniques

The following tables show how well IntelPVT-Tiny, our proposed technique, performs in comparison with other advanced methods on three widely recognized datasets: MS Coco [25], Pascal VOC [26], and Cityscapes [27]. The results indicate that IntelPVT-Tiny stands out and performs better across various evaluation criteria.

4.2 MS Coco dataset results

4.2.1 RetinaNet 1x:

IntelPVT-Tiny achieved an improved Mean Average Precision (mAP) of 39.5, with notable increases in mAP50 (60.4) and mAP75 (41.8). However, there were slight decreases in mAPs (23.7) and mAPm (43.2), while mAPL saw a substantial increase (52.2). PVT-Tiny attained a mAP of 36.7, with higher scores in mAP50 (59.2) and mAP75 (39.3), showing impressive performance in detecting objects at these confidence thresholds. ResNet18 had a mAP of 31.2, with lower scores across the board, except for mAP75 (40.1). ResNet50 achieved a mAP of 38.2, showing consistent results across mAP, mAP50, and mAP75. Among the PVT models, PVT-Small had a mAP of 36.3, excelling particularly in mAPs (44.3), while PVT-Medium achieved 35.6 mAP, with a higher score in mAPm (45.7). PVT-Large reached a mAP of 38.8, with stronger performance in mAP50 (55.6). The results of RetinaNet 1X on the MS COCO dataset are displayed in Table 2.

Table 2 Results of MS Coco dataset in comparison with other techniques

4.2.2 RetinaNet 3x:

IntelPVT-Tiny demonstrated improvement with a mAP of 41.9, and significant gains in mAP50 (62.9) and mAP75 (44.9), while experiencing a decrease in mAPs (26.4) and mAPm (45.6). mAPL notably increased (55.2). PVT-Tiny achieved a mAP of 37.2, with higher scores in mAP50 (60.3) and mAP75 (40.2). ResNet18 had a mAP of 32.8, with lower scores across all thresholds except mAP75 (43.3). ResNet50 reached a mAP of 33.6, with consistent performance across different thresholds. Among the PVT models, PVT-Small achieved a mAP of 33.3, excelling in mAPs (44.3). PVT-Medium achieved a mAP of 39.6, with stronger performance in mAPm (47.2), while PVT-Large attained a mAP of 38.1, excelling in mAP50 (53.5).

In both RetinaNet 1x and RetinaNet 3x setups, the IntelPVT-Tiny model showed promising improvements in specific metrics, particularly in higher confidence thresholds, making it a notable performer in object detection tasks. The results of RetinaNet 3X on the MS COCO dataset are displayed in Table 3.

Table 3 Results of MS Coco dataset in comparison with other techniques

4.3 Pascal VOC dataset results

4.3.1 RetinaNet 1x:

IntelPVT-Tiny achieved an improved mean average precision (mAP) of 40.5, with notable increases in mAP50 (58.3) and mAP75 (45.7). However, there was a slight decrease in mAPs (33.7), while mAPm (57.2) and mAPL (39.4) saw significant increases. PVT-Tiny attained a mAP of 35.3, with higher scores in mAP50 (57.1) and mAP75 (41.1). ResNet18 had a mAP of 30.7, with lower scores across the board, except for mAP75 (43.1). ResNet50 achieved a mAP of 35.3, with consistent results across mAP, mAP50, and mAP75. Among the PVT models, PVT-Small achieved a mAP of 31.7, with notable performance in mAP50 (52.8), while PVT-Medium achieved 32.4 mAP, with higher scores in mAPm (48.7). PVT-Large reached a mAP of 33.3, excelling in mAP50 (54.2). The results of RetinaNet 1X on the Pascal VOC dataset are displayed in Table 4.

Table 4 Results of Pascal VOC dataset in comparison with other techniques

4.3.2 RetinaNet 3x:

IntelPVT-Tiny demonstrated improvement with a mAP of 36.7, and notable gains in mAP50 (54.2) and mAP75 (40.9), while experiencing a decrease in mAPs (35.8). mAPm (52.6) and mAPL (34.2) also increased. PVT-Tiny achieved a mAP of 31.2, with higher scores in mAP50 (47.8) and mAP75 (39.6). ResNet18 reached a mAP of 35.6, with consistent performance across different thresholds, especially in mAPs (43.0) and mAPm (38.0). ResNet50 had a mAP of 36.1, excelling in mAP50 (50.7) and mAP75 (38.5). Among the PVT models, PVT-Small achieved a mAP of 35.5, with impressive performance in mAPm (46.3). PVT-Medium attained a mAP of 33.5, excelling in mAP50 (53.1), while PVT-Large achieved 31.2 mAP, with notable performance in mAPs (37.1) and mAPm (34.8). These results highlight the varying performance of different backbones across different evaluation metrics, indicating strengths and weaknesses in object detection tasks. The results of RetinaNet 3X on the Pascal VOC dataset are displayed in Table 5.

Table 5 Results of Pascal VOC dataset in comparison with other techniques


4.4 Cityscapes dataset results

4.4.1 RetinaNet 1x:

In the context of RetinaNet 1x, the IntelPVT-Tiny (ours) backbone exhibited an enhanced mean average precision (mAP) of 41.2, highlighting noticeable improvements in mAP50 (59.8) and mAP75 (43.5). Although there was a slight increase in mAPs (38.7) and mAPm (48.2), there was a decrease in mAPL (53.1). PVT-Tiny achieved a mAP of 36.7, demonstrating impressive performance in mAP50 (53.2) and mAP75 (37.3). In contrast, ResNet18 obtained a modest mAP of 31.2, characterized by lower scores across various evaluation thresholds, except for mAP75 (43.1). On the other hand, ResNet50 achieved a higher mAP of 38.2, showing consistency in mAP, mAP50, and mAP75. Among the PVT models, PVT-Small achieved a mAP of 31.8 with notable performance in mAP50 (52.6), while PVT-Medium achieved a mAP of 32.3, particularly excelling in mAPm (48.5). PVT-Large reached a mAP of 33.5, displaying commendable performance in mAP50 (54.0) and mAP75 (35.1). The results of RetinaNet 1X on the Cityscapes dataset are displayed in Table 6.

Table 6 Results of Cityscapes dataset in comparison with other techniques

4.4.2 RetinaNet 3x:

The IntelPVT-Tiny (ours) displayed advancement with a mAP of 41.5, notable gains in mAP50 (60.1), and mAP75 (43.7), despite encountering slight decreases in mAPs (38.9) and mAPL (53.4). PVT-Tiny achieved a mAP of 36.9, demonstrating strong results in mAP50 (53.4) and mAP75 (37.5). ResNet18 reached a mAP of 31.4, maintaining consistent performance in mAPs (36.9) and mAPm (38.0). In comparison, ResNet50 achieved a mAP of 38.4, excelling notably in mAP50 (50.7) and mAP75 (41.8). Among the PVT models, PVT-Small achieved a mAP of 31.9, demonstrating impressive performance in mAPm (46.3). PVT-Medium attained a mAP of 32.7, particularly excelling in mAP50 (53.1), whereas PVT-Large achieved a performance with a mAP of 33.6, displaying notable strengths in mAPs (37.1) and mAPm (34.8). These results collectively emphasize the distinct performance characteristics of different backbone architectures across diverse evaluation metrics, revealing their specific strengths and limitations in object detection tasks. The results of RetinaNet 3X on the Cityscapes dataset are displayed in  Table 7.

Table 7 Results of Cityscapes dataset in comparison with other techniques

4.5 Key takeaways

To sum it up, IntelPVT-Tiny proves to be a powerful technique for object detection, especially for larger objects. It consistently outperforms other methods across different datasets and evaluation metrics. However, it might struggle a bit when it comes to detecting smaller objects, depending on the dataset. This detailed analysis gives us a clear picture of IntelPVT-Tiny’s strengths and areas where it might need improvement.

5 Conclusion and future work

We proposed IntelPVT, an innovative technique for object detection and classification in which flexible-size patches are integrated into the Pyramid Vision Transformer. We addressed various questions regarding the use of unfixed, flexibly sized and positioned patches in vision transformers: their necessity, their working mechanism, and how they predict the size and position of a patch so as to preserve an image's semantic information. We outlined how to embed the Intel-Patch module into the standard ViT architecture. The technique was evaluated across diverse datasets: MS Coco, Pascal VOC, and Cityscapes. On the MS Coco dataset it achieved a remarkable mAP of 39.5, demonstrating its effectiveness in capturing intricate object details. On Pascal VOC, IntelPVT excelled with an outstanding mAP50 of 58.3, underscoring its precision in detecting objects with high confidence. On the Cityscapes dataset, IntelPVT delivered exemplary results with an mAP75 of 43.7, particularly excelling at recognizing objects at a high confidence threshold. These experimental results show that IntelPVT surpasses existing state-of-the-art methods on various metrics.

The technique also exhibited a limitation: its performance in identifying and categorizing small to medium-sized objects needs improvement. In the future, we will investigate the performance of IntelPVT in real-life object detection and classification devices and applications, and we will work to improve its performance in applications where small to medium-sized objects are prevalent.