Introduction

Object detection [1] is a fundamental task in computer vision that aims to recognize object categories and localize their positions. It has been widely used in fields such as autonomous driving, security monitoring, and medical imaging [2]. Research on high-performance object detectors [3] continues and still faces great challenges; for example, the number of parameters, the inference speed, and the convergence rate must be optimized while maintaining high detection accuracy. For this reason, researchers have devoted considerable effort to high-performance object detection algorithms.

In recent years, CNN-based [4] and Transformer-based [5] object detectors have become the dominant architectures, and models of both kinds have achieved impressive detection results. CNN-based object detectors have evolved from two-stage detectors (e.g., R-CNN [6] and Cascade R-CNN [7]) to single-stage detectors (e.g., YOLO [8] and RetinaNet [9]) and from anchor-based detectors (e.g., Faster R-CNN [10] and YOLOv4 [11]) to anchor-free detectors (e.g., FCOS [12] and ATSS [13]). Among them, the emergence of anchor-free detectors has further increased detection speed. These CNN-based detection algorithms have not only made important breakthroughs in efficiency and accuracy but also have clear advantages in extracting local and multiscale features. However, their limited ability to extract global information remains, and their detection performance still leaves a gap compared with the best current object detectors.

Since its introduction to the field, the Transformer architecture has become a research hotspot in computer vision [14], and researchers have proposed a series of Transformer-based object detection algorithms. DETR [15] was the first to apply a transformer to image object detection by splicing a CNN and a transformer in tandem: the 2D features extracted by the CNN are fed into the transformer, which strengthens global feature extraction. Its disadvantages include poor detection of small objects, a large number of parameters, and slow convergence. Deformable DETR [16] fuses the sparse sampling capability of deformable convolution with the powerful relational modelling capability of the transformer, using a small set of sampling locations to highlight where the key elements are located, which improves training and small-object detection. Its drawbacks are high complexity and a large number of parameters, the need for large amounts of training data, and still limited small-object detection. Transformer-based detectors [17] have received a great deal of academic attention because of their simple encoder-decoder architecture and remarkable detection results. The architecture simplifies the detection process into a unified paradigm, and its ability to capture long-range dependencies and perform global feature extraction with rich image semantic information makes highly accurate end-to-end object detection attainable. However, it is weak at extracting multiscale and local features. Although researchers have recently achieved good results in transformer optimization [18], compared with CNN-based detection models this line of work still suffers from a large number of parameters, slow convergence, and a limited ability to acquire local information (edges and texture), and it is difficult to balance detection accuracy against the number of parameters. This shows that although the detection pipeline is greatly simplified, the computational cost of the Transformer itself remains large, and its parameter count is difficult to reduce. For the abbreviations and full names used in this paper, please consult the glossary in Table 1.

Table 1 Abbreviations and full names

The Microsoft team’s Mobile-Former [19] parallelizes MobileNet [20] and a Transformer with a bidirectional cross-bridge, which achieves bidirectional fusion of local and global features and reduces the number of generated tokens. Its disadvantage is that the many convolutions make the patches unstable, often leading to high computational cost and performance degradation. UniFormer [21] uses convolution and transformers to extract global and local features, effectively alleviating the redundancy and dependency problems in model learning. TransXNet [22] builds a novel hybrid CNN-Transformer architecture that aggregates global information and local details with excellent detection performance. Influenced by Mobile-Former, UniFormer and TransXNet, a combination of a CNN and a transformer is considered for the above problems to enhance the extraction of local and global information. Moreover, the speed of an anchor-free end-to-end architecture without non-maximum suppression is exploited in the overall design of the detector. The current mainstream CNN-based anchor-free object detector consists of three main parts: the backbone, neck, and head. The backbone network here is ResNet, whose structure is usually left unchanged. To better realize the above, this paper rethinks the structure of the Transformer. It is found that although the encoder-decoder structure of the Transformer is favourable for extracting multiscale features and improving detection performance, it leads to an excessive number of parameters and slow convergence. Moreover, TSP-FCOS [23] accelerates training and improves multiscale feature extraction by combining FCOS with a transformer and removing the transformer decoder; however, its detection performance still needs to be improved. To reduce unnecessary computation and parameters while improving detection performance, the key components are analysed in detail and experiments are carried out on them; accordingly, the multi-head cross-attention part of the decoder is removed. The loss in the Transformer's ability to extract local and multiscale features caused by removing this part must then be compensated. Inspired by the ability of EPSANet [24] to segment the spatial information of feature maps at different scales and to establish long-range dependencies with a local multiscale channel attention mechanism, the multiscale full-text channel extraction module (MFCEM) is proposed to enhance multiscale feature extraction. The highest layer output by the backbone network is rich in local semantic information, but attention mechanisms have difficulty balancing the number of channels and parameters at this layer, so many detectors cannot extract sufficient feature information from it. Inspired by the fact that the feature grouping and channel shuffle operations of SANet [25] can balance the parameter count and the interaction of channel information, the segmented channel extracted feature attention (SCEFA) module, which incorporates the MFCEM, is proposed to solve this problem.

To further improve the encoder structure of the Transformer, multi-scale dilated attention (MSDA) is introduced, inspired by the ability of Dilateformer [26] to efficiently aggregate multiscale semantic information from different receptive fields and to reduce structural redundancy. Moreover, convolution and depthwise separable convolution from CNNs are added to enhance the acquisition of local information. The core components of the transformer, namely, the position encoding (PE), multi-head attention (MHA), and feed-forward network (FFN), are then retained and combined with the preceding and following parts to improve the transformer. FPNs [27] are mostly used in neck structures to address the challenges of multiscale detection but have long suffered from a significant drawback: information cannot be adequately exchanged and fused across nonadjacent layers (e.g., Layer 1 and Layer 3), which limits the scope of information fusion. For cross-level information fusion, inspired by TopFormer's [28] powerful hierarchical feature structure for dense prediction tasks, extensions are developed based on its design. Combining this with the improved self-attention architecture above, the aggregate feature hybrid transformer (AFHTrans) module incorporating convolution is designed and added in front of the retained FPN. Effective exchange of information across hierarchical levels is realized by fusing global multilevel information and injecting it into the semantically rich high-level features. These improvements further enhance the global and local feature extraction of the neck, improve the multiscale detection performance of the model, and avoid a large increase in the number of parameters. Finally, to improve the information injection capability, an efficient detection-information injection module is designed to fuse the original features with the high-level information.

TOOD [29] proposes a task-aligned detection head that improves detection performance by aligning classification and localization tasks in a learning-based manner. However, this detection head lacks the ability to perform feature extraction on the fused full-text information. For this purpose, an anchor-free feature extraction head (FE-Head) is designed in which the fused global and local feature map information is further extracted and processed, optimizing the feature extraction capability of full-text information for different tasks at the decoupled head.

By combining the above improvements, we design a CNN-based end-to-end anchor-free object detector, GLFTNet, that combines global and local feature extraction with a transformer (GLFT). The model not only enhances the extraction of global and local feature information from the feature maps of the whole model but also realizes cross-channel information fusion of feature maps at different levels. GLFTNet outperforms other state-of-the-art detection networks in terms of both parameter count and accuracy, achieving significant advantages. Using ResNet101 as the backbone, it is compared with other classical networks: as shown in Fig. 1, our network converges quickly, and its detection accuracy is better than that of the other networks.

Fig. 1 Comparison of the epochs and APs of our network with those of other networks on the COCO2017 dataset

In summary, the main contributions of this paper are as follows:

  1. From a microscopic perspective, feature extraction of local semantic information should be improved while balancing the number of channels against the number of parameters. To this end, this paper proposes the segmented channel extracted feature attention (SCEFA) module, built around the multiscale full-text channel extraction module (MFCEM), which efficiently extracts high-level local channel feature map information and establishes long-range channel dependencies while effectively reducing the number of parameters. The SCEFA is flexible and easy to use and can be employed in many computer vision network models.

  2. The Transformer's limited ability to extract local information comes with a large number of parameters and slow convergence. Moreover, from a macro perspective, the FPN lacks the ability to exchange and merge information across hierarchical levels. To this end, this paper proposes an aggregate feature hybrid transformer (AFHTrans) module combined with convolution and an efficient detection-information injection module, which not only improve the global and local feature extraction capability of the model and the cross-level fusion of feature information but also significantly reduce the number of computational parameters.

  3. The feature extraction head (FE-Head) is designed to address the detection head's insufficient extraction of the full-text information of the feature map. It not only has a powerful ability to acquire full-text feature information but also balances the number of parameters and the accuracy well. It can be applied to a variety of single-stage object detectors in a plug-and-play manner with significant gains.

  4. By combining the SCEFA, AFHTrans, and FE-Head, a novel GLFT object detection model incorporating a transformer is designed to improve global and local feature extraction. With fewer parameters and faster training than Transformer-based detectors, the proposed architecture achieves superior performance in benchmark comparisons, outperforming most state-of-the-art object detectors.

The remainder of this paper is organized as follows. "Related work" reviews the related literature, "Methods" describes the proposed approach, "Experiments" reports the experimental work, and "Conclusion" provides a comprehensive summary of the paper.

Related work

In this section, previous related work is reviewed, such as feature extraction of contextual information, transformers combined with convolution, and decoupled heads.

Feature extraction of contextual information

Feature extraction of contextual information, as a fundamental step in many vision tasks, has been widely studied, especially in deep learning. In general, existing studies can be divided into two categories. The first is the “decoder–encoder” style. Table 2 details the tasks and content characteristics of work on the “decoder–encoder” style of contextual feature extraction.

Table 2 Work related to the “decoder–encoder” style of feature extraction of contextual information and content characterization

The second category is the “backbone” style. As shown in Table 3, different works and specific content features of the “backbone” style of feature extraction of contextual information are detailed.

Table 3 Work related to the “backbone” style of feature extraction of contextual information and content characterization

For contextual information extracted from features, two scenarios exist. In the first scenario, when the receptive field is too small and the object is too large, missed detections may occur because the model does not perceive the presence of the object. In the second scenario, when the receptive field is too large and the object is small, the network extracts background and redundant information, which leads to false detections because the model has difficulty observing tiny objects. Table 4 lists the works and content characteristics related to these two cases of contextual feature extraction.

Table 4 Related work on two cases of feature extraction of contextual information and content characterization

Overall, these styles and scenarios have two drawbacks. First, the affinity matrix used to detect pixel information compares each pixel only with other pixels and lacks self-pixel information. Second, several methods contain redundant design operations, which makes them computationally intensive and prevents the effective establishment of long-range channel dependencies for local information. To avoid these drawbacks and to fully extract the global and local features of the object, three methods are proposed in this paper: the SCEFA for local information extraction at the high level of the network, the AFHTrans for global and local information extraction and for establishing long-distance pixel relationships of objects, and the FE-Head for global feature information extraction at the head.

Transformer combined with convolution

The local receptive fields and shared weights of a CNN effectively capture local feature information in an image and reduce the number of computational parameters. The high-dimensional feature representation of a CNN converts the information in an image into high-level semantic information for better feature extraction. Moreover, CNNs are translation invariant, which improves the generalization of the model. However, CNNs lack the ability to extract global features and cannot effectively capture global context or establish long-range pixel dependencies for objects. The Transformer, in contrast, has powerful context awareness and global feature extraction ability but still suffers from two drawbacks: it is demanding in terms of training and data, and it occupies a large number of computational parameters. To solve these problems, many researchers have combined the advantages of CNNs and transformers and have carried out much work. As shown in Table 5, the different works combining CNNs and transformers and their content characteristics are detailed.

Table 5 Related work on combining convolution and transformers and content characterization

Drawing on the merits of the above work, the AFHTrans module proposed in this paper incorporates convolution and depthwise separable convolution, MHA, MSDA, and the FFN as core components. Convolution and depthwise separable convolution are used at the front of the module to extract local features; in the middle, MHA, MSDA, and the FFN expand the receptive field while extracting global information; and at the back, detection information is efficiently injected into the middle and high levels of the FPN to fuse global and local feature information. This module effectively combines the advantages of CNNs and Transformers to enhance global and local feature extraction, establishes long-range interdependencies between object pixels, optimizes the number of computational parameters, and achieves comprehensive and accurate detection of object information.

Decoupled head

Decoupled heads have recently become the standard structure for mainstream object detection. There have been tremendous breakthroughs in the research on decoupled heads. As shown in Table 6, the different recent works performed on decoupled heads and the content characteristics are detailed.

Table 6 Related work on decoupled heads and content characterization

The above work illustrates the importance of decoupling and of processing task-specific feature information in the head. To this end, the FE-Head, which uses a decoupled structure, is proposed to comprehensively improve the acquisition of full-text information from the head feature maps, enabling the decoupled head to fully utilize the extracted features for the classification and localization tasks.

Methods

In this study, we propose the GLFTNet detection model, which combines the SCEFA module, the AFHTrans module, and the FE-Head. The overall model is first described in “GLFTNet”; then, the SCEFA module, the AFHTrans module, and the FE-Head are presented in “Segmented channel extraction feature attention module”, “The convolution-based aggregated feature hybrid transformer module”, and “Feature extraction head”, respectively.

GLFTNet

As shown in Fig. 2, this section gives an overview of the GLFT framework, which performs global and local feature extraction for object detection in conjunction with the transformer.

Fig. 2 Overview of the overall GLFT framework diagram

First, image features are extracted by the ResNet50 or ResNet101 backbone to generate a multiscale feature pyramid with several layers. The high-level output of the pyramid is processed by the SCEFA module to identify the important information in the feature map: the channels are split into groups, each group is segmented and its features are extracted, and the channel and spatial information of the resulting multiscale channel feature maps is integrated within each group to obtain global and local channel interaction attention that adaptively weights the channels. The aggregated output feature maps then undergo a channel feature adjustment operation, and a channel shuffle operation is used for intergroup communication to output features enriched with local information. Second, the output feature maps of three layers, compressed to the same number of channels, are aggregated by the AFHTrans module, which extracts their spatial information, augments the global and local pixel features through the similarity of object pixels, establishes long-distance object pixel dependencies, and captures semantic dependency information of channels at different scales. The processed information is then injected into the second and third layers by the efficient detection-information injection module, and the features of all layers are passed laterally into the top-down FPN for feature fusion. Finally, the fused information is passed to the FE-Head for full-text feature extraction in the classification and regression branches.

Segmented channel extraction feature attention module

According to previous research [33, 49,50,51,52], extracting multiscale features can effectively improve the detection of objects of different sizes. Features at different scales are extracted by ResNet50 or ResNet101 to build a multiscale feature pyramid at different resolutions. Low-level high-resolution feature maps contain more spatial information than high-level low-resolution ones and are suitable for detecting small objects, whereas high-level low-resolution feature maps contain richer semantic information and are suitable for detecting large objects. A structure that preserves the FPN is used at the back end of the model to fuse the efficiently acquired global and local multiscale information in a top-down manner, enriching and effectively utilizing the low-level spatial information.

For this reason, the SCEFA module is designed to extract high-level local features. As shown in Fig. 3, the SCEFA module contains four parts.

Fig. 3 Detailed structure of the SCEFA block

Feature grouping

To reduce the number of computational parameters, features are extracted after splitting the channel dimension into groups. The information of the high-level local feature map is first split into multiple groups along the channel dimension. Suppose the input feature is \(X\in {R}^{C\times H\times W}\) (where the number of channels C is 2048, H is the height, and W is the width); the input X is split into Z groups along the channel dimension: \(X=\left[{X}_{1},{X}_{2},\dots ,{X}_{Z}\right],\;{X}_{z}\in {R}^{C/Z\times H\times W}\).
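For illustration, the following minimal PyTorch sketch shows this grouping step, assuming the usual (batch, C, H, W) tensor layout; the group count Z = 8 and the spatial size are illustrative choices, not values fixed by the paper.

```python
# A minimal sketch of the feature-grouping step: split the channel dimension of a
# high-level feature map (C = 2048, per the text) into Z equal groups.
import torch

def split_into_groups(x: torch.Tensor, z: int):
    """Split channels into Z groups: X -> [X_1, ..., X_Z], each in R^{C/Z x H x W}."""
    return torch.chunk(x, z, dim=1)

x = torch.randn(1, 2048, 25, 34)      # high-level feature map from the backbone
groups = split_into_groups(x, z=8)    # 8 groups of 256 channels each
print(len(groups), groups[0].shape)   # 8, torch.Size([1, 256, 25, 34])
```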

Segmentation channels for extracting multiscale features

Each group of features is then split into two paths, and fine contextual feature information is extracted from both paths by the proposed multiscale full-text channel extraction module (MFCEM), which is the core of the SCEFA module. The MFCEM, shown in Fig. 4, uses group convolutions with different kernels to extract high-level local multiscale channel feature information, obtaining feature maps with different receptive fields and resolutions. Each branch of the group convolution can learn multiscale spatial information over a wide range of receptive fields. A global text (GT) module is added after each group-convolution branch of the weighted path to model the global context while saving parameters. The GT module (shown in the gray dashed box) consists of an attention head (blue dashed box) and a transform neck (yellow dashed box). In the attention head, the feature maps produced by channel normalization and Softmax [53] are multiplied with the original input feature maps to judge the importance of the channel weighting information. The transform neck combines an extraction module, layer normalization, and ReLU [54]; it further extracts channel features and effectively reduces the number of parameters of the MFCEM. The main purpose of the GT module is to encode the relevant information of the feature map into a full-text attention map, enabling the MFCEM to correctly discriminate the features of different scales or dimensions that are needed, enhancing its adaptability to multiscale features, and providing more global channel information. CANet [55] observes that considering only channel information encoding ignores location information and that capturing only local information misses long-range global information; it remedies these deficiencies by embedding spatial attention into channel attention with two one-dimensional global pooling structures. In our MFCEM, by contrast, global textual information is extracted by segmenting the channels of the input feature map through multiple group convolutions.

Fig. 4 Detailed structure of the MFCEM

In detail, the feature \({X}_{y}\) is split into two paths along the channel dimension: \({X}_{y1},{X}_{y2}\in {R}^{C/2Z\times H\times W}\). In the MFCEM, the two multibranch group convolutions first divide the channels into C/4 segments, reducing the number of computational parameters. The input feature map is compressed at different resolutions to obtain rich feature information, and cross-channel information interaction is realized while multiscale spatial information is extracted. The GT module then follows the structure of GCNet [56]: the feature maps generated by each group-convolution branch are converted into weights that discriminate the important features in the attention head of the GT module, and the important feature information is extracted by the transform neck, which further reduces the computational cost. The MFCEM thus generates global and local channel information; the detailed process is formulated as follows:

$$\begin{array}{l}{\Phi }_{j}={Conv}_{i}\left({k}_{u},{g}_{v}\right),\; j=1,2,\\ {\beta }_{i}=\sigma \left(SN\left({\Phi }_{1}\left({X}_{yni}\right)\right)\right)\times {\Phi }_{1}\left({X}_{yni}\right),\; i=1,2,3,4,\\ {\rho }_{i}={ET}_{2}\left(\delta \left(LN\left({ET}_{1}\left({\beta }_{i}\right)\right)\right)\right),\end{array}$$
(1)

Here, \({X}_{yni}\in {R}^{C/8Z\times H\times W}\), \(n=1,2\), and \({\Phi }_{j}\), \(j=1,2\), represent the group convolution of the weighted path and of the initial path in the ith stage, respectively, where the convolution kernel size is \({k}_{u}\) and the convolution group size is \({g}_{v}\). SN denotes channel normalization, and \({\beta }_{i}\) is the channel attention map output by the attention head of the GT module. ET stands for the extraction module, and LN stands for layer normalization. \({\rho }_{i}\) is the channel attention feature map output by the transform neck of the GT module. \(\sigma \) denotes the Softmax function, and \(\delta \) denotes the ReLU activation function.
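As an illustration of how one GT branch can be read from Eq. (1), the following hedged PyTorch sketch implements the attention head (channel normalization, Softmax, rescaling) and the transform neck (ET, LN, ReLU, ET). The concrete layer choices, GroupNorm standing in for the channel normalization SN and 1 × 1 convolutions standing in for the extraction modules ET, are our assumptions, not the authors' exact implementation.

```python
# A hedged sketch of one GT (global text) branch following Eq. (1).
import torch
from torch import nn

class GTBranch(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.sn = nn.GroupNorm(1, channels)                                    # stand-in for SN
        self.et1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)   # ET_1
        self.ln = nn.GroupNorm(1, channels // reduction)                       # LN on reduced channels
        self.et2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)   # ET_2

    def forward(self, phi1_x: torch.Tensor) -> torch.Tensor:
        # Attention head: beta = Softmax(SN(phi1_x)) * phi1_x (softmax over channels).
        beta = torch.softmax(self.sn(phi1_x), dim=1) * phi1_x
        # Transform neck: rho = ET_2(ReLU(LN(ET_1(beta)))).
        return self.et2(torch.relu(self.ln(self.et1(beta))))

rho = GTBranch(64)(torch.randn(1, 64, 25, 34))   # illustrative channel count
```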

Finally, the output channel feature maps of the branches are aggregated, and the resulting global multiscale channel feature map is recalibrated by Softmax into weights with corresponding probability values, realizing the interaction between local and global channel information. The final output feature map is then generated by multiplying the recalibrated weights with the feature maps produced by the group convolutions of the initial path. Overall, this approach enables the two paths to simultaneously extract multiscale spatial information for different channels, establish long-distance channel dependencies, and effectively minimize the impact of the channel count on the number of parameters. The overall process of the MFCEM is formulated as follows:

$$\begin{array}{l}{\alpha }_{i}={\Phi }_{2}\left({X}_{yni}\right),\; i=1,2,3,4,\\ {Y}_{n}=\sigma \left(Cat\left({\rho }_{1},\dots ,{\rho }_{4}\right)\right)*Cat\left({\alpha }_{1},\dots ,{\alpha }_{4}\right),\end{array}$$
(2)

Here, \({Y}_{n}\in {R}^{C/2Z\times H\times W}\), \(n=1,2\), represents the final output feature map, \({\alpha }_{i}\) represents the output feature map of the group convolution of the initial path in the ith stage, and \(Cat\) denotes the concatenation operation.
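A minimal sketch of the aggregation in Eq. (2) follows, assuming the four GT outputs \({\rho }_{i}\) and the four initial-path outputs \({\alpha }_{i}\) are already computed; the shapes are illustrative.

```python
# Sketch of Eq. (2): turn the concatenated GT outputs into softmax weights and
# rescale the concatenated initial-path features.
import torch

rho = [torch.randn(1, 32, 25, 34) for _ in range(4)]     # GT-branch outputs rho_1..rho_4
alpha = [torch.randn(1, 32, 25, 34) for _ in range(4)]   # initial-path outputs alpha_1..alpha_4

weights = torch.softmax(torch.cat(rho, dim=1), dim=1)    # sigma(Cat(rho_1, ..., rho_4))
y = weights * torch.cat(alpha, dim=1)                    # Y_n = weights * Cat(alpha_1, ..., alpha_4)
print(y.shape)                                           # torch.Size([1, 128, 25, 34])
```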

Channel feature adjustment

As shown in Fig. 3, the fused output feature maps obtained by concatenating the channel information extracted from the segmented channels are first fed into adaptive global average pooling (AGAP) and adaptive global maximum pooling (AGMP), which condition and fuse the channel and spatial information, respectively, to better adapt to the global and local channel information extracted by each group. The pooled feature weights are then further processed by the fully connected layer F1, ReLU, the fully connected layer F2, and the sigmoid [57] function to make maximum use of the useful weight information. Finally, the feature weights are multiplied with the fused original output feature map to achieve feature adjustment. The detailed procedure is formulated as follows:

$$\begin{array}{l}{\gamma }_{m}=Cat\left({Y}_{1},{Y}_{2}\right),\\ {O}_{m}=Sigmoid\left(F2\left(\delta \left(F1\left(\varphi \left({\gamma }_{m}\right)+\omega \left({\gamma }_{m}\right)\right)\right)\right)\right)*{\gamma }_{m},\end{array}$$
(3)

Here, \({\gamma }_{m}\in {R}^{C/Z\times H\times W}\), \(m=1,2,\dots ,Z\); \(\varphi \) and \(\omega \) represent AGAP and AGMP, respectively. \({O}_{m}\in {R}^{C/Z\times H\times W}\) denotes the final output feature map after feature adjustment.
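A hedged PyTorch sketch of this channel feature adjustment (Eq. (3)) is given below; the bottleneck reduction ratio is our assumption.

```python
# Sketch of Eq. (3): AGAP + AGMP pooling, an F1-ReLU-F2-sigmoid gate, and rescaling.
import torch
from torch import nn

class ChannelFeatureAdjustment(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.agap = nn.AdaptiveAvgPool2d(1)   # AGAP
        self.agmp = nn.AdaptiveMaxPool2d(1)   # AGMP
        self.f1 = nn.Linear(channels, channels // reduction)
        self.f2 = nn.Linear(channels // reduction, channels)

    def forward(self, gamma: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = gamma.shape
        pooled = (self.agap(gamma) + self.agmp(gamma)).view(b, c)
        weights = torch.sigmoid(self.f2(torch.relu(self.f1(pooled)))).view(b, c, 1, 1)
        return weights * gamma                # O_m = Sigmoid(F2(ReLU(F1(...)))) * gamma_m

out = ChannelFeatureAdjustment(256)(torch.randn(2, 256, 25, 34))
```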

Channel shuffle

As in ShuffleNetV2, the SCEFA module follows the structure of SANet to improve the interaction of channel information between groups: the adjusted output features undergo a channel shuffle operation for inter-group communication. The original channels are reshaped, redistributed, and mixed while the numbers of input and output channels are kept the same.
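A minimal sketch of the standard channel shuffle operation (as used in ShuffleNetV2/SANet) is shown below; the tensor sizes are illustrative.

```python
# Channel shuffle: reshape channels into (groups, channels-per-group), transpose,
# and flatten back, so information is mixed across groups with the channel count kept.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group dimensions
    return x.view(b, c, h, w)                  # flatten back to the original layout

shuffled = channel_shuffle(torch.randn(1, 256, 25, 34), groups=8)
```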

The convolution-based aggregated feature hybrid transformer module

This section details the aggregate feature hybrid transformer (AFHTrans) module combined with convolution and the efficient detection-information injection module. Figure 5a shows the general structure of the AFHTrans module. It consists of a basic convolutional section (yellow dashed box) and a transformer section (orange dashed box). From the original Transformer structure, the core parts, namely, the position encoding (PE), multi-head attention (MHA), and feed-forward network (FFN), are retained. Inspired by Dilateformer [26], multi-scale dilated attention (MSDA) is introduced, which not only achieves a trade-off between computational cost and receptive field but also further improves global dependency modelling and the effective aggregation of multiscale semantic information. The basic convolutional section and the two attention layers are then cascaded to construct a module that combines convolution and a transformer and effectively extracts global and local feature information. Thus, the AFHTrans module not only extracts multiscale channel features but also strengthens the long-range dependencies between object pixels.

Fig. 5 Main structure of AFHTrans and the structure of the efficient injection module for detecting information

The feature extraction process is divided into two stages: local feature extraction by convolution and the establishment of long-range interdependencies. The basic convolutional section not only fully extracts the pixel-level spatial information in the local feature extraction stage but also saves computation. To save GPU memory, the input feature layers are first downsampled (by average pooling) to the resolution of the top layer, and the feature maps of the three layers are fused by concatenation. The three-dimensional structure of the combined convolution and transformer block is shown in Fig. 6, where L denotes the number of convolution and transformer layers. Assume that the concatenated input feature is \(F\in {R}^{3{C}^{*}\times H\times W}\) (\({C}^{*}=256\)). The AFHTrans module is divided into a multi-scale dilated attention path (upper part of Fig. 5a) and a multi-head attention path (lower part of Fig. 5a). In the multi-head attention path, an ordinary convolution (Conv block) extracts local features and compresses the channels. In the multi-scale dilated attention path, a combination of ordinary convolution and a residual depthwise separable convolution [58] (Convs block) extracts local features while compressing the channels and extracting spatial detail information. Here, the depthwise separable convolution extracts spatial information from the feature map and improves the interaction of feature information across channels; it improves the extraction performance of the transformer while reducing the number of computational parameters. The computation is given in Eq. (4):

Fig. 6 Stereogram of the combined convolution and transformer module in AFHTrans

$$\begin{array}{l}{I}_{1}=Conv\left(F\right),\\ {I}_{2}={Dw}_{Conv}\left({I}_{1}\right)+{I}_{1},\end{array}$$
(4)

Here, \({I}_{1},{I}_{2}\in {R}^{{C}^{*}\times H\times W}\), \(Conv\) stands for ordinary convolution, and \({Dw}_{Conv}\) represents depthwise separable convolution, in which the input feature maps are filtered by depthwise convolution and the channels are then integrated by pointwise convolution.
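A hedged sketch of Eq. (4) follows: an ordinary convolution compresses the aggregated input F to \({C}^{*}\) channels, and a residual depthwise-separable convolution (depthwise 3 × 3 followed by pointwise 1 × 1) refines it. The kernel sizes are our assumptions.

```python
# Sketch of Eq. (4): I_1 = Conv(F), I_2 = Dw_Conv(I_1) + I_1.
import torch
from torch import nn

c_in, c_star = 3 * 256, 256
conv = nn.Conv2d(c_in, c_star, kernel_size=3, padding=1)                  # Conv in Eq. (4)
dw = nn.Conv2d(c_star, c_star, kernel_size=3, padding=1, groups=c_star)   # depthwise part of Dw_Conv
pw = nn.Conv2d(c_star, c_star, kernel_size=1)                             # pointwise part of Dw_Conv

f = torch.randn(1, c_in, 25, 34)   # aggregated three-level feature F
i1 = conv(f)                       # I_1
i2 = pw(dw(i1)) + i1               # I_2 with the residual connection
```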

In the stage that establishes long-range interdependencies, the transformer not only effectively improves global feature acquisition but also captures multiscale contextual semantic dependencies. Throughout this stage, normalization layers are combined with residual structures. In the multi-scale dilated attention path, the input feature maps first pass through a normalization layer to generate \({Q}_{1}\), \({K}_{1}\), and \({V}_{1}\) and accelerate convergence. Then, \({Q}_{1}\), \({K}_{1}\), and \({V}_{1}\) are processed by the multi-scale dilated attention and linearly mapped by the FFN, which expands the receptive field and strengthens the relationships between object pixels. In the multi-head attention path, the feature maps output by the basic convolutional section first pass through a normalization layer to generate \({Q}_{2}\), \({K}_{2}\), and \({V}_{2}\) and accelerate convergence. \({Q}_{2}\) and \({K}_{2}\) are then corrected with position information by the PE. Afterwards, \({Q}_{2}\), \({K}_{2}\), and \({V}_{2}\) are computed by the multi-head attention mechanism and linearly mapped by the FFN to enhance the extraction of global features and spatial information. Finally, the feature maps of the two parts are aggregated to fuse rich local and global feature information. The formulas are as follows:

$$\begin{array}{l}MSDA=Softmax\left(\frac{{Q}_{1ij}{K}_{1r}^{T}}{\sqrt{{d}_{{k}_{1}}}}\right){V}_{1r},\; 1\le i\le W,\; 1\le j\le H,\\ MHA=Softmax\left(\frac{{Q}_{2}{K}_{2}^{T}}{\sqrt{{d}_{{k}_{2}}}}\right){V}_{2},\\ T=rFFN\left(rMHA\left(PE\left({I}_{1}\right),PE\left({I}_{1}\right),{I}_{1}\right)+rMSDA\left({I}_{2},{I}_{2},{I}_{2}\right)\right),\end{array}$$
(5)

Here, \(T\in {R}^{{C}^{*}\times H\times W}\) denotes the output feature map after AFHTrans fusion, \({Q}_{1ij}\) denotes the query at position (i, j) of the original feature map, and the subscript r in \({K}_{1r}\) and \({V}_{1r}\) denotes the dilation rate of the multi-scale dilated attention. The prefix r in rMHA, rMSDA, and rFFN denotes a residual operation, and PE stands for the position encoding process. By default, a normalization layer is applied before each computational step in Eq. (5) to accelerate model training.
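The hedged PyTorch sketch below illustrates how the two attention paths of Eq. (5) can be composed. It flattens the H × W map into a token sequence, adds a learned positional encoding to the queries and keys of the multi-head path, and fuses the two residual attention outputs through a feed-forward network. For brevity, the MSDA path is approximated here with standard multi-head attention; the actual MSDA restricts each head to dilated sliding windows as in Dilateformer, and the FFN width and learned PE are our assumptions.

```python
# Approximate composition of Eq. (5): T = rFFN(rMHA(PE(I1), PE(I1), I1) + rMSDA(I2, I2, I2)).
import torch
from torch import nn

c, h, w = 256, 25, 34
mha = nn.MultiheadAttention(c, num_heads=8, batch_first=True)
msda = nn.MultiheadAttention(c, num_heads=4, batch_first=True)   # stand-in for the real MSDA
ffn = nn.Sequential(nn.LayerNorm(c), nn.Linear(c, 4 * c), nn.ReLU(), nn.Linear(4 * c, c))
pe = nn.Parameter(torch.zeros(1, h * w, c))                       # learned positional encoding

i1 = torch.randn(1, c, h, w).flatten(2).transpose(1, 2)           # I_1 tokens: (1, H*W, C)
i2 = torch.randn(1, c, h, w).flatten(2).transpose(1, 2)           # I_2 tokens

mha_out, _ = mha(i1 + pe, i1 + pe, i1)                            # MHA(PE(I_1), PE(I_1), I_1)
msda_out, _ = msda(i2, i2, i2)                                    # MSDA(I_2, I_2, I_2)
s = (mha_out + i1) + (msda_out + i2)                              # rMHA + rMSDA (residual paths)
t = ffn(s) + s                                                    # rFFN: FFN with a residual connection
t = t.transpose(1, 2).reshape(1, c, h, w)                         # back to (1, C, H, W)
```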

As shown in Fig. 5b, the efficient detection-information injection module uses a simple combination of structures to inject information into the middle and high feature layers of the neck, fusing the feature maps so that global and local feature information is transmitted. Following TopFormer's information injection module, I_Local is the input path for the local network feature layer, and I_Global is the input path for the AFHTrans module output; fusing the middle-layer features requires an additional upsampling operation. The two feature sources are then fused by concatenation, and a subsequent convolution compresses the channels and improves feature extraction to generate the fused middle- and high-level feature maps. The calculation is:

$$\begin{array}{c}{O}_{\tau }=Conv\left(Cat\left({I}_{Local\left({X}^{*}\right)},Interpolate/None\left({I}_{Global\left(T\right)}\right)\right)\right),\; \tau =1,2,\end{array}$$
(6)

Here, \({O}_{\tau }\in {R}^{{C}^{*}\times H\times W}\) represents the feature map obtained by fusing the AFHTrans output with the middle or high layer, and \(Interpolate/None\) indicates that upsampling is applied for the middle layer and omitted for the high layer.
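A hedged sketch of Eq. (6) is given below; the fusion kernel size and the nearest-neighbour interpolation mode are our assumptions.

```python
# Sketch of Eq. (6): upsample the global feature when injecting into the middle layer,
# concatenate with the local feature layer, and compress back to C* channels.
import torch
from torch import nn
import torch.nn.functional as F

c_star = 256
fuse = nn.Conv2d(2 * c_star, c_star, kernel_size=3, padding=1)

def inject(local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    if global_feat.shape[-2:] != local_feat.shape[-2:]:            # middle layer: upsample first
        global_feat = F.interpolate(global_feat, size=local_feat.shape[-2:], mode="nearest")
    return fuse(torch.cat([local_feat, global_feat], dim=1))       # Conv(Cat(I_Local, I_Global))

o_mid = inject(torch.randn(1, c_star, 50, 68), torch.randn(1, c_star, 25, 34))
o_top = inject(torch.randn(1, c_star, 25, 34), torch.randn(1, c_star, 25, 34))
```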

Finally, the information injected by the AFHTrans module is passed to the FPN for feature fusion and then, after concatenation, fed into the feature extraction head for classification and localization. The process is as follows:

$$\begin{array}{c}\Delta ={\Gamma }_{cls,reg}\left(Cat\left(FPN\left({O}_{\tau 1},{O}_{\tau 2},\dots ,\Upsilon \right)\right)\right),\end{array}$$
(7)

Here, \(\Upsilon \in {R}^{{C}^{*}\times H\times W}\) represents the feature maps of the layers into which the AFHTrans module does not inject information, and FPN represents the feature fusion process of the feature pyramid network. \({\Gamma }_{cls,reg}\) denotes the processing of the feature extraction head, and \(\Delta \in {R}^{{C}^{*}\times H\times W}\) represents the final output of the detector.

Feature extraction head

Improving the traditional parallel head structure can effectively compensate for the front-end model's incomplete processing of features and provide comprehensive supervision of the final feature information. After the front-end network processing, the acquisition of local and global features and of long-range object pixel dependencies is effectively improved. However, the back-end head still lacks the extraction of global feature information from the feature maps obtained by fusing and stitching the FPN layers. Therefore, to enhance the extraction of global feature information and optimize the head structure, two aspects are considered: improving the information interaction of the detection head, and improving the detection head's attention to the feature information.

For this purpose, the feature extraction head (FE-Head), which consists of a feature interaction extractor and a feature extractor (FE), is designed as shown in Fig. 7a. Following the structure of TOOD [29], a feature interaction extractor is used to learn task-interactive features from multiple convolutional layers and thereby enhance the information interaction of the detection head, as shown in orange in Fig. 7a. This design not only facilitates information interaction but also provides multiscale representations over the different receptive fields involved in the two tasks. Assume that \({E}^{f}\in {R}^{{C}^{\prime}\times {H}^{\prime}\times {W}^{\prime}}\) (where \({C}^{\prime}\) is the number of channels, \({H}^{\prime}\) the height, and \({W}^{\prime}\) the width) is the output feature of the FPN. The feature interaction extractor is computed as follows:

Fig. 7 General structure of the FE head and the structure of the feature extractor

$$\begin{array}{c}{E}_{k}^{inter}=\delta \left({conv}_{k}\left(\dots \delta \left({conv}_{kj}\left({E}^{f}\right)\right)\dots \right)\right),\; \forall k,j\in \left\{1,2,\dots ,N\right\},\end{array}$$
(8)

Here, k and kj in \({conv}_{k}\) and \({conv}_{kj}\) index the kth convolutional layer and the jth activation of the kth convolutional layer among the N consecutive convolutional layers, respectively. Therefore, rich multiscale features can be efficiently extracted from the FPN output with a single branch.
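A minimal sketch of such a feature interaction extractor is shown below: N stacked 3 × 3 convolutions with ReLU applied to the FPN output, with each intermediate map kept as an interactive feature. The choice N = 6 and the kernel size are illustrative assumptions.

```python
# Sketch of Eq. (8): a single branch of stacked conv+ReLU layers whose intermediate
# outputs serve as the interactive features E_k^inter.
import torch
from torch import nn

class FeatureInteractionExtractor(nn.Module):
    def __init__(self, channels: int = 256, num_layers: int = 6):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_layers)]
        )

    def forward(self, e_f: torch.Tensor):
        inter_feats, x = [], e_f
        for conv in self.convs:
            x = torch.relu(conv(x))      # E_k^inter = ReLU(conv_k(previous output))
            inter_feats.append(x)
        return inter_feats

feats = FeatureInteractionExtractor()(torch.randn(1, 256, 25, 34))
```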

Feature maps rich in information are generated through these interactions. However, with a single-branch structure, it is difficult to fully obtain the global feature information required by each task, as also discussed in [45, 59]. For this purpose, the interacted features are input into two FEs that extract the features for the corresponding tasks from the attention perspective, since different tasks require different features. Therefore, as shown in Fig. 7b, a layer information extraction mechanism (yellow dashed box) is proposed that dynamically computes task-specific features at the layer level for each subtask. The formula is as follows:

$$\begin{array}{c}{E}_{k}^{task}={\varphi }_{k}+{E}_{k}^{inter},\; \forall k\in \left\{1,2,\dots ,N\right\},\end{array}$$
(9)

Here, \({\varphi }_{k}\) is the kth feature map of the layer information extraction \(\varphi \in {R}^{{C}^{\prime}\times {H}^{\prime}\times {W}^{\prime}}\). After the layer information interaction features are computed, \(\varphi \) captures the dependencies between layers while extracting global features:

$$\begin{array}{c}\varphi =Cv(Softmax(Ck({E}_{k}^{inter})\times {E}_{k}^{inter})),\end{array}$$
(10)

Here, Ck represents channel normalization, and Cv represents a sequence of convolution, a normalization layer, ReLU, and convolution. The feature map computed for the corresponding task fully extracts features, and the dimensionality-reduction step keeps the number of parameters low. \({E}^{task}\) is the concatenation of the per-layer features \({E}_{k}^{task}\). Finally, the features used for localization or classification are obtained from \({E}^{task}\):

$$\begin{array}{c}{U}^{task}={conv}_{2}\left(Sigmoid\left({conv}_{1}\left({E}^{task}\right)\right)\right),\end{array}$$
(11)

where \({U}^{task}\) is the concatenation of the \({U}_{k}^{task}\), and \({conv}_{1}\) and \({conv}_{2}\) are 1 × 1 convolutional layers that change the dimensionality. \({U}^{task}\) is converted into a dense classification score \(P\in {R}^{H\times W\times 80}\) by the sigmoid function or into a bounding box \(B\in {R}^{H\times W\times 4}\) by the distance-to-bbox method of [12, 13]. Finally, following TOOD, the spatial distribution and task alignment are further adjusted for both P and B to obtain global feature information for the classification and localization tasks, respectively. Thus, high-resolution feature map information with additional edge information is extracted to improve the classification and localization accuracy of the detector.
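The hedged sketch below illustrates one reading of the feature extractor path in Eqs. (9)-(11): layer attention is computed from the concatenated interactive features, added back per layer, and the concatenated task feature is reduced by two 1 × 1 convolutions with a sigmoid gate in between. The specific normalization layers and channel widths are assumptions rather than the authors' exact design.

```python
# Sketch of Eqs. (9)-(11) for one task branch (classification or localization).
import torch
from torch import nn

channels, n = 256, 6
ck = nn.GroupNorm(1, n * channels)                                # stand-in for channel normalization Ck
cv = nn.Sequential(nn.Conv2d(n * channels, n * channels, 1), nn.GroupNorm(1, n * channels),
                   nn.ReLU(), nn.Conv2d(n * channels, n * channels, 1))  # Cv: conv -> norm -> ReLU -> conv
conv1 = nn.Conv2d(n * channels, channels, kernel_size=1)
conv2 = nn.Conv2d(channels, channels, kernel_size=1)

inter = [torch.randn(1, channels, 25, 34) for _ in range(n)]      # E_k^inter from the interaction extractor
stacked = torch.cat(inter, dim=1)
phi = cv(torch.softmax(ck(stacked) * stacked, dim=1))             # Eq. (10)
phi_k = torch.chunk(phi, n, dim=1)                                # per-layer maps phi_k
task = torch.cat([p + e for p, e in zip(phi_k, inter)], dim=1)    # Eq. (9): E_k^task = phi_k + E_k^inter
u_task = conv2(torch.sigmoid(conv1(task)))                        # Eq. (11)
```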

Experiments

Datasets

Our approach is evaluated on two datasets: PASCAL VOC2007 + 2012 [60] and COCO2017 [61]. PASCAL VOC2007 + 2012 covers a broad range of everyday scenes and contains 27,088 images in 20 categories; 16,551 of these images are used for training and 4952 for validation. COCO2017 is a dataset of common everyday objects containing 163,957 images in 80 categories; 118,287 of these images are used for training, 5000 for validation, and 40,670 for testing. The training and validation images carry the corresponding annotation labels.

Implementation details

GLFTNet uses ResNet50 or ResNet101 pretrained on the ImageNet dataset as the backbone. The SCEFA module and the AFHTrans module are combined to form the neck, where the numbers of heads for multi-head attention and multi-scale dilated attention are set to 8 and 4, respectively, and the FE-Head serves as the head. The model is trained with stochastic gradient descent (SGD) with momentum = 0.9 and weight decay = 0.0001, using a step learning rate schedule with gamma = 0.1 and 500 linear warmup steps. The initial learning rate is set to 0.001.
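The following minimal sketch shows one way to set up this optimizer and schedule in PyTorch. The momentum, weight decay, gamma, warmup steps, and initial learning rate follow the text; the milestone epochs (8 and 11, a common choice for a 12-epoch schedule) and the placeholder model are our assumptions.

```python
# Sketch of the training configuration described above.
import torch
from torch import nn

model = nn.Conv2d(3, 64, 3)   # placeholder standing in for GLFTNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

warmup_steps = 500
for step in range(warmup_steps):                 # linear warmup over the first 500 iterations
    warmup_lr = 0.001 * (step + 1) / warmup_steps
    for group in optimizer.param_groups:
        group["lr"] = warmup_lr
    # ... one training iteration would run here ...
# scheduler.step() would then be called once per epoch after warmup.
```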

The input images are resized to 800 × 1333, and data augmentation includes operations such as random flipping and multiscale training. Training is performed for 12 epochs with a batch size of 4 on a single NVIDIA Tesla V100 GPU. Our model is optimized with two loss functions for the classification and localization tasks: focal loss for classification and GIoU loss [62] for bounding box regression. The total loss function is as follows:

$$\begin{array}{c}L={w}_{cls}{L}_{cls}+{w}_{reg}{L}_{reg},\end{array}$$
(12)

Here, \({L}_{cls}\) denotes the classification loss, and \({L}_{reg}\) denotes the bounding box regression loss. \({w}_{cls}\) and \({w}_{reg}\) are weights set to 1.0 and 2.0, respectively.
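A hedged sketch of Eq. (12) is given below, combining focal loss for classification with a GIoU-based regression loss at the stated weights of 1.0 and 2.0. The use of torchvision's ops and the (x1, y1, x2, y2) box format are our implementation choices.

```python
# Sketch of Eq. (12): L = w_cls * L_cls + w_reg * L_reg.
import torch
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def total_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, w_cls=1.0, w_reg=2.0):
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")   # focal classification loss
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))            # GIoU of index-matched pairs
    l_reg = (1.0 - giou).mean()                                             # GIoU regression loss
    return w_cls * l_cls + w_reg * l_reg

cls_logits = torch.randn(8, 20)
cls_targets = torch.zeros(8, 20); cls_targets[0, 3] = 1.0
pred = torch.tensor([[0.0, 0.0, 10.0, 10.0]] * 8)
gt = torch.tensor([[1.0, 1.0, 11.0, 11.0]] * 8)
print(total_loss(cls_logits, cls_targets, pred, gt))
```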

Evaluation metrics

In this paper, the intersection over union (IOU), average precision (AP), mean average precision (mAP), frames per second (FPS), number of parameters (Params), and number of floating point operations (FLOPs) are used as evaluation metrics. IOU measures the overlap between a predicted box and a ground-truth box and is the ratio of the intersection to the union of the detection result and the ground truth. A threshold (\({IOU}_{threshold}\)) is usually set to decide whether a predicted box is correct; it is typically 0.5, with values greater than 0.5 counted as correct predictions and values below 0.5 as invalid detections. AP and mAP (with the IOU threshold generally 0.5) evaluate the ability of the algorithm to correctly detect objects and are the most important metrics for detection algorithms. Depending on whether the predicted boxes are correct, the evaluation distinguishes four kinds of samples: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). TP is the number of predicted boxes with \(IOU>{IOU}_{threshold}\), FP is the number of predicted boxes with \(IOU\le {IOU}_{threshold}\), and FN is the number of ground-truth boxes that are not detected correctly. \({P}_{precision}\), \({R}_{recall}\), and AP are calculated as follows:

$$\begin{array}{l}{P}_{precision}=\frac{TP}{TP+FP},\\ {R}_{recall}=\frac{TP}{TP+FN},\\ AP=\displaystyle\int\limits _{0}^{1}{P}_{precision}\,d{R}_{recall},\end{array}$$
(13)

AP denotes the area under the curve formed by \({P}_{precision}\) and \({R}_{recall}\), i.e., the precision of a particular category among multiple categories. Based on the above description of AP, the mAP is given by:

$$\begin{array}{c}mAP=\frac{{\sum }_{k=1}^{C}{AP}_{k}}{C},\end{array}$$
(14)

The mAP is the average of the AP values over multiple categories, where C denotes the total number of categories and k indexes the categories. In particular, the AP reported for the COCO dataset is by default the mean over categories, i.e., the mAP. FPS is an important metric for evaluating detection speed; it is the ratio of the number of processed frames to the elapsed time. In addition, Params and FLOPs represent the space complexity and computational complexity of the model, respectively.
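For completeness, the following minimal sketch computes the IoU underlying these metrics for two boxes in (x1, y1, x2, y2) format; the example boxes are illustrative.

```python
# IoU = intersection area / union area; a prediction counts as a TP when IoU > 0.5
# under the usual threshold.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # approximately 0.143, below the 0.5 threshold
```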

Ablation study

In this section, ablation experiments are performed to verify the validity of our method. The validity of structures such as the SCEFA, AFHTrans, and FE-Head was verified by ablation experiments on the PASCAL VOC2007 + 2012 and COCO2017 datasets.

SCEFA module

According to Fig. 3, the SCEFA module is able to fully extract channel information from the split groups. First, the features are grouped and segmented to decrease the number of computational parameters. Then, as shown in Fig. 4, the MFCEM extracts rich global information from each group of segmented channels to generate an information-rich feature map; the data are first divided into two paths that pass through the group convolutions in the MFCEM, which compress the channels and produce multiscale spatial feature maps. Here, ResNet50 is used as the backbone network, and appropriate convolution kernel and group sizes for the group convolutions are determined. Based on the work of [24, 52] and considering the number of channels and the need for practical comparison, the performance of the MFCEM is tested on the PASCAL VOC2007 + 2012 dataset with different convolution kernel and group sizes; the results are shown in Table 7. The first setting, with kernel sizes 1, 3, 5, and 7 and group sizes 1, 2, 4, and 8, yields an average accuracy of 80.56%. The next setting, 1, 3, 7, and 9 with 1, 2, 2, and 8, yields 80.63%, an improvement of 0.07% over the previous setting. The following two settings use kernel sizes 3, 5, 7, and 9 and 3, 5, 5, and 7 with group sizes 1, 4, 8, and 16 and 1, 4, 4, and 16, respectively, and achieve 81.79% and 80.80%. Based on this comparison, kernel sizes 3, 5, 7, and 9 and group sizes 1, 4, 8, and 16 are selected as the group convolution parameters of the module (a configuration sketch is given after Table 7). Therefore, using multiple convolution kernels and groups of different sizes over a large range improves the speed and accuracy of parallel training, and appropriately adjusting the kernel and group sizes enhances the ability to obtain feature map information; the experimental results in the table support this conclusion.

Table 7 Variation in the convolutional group size and convolutional kernel size in the experiments
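As a hedged sketch of the selected configuration from Table 7, the snippet below builds four parallel group convolutions with kernel sizes (3, 5, 7, 9) and group counts (1, 4, 8, 16); the channel sizes are illustrative and chosen so that the group counts divide them.

```python
# Four parallel group-convolution branches with the kernel/group sizes chosen in Table 7.
import torch
from torch import nn

in_ch, out_ch = 128, 32
branches = nn.ModuleList(
    [nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, groups=g)
     for k, g in zip((3, 5, 7, 9), (1, 4, 8, 16))]
)

x = torch.randn(1, in_ch, 25, 34)
outs = [branch(x) for branch in branches]     # one multiscale feature map per branch
```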

The feature map is input to the GT module after the group convolution of the weighted path; the attention head discriminates the important information of each branch, and the transform neck fully extracts the feature information, so the GT module can extract rich global textual information. The channel attention weights are then generated by the Softmax function, which effectively establishes long-range channel dependencies and calibrates the weight information. To demonstrate the effect of the GT module and of Softmax versus the sigmoid function on the results, the corresponding structures are added in turn, as shown in Table 8. Using only the GT module (first row) achieves an mAP of 81.28%. Removing the GT module and using the Softmax function alone to obtain the weight probabilities (second row) yields an mAP of 80.93%, a decrease that indicates the GT module is indispensable. To compare Softmax with the sigmoid function, using the sigmoid function alone (third row) improves the accuracy by 0.33% over Softmax alone, confirming that sigmoid on its own performs better. However, combining the GT module with the sigmoid and Softmax functions yields mAPs of 81.58% and 81.79%, respectively, a difference of 0.21%, and the combination of the GT module and Softmax improves the result by 0.51% over the GT module alone. Therefore, the GT module and the Softmax function together better extract the important weight information and improve the detection accuracy; both parts are indispensable to the overall model.

Table 8 Ablation experiments validating the global text module (GT) and the Softmax comparison sigmoid function

The two segmented paths are fused along the channel dimension after the MFCEM of the SCEFA module extracts the global and local channel information. Channel feature adjustment then better accommodates the fused global and local channel information, and the adjusted feature maps undergo channel shuffling for inter-group communication and information interaction. To verify the effects of group segmentation, channel feature adjustment, and channel shuffling on the model performance and parameters, these operations are ablated in turn, as shown in Table 9. In the first row, removing group segmentation gives an mAP of 81.57% with 41.54 M parameters. The last two rows remove channel feature adjustment and channel shuffling in turn, giving average accuracies of 81.45% and 81.40% with 35.17 M and 36.25 M parameters, respectively. The highest accuracy is achieved with channel feature adjustment and channel shuffling, so these two processes effectively extract feature information, although without group segmentation the number of parameters is large; once group segmentation is used, the parameter counts drop by 6.37 M and 5.29 M, respectively, a substantial reduction.

Table 9 Ablation experiments were performed in the SCEFA module to verify the effects of group segmentation, channel feature adjustment, and channel shuffling

Finally, with all three processes added, the mAP further increases to 81.79%, and the number of parameters is 36.41 M, achieving a remarkable accuracy-parameter balance. Overall, the group segmentation, channel feature adjustment, and channel shuffling processes in the SCEFA module enhance the spatial adjustment of feature information and the information interaction of feature maps while reducing the number of parameters.

Finally, the validity of the SCEFA module as a whole is examined by comparing it with the structurally similar SE-Net module; the results are shown in Table 10. In the first row, replacing the SCEFA module with the SE-Net module achieves an mAP of 81.25% at 13.75 FPS. Using the SCEFA module improves the accuracy by 0.54% despite a 0.77 FPS decrease in speed. Overall, the SE-Net module is faster but does not reach the performance of the SCEFA module, which further demonstrates the importance of the SCEFA module to the overall model.

Table 10 Experimental comparison of the SE-Net and the SCEFA
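For reference, the SE-Net baseline in Table 10 corresponds to the standard squeeze-and-excitation block of Hu et al.; a minimal sketch is given below, with the commonly used reduction ratio of 16 assumed rather than taken from this comparison.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # reweight the channels
```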

AFHTrans module

For the AFHTrans block, we verify that combining convolution with the transformer improves model performance. Because the FPN lacks information interaction across layers, aggregated connections are used, and after three layers of aggregation the feature map is rich in information. Convolution is therefore fused with dual attention to strengthen the acquisition of local channel-spatial information as well as global information. In particular, embedding convolution in the transformer not only enhances spatial information extraction but also improves the interaction between channels. To this end, ablation experiments on the PASCAL VOC2007 + 2012 dataset validate the effect of the embedded convolution. As shown in Table 11, the mAP reaches 81.79% when the basic convolution section is included and drops by 0.35% to 81.44% when it is removed. Whether the basic convolution section is added to the AFHTrans module thus has a significant impact on the results, confirming that the combination of convolution and the transformer contributes to performance, since convolution captures local features and spatial location information.

Table 11 Comparison of model performance with and without the basic convolution section
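The sketch below shows one common way to embed convolution in a transformer block, applying a depthwise 3×3 convolution to the 2D feature map before the attention and feed-forward stages; it is an illustrative reading of the basic convolution section, and the head count and MLP ratio are assumptions, not the exact AFHTrans design.

```python
import torch
import torch.nn as nn

class ConvTransformerBlock(nn.Module):
    """Illustrative transformer block with an embedded depthwise convolution."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local spatial cues
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x + self.local(x)                  # embedded convolution branch
        t = x.flatten(2).transpose(1, 2)       # (B, HW, C) token sequence
        n = self.norm1(t)
        t = t + self.attn(n, n, n)[0]          # global self-attention
        t = t + self.ffn(self.norm2(t))        # feed-forward network
        return t.transpose(1, 2).view(b, c, h, w)
```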

Using ResNet50 as the backbone, the contributions of the multi-head attention and multi-scale dilated attention components were measured to verify the effectiveness of both. Table 12 shows the ablation experiments for these two components with the basic convolution section and the feed-forward neural network in place. The mAP reaches 81.41% and 81.58% when using multi-head attention alone and multi-scale dilated attention alone, respectively, while combining the two reaches 81.79%, a 0.38% improvement over multi-head attention alone. This demonstrates that combining multi-head attention with multi-scale dilated attention increases the model's gain, improving the acquisition of global features and the establishment of long-range dependencies between object pixels.

Table 12 Ablation experiments on the multi-head attention and multi-scale dilated attention components
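To make the multi-scale dilated attention component concrete, the sketch below implements a minimal single-head sliding-window attention in which each query position attends to a k×k neighbourhood of keys sampled at a given dilation rate; this is one plausible reading of dilated attention, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dilated_window_attention(q, k, v, kernel=3, dilation=2):
    """Single-head sliding-window attention over a dilated k x k neighbourhood.
    q, k, v: (B, C, H, W); typically 1x1-convolution projections of one feature map."""
    B, C, H, W = q.shape
    pad = dilation * (kernel - 1) // 2
    # gather the dilated neighbourhood of keys/values around every position
    k_unf = F.unfold(k, kernel, dilation=dilation, padding=pad).view(B, C, kernel * kernel, H * W)
    v_unf = F.unfold(v, kernel, dilation=dilation, padding=pad).view(B, C, kernel * kernel, H * W)
    q = q.view(B, C, 1, H * W)
    attn = (q * k_unf).sum(dim=1, keepdim=True) / C ** 0.5   # (B, 1, k*k, HW) similarity scores
    attn = attn.softmax(dim=2)                               # normalize over the neighbourhood
    out = (attn * v_unf).sum(dim=2)                          # weighted sum of values
    return out.view(B, C, H, W)
```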

In the multiscale dilated attention section, further experiments were conducted to verify the effect of changing the dilation rates on the performance of the AFHTrans module. The results are shown in Table 13. The first and second rows, with dilation rates of 2, 4, 4, 6 and 2, 4, 6, 6, correspond to accuracies of 81.55% and 81.62%, respectively, while the third row, with dilation rates of 2, 4, 6, and 8, achieves the highest mAP of 81.79%; the final setting of 2, 4, 8, and 8 decreases the accuracy by 0.08% compared with the previous row. As with the convolution kernel and group sizes discussed above, dilation rates spanning a wider range of distinct values effectively enlarge the receptive field and capture more feature information, and an appropriate choice of dilation rates improves the model. Our model therefore uses the best-performing dilation rates of 2, 4, 6, and 8 in the multiscale dilated attention.

Table 13 Variation in the dilation rates of the multi-scale dilated attention in the AFHTrans module
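Building on the single-window sketch above, the multi-scale variant can be approximated by splitting the channels into four groups and applying a different dilation rate to each, here the best-performing setting of 2, 4, 6, and 8; the equal channel split is an assumption for illustration.

```python
import torch

def multiscale_dilated_attention(q, k, v, rates=(2, 4, 6, 8), kernel=3):
    """Split channels into len(rates) groups and run the dilated sliding-window
    attention sketched above with a different dilation rate per group
    (channel count must be divisible by len(rates))."""
    outs = []
    for qs, ks, vs, r in zip(q.chunk(len(rates), dim=1),
                             k.chunk(len(rates), dim=1),
                             v.chunk(len(rates), dim=1), rates):
        outs.append(dilated_window_attention(qs, ks, vs, kernel=kernel, dilation=r))
    return torch.cat(outs, dim=1)
```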

Experiments were also conducted on the different connections used in the information injection section of the AFHTrans module. Table 14 shows that the accuracy is 79.97% when the low and middle levels are connected and rises to 80.72% when the low and high levels are connected. The best accuracy of 81.79% is achieved when the middle and high levels are connected, so this injection scheme is used in our model. Connecting all three levels, however, decreases the accuracy by 0.68%. The table shows that connecting more levels does not necessarily yield a larger gain; rather, the model improves when higher, more semantically precise levels are connected. We attribute this mainly to the interaction of the middle and high levels with the FPN.

Table 14 Experimental comparison of different level connections
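As a rough illustration of injecting the middle- and high-level features, the sketch below resizes the high-level map to the middle level's resolution, projects both with 1×1 convolutions, and sums them before they enter the AFHTrans/FPN stage; the resizing target and the additive fusion are assumptions made for illustration, not the paper's confirmed design.

```python
import torch.nn as nn
import torch.nn.functional as F

class MidHighInjection(nn.Module):
    """Fuse middle- and high-level backbone features before the FPN (illustrative)."""
    def __init__(self, mid_channels, high_channels, out_channels):
        super().__init__()
        self.proj_mid = nn.Conv2d(mid_channels, out_channels, 1)
        self.proj_high = nn.Conv2d(high_channels, out_channels, 1)

    def forward(self, feat_mid, feat_high):
        high_up = F.interpolate(feat_high, size=feat_mid.shape[-2:],
                                mode="nearest")           # match the mid-level size
        return self.proj_mid(feat_mid) + self.proj_high(high_up)
```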

To investigate the effect of the location of the AFHTrans module within the FPN, experiments placed the module at different positions and compared the results. Placing the module at the front end of the FPN yields an accuracy of 81.79%, whereas placing it at the back end decreases the accuracy to 80.90%, as shown in Table 15. Placing the module at the front end thus effectively extracts rich semantic information and facilitates further feature fusion in the FPN, whereas at the back end the module operates on already fused information that contains useless and redundant content, reducing its ability to obtain valid features. The AFHTrans module is therefore placed at the front end of the FPN.

Table 15 Effect of different placements of the AFHTrans module on the experimental results

Finally, the transformer part of DETR is very similar to the original transformer, so a comparison experiment was set up against the DETR transformer (including the encoder and decoder of DETR) to test the effectiveness of AFHTrans in reducing the computational cost and improving the speed; the results are shown in Table 16. The first row, using the DETR transformer, achieves an accuracy of 81.67% with a computational cost of 193 GFLOPs and a speed of 12.51 FPS. In the second row, replacing it with AFHTrans increases the accuracy by 0.12%, reduces the computational cost by 3 GFLOPs, and improves the speed by 0.47 FPS. AFHTrans is thus effective at optimizing the computational cost and speed of the transformer.

Table 16 Experimental comparison of the DETR transformer and AFHTrans

FE-Head

The FE-Head not only has task interaction capability but also a powerful ability to acquire global feature information. Its main parts are the feature interaction extractor and the feature extractor. The feature interaction extractor in TOOD has already been proven effective, so it is not studied experimentally here; the feature extractor is analysed in the following ablation experiments.

The effects of Conv_key and Conv_value in the feature extractor, as well as of multiscale training and random flipping, on the experimental results are explored. As shown in Table 17, the first row reaches an mAP of 81.10% with 36.34 M parameters when only Conv_key is used, and the second row reaches 80.22% with 36.26 M parameters when only Conv_value is used. In the third row, when both Conv_key and Conv_value are used, the precision reaches 81.79% while the parameter count rises by only 0.15 M. This confirms that Conv_key and Conv_value provide a significant boost with very few additional parameters and facilitate the extraction of feature context information. The following two rows add random flipping and multiscale data augmentation in turn, yielding accuracies of 82.39% and 82.76%, respectively; data augmentation thus enriches the training data and improves the accuracy of our model. The subsequent experiments use the random-flip data augmentation method.

Table 17 Ablation experiments on Conv_key and Conv_value in the feature extractor and the impact of data augmentation on the experimental results
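One plausible reading of Conv_key and Conv_value, sketched below, follows a global-context style of attention: a 1×1 Conv_key produces a spatial attention map normalized by Softmax over all positions, and a 1×1 Conv_value projects the features aggregated under that map into a context vector added back to every position. The exact roles of the two convolutions in FE-Head may differ; this is illustrative only.

```python
import torch
import torch.nn as nn

class KeyValueContext(nn.Module):
    """Illustrative global-context extractor built from a key conv and a value conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv_key = nn.Conv2d(channels, 1, 1)            # spatial attention logits
        self.conv_value = nn.Conv2d(channels, channels, 1)   # value projection

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        attn = self.conv_key(x).view(b, 1, h * w).softmax(dim=-1)   # where to look
        value = self.conv_value(x).view(b, c, h * w)                # what to aggregate
        context = (value * attn).sum(dim=-1).view(b, c, 1, 1)       # global context vector
        return x + context                                          # inject into every position
```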

The multibranch task-decoupled head of FCOS (FCOS-Head [12]) can effectively suppress low-quality bounding boxes far from the target location without introducing any hyperparameters. The Task-Aligned Head (T-Head) of TOOD [29] improves the coordination of the classification and localization tasks while enhancing task interactivity, and it applies task alignment learning to adjust the anchor points for accurate object detection. FCOS-Head, T-Head, and FE-Head are all anchor-free detection heads. To validate the effectiveness of the FE-Head, it was compared experimentally with these two detection heads; the results are shown in Table 18. In the first row, the accuracy of FE-Head is 2.74% higher than that of FCOS-Head with 0.19 M fewer parameters; in the second row, the precision of FE-Head is 0.11% higher than that of T-Head with only 0.09 M more parameters. The detections of FCOS-Head, T-Head, and FE-Head at a confidence threshold of 0.5 are visualized in Fig. 8. The cars and people in the first and second columns show varying degrees of missed detections and low category confidence for FCOS-Head (first row) and T-Head (second row), whereas FE-Head (third row) detects them accurately (marked by gray circles in Fig. 8). This confirms the effectiveness of the FE-Head in extracting global contextual information from image objects: although its parameter count is slightly higher than that of T-Head, it achieves satisfactory detection results.

Table 18 Experimental comparison of the FCOS-Head, T-Head, and FE-Head
Fig. 8 The top, middle, and bottom rows show the detection results using FCOS-Head, T-Head, and FE-Head, respectively, at a confidence threshold of 0.5

Combination of the SCEFA, AFHTrans module and FE-Head

Combining the SCEFA module, the AFHTrans module, and the FE-Head yields GLFTNet, which achieves excellent detection results. The entire AFHTrans section consists of the AFHTrans module and the FPN; within it, the AFHTrans module, which combines convolution and the transformer, plays a key role in global and local feature extraction. To demonstrate the effect of the three modules, different experimental setups (shown in Tables 19 and 20) verify that using the SCEFA module, the AFHTrans module, or the FE-Head improves detection performance. Table 19 shows that, compared with the baseline, the SCEFA module improves the mAP by 1.38% and the speed by 0.23 FPS, the AFHTrans module increases the mAP by 1.42% and the speed by 2.90 FPS, and the FE-Head increases the mAP by 1.45% and the speed by 0.52 FPS. Each module thus produces a clear improvement in accuracy and speed, validating its effectiveness. When the three modules are combined, the detection accuracy reaches 82.39% with only a slight decrease of 0.68 FPS in speed, so our model achieves a balance of speed and accuracy. The 2.38% accuracy improvement over the baseline demonstrates that our method effectively improves the detection performance of the detector.

Table 19 Ablation experiments on the PASCAL VOC2007 + 2012 dataset with the SCEFA module, the AFHTrans module, and the FE-Head
Table 20 Comparison of the results of adding the SCEFA module, the AFHTrans module, and the FE-Head in turn on the COCO2017 validation set

To further validate the effect, the detections of the baseline (TOOD) and of the models with the SCEFA module, the AFHTrans module, and the FE-Head added were visualized in detail on the PASCAL VOC2007 + 2012 validation set. In the first row of Fig. 9, the detection accuracy of the airplane improves significantly, showing better detection of medium-sized objects. The small object of a person riding a horse in the second row (marked by a red circle) is detected correctly, improving small-object detection. The people and chairs in the third row (marked by red circles and ellipses) are detected correctly, alleviating missed detections in complex scenes. The lamb in the fourth row (marked by a red circle) is detected correctly, further validating the adaptability of our model to objects of different scales. Overall, the detection accuracy improves compared with the baseline. Combining the three modules thus plays an active and effective role, and our model is competitive in detection applications.

Fig. 9 Visualized results of the ablation experiments. a Baseline, b SCEFA + Baseline, c AFHTrans + Baseline, d FE-Head + Baseline, e GLFTNet

In addition, the SCEFA module, the AFHTrans module, and the FE-Head were added in turn on the COCO2017 validation set to further test their effectiveness. As shown in Table 20, adding the SCEFA module, which extracts multiscale local channel features, increases the accuracy by 1.0% over the baseline. Subsequently adding the AFHTrans module, which performs global and local feature extraction and establishes long-range dependencies between object pixels, improves the result by a further 0.5%. Finally, adding the FE-Head, which effectively acquires feature context information, yields another 0.2% gain. These results fully verify the validity of our modules.

Comparison with other object detectors on the COCO2017 dataset

To further validate the effectiveness of our model, comparative experiments with other algorithms were conducted on the COCO2017 dataset. The model trained on the large-scale COCO2017 dataset includes the SCEFA module (which improves the extraction of high-level local feature map information), the AFHTrans module (which enhances global and local feature extraction and establishes long-range dependencies between object pixels), and the FE-Head (which extracts global feature context information at the head); these three modules are combined into the new GLFTNet detection model. As shown in Fig. 1, when GLFTNet is compared with several classical networks, our model clearly takes the lead. When our model is trained with multiple scales (480–800), Table 21 details the performance of our detection network for each setting. With the ResNet50 and ResNet101 backbones, GLFTNet achieves 44.3% and 47.0% AP on the COCO2017 test-dev set, respectively, outperforming the state-of-the-art detectors Sparse R-CNN [63], Anchor-DETR [64], and DAB-DETR [65]. Compared with the other detectors, our model makes greater progress on the AP metrics at different IoU thresholds, effectively validating its improved global and local feature extraction. Its parameter count and computation are also reduced relative to the baseline network and are competitive among the other detection methods.

Table 21 Comparison of GLFTNet with several state-of-the-art methods on the COCO2017 dataset
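Multi-scale training over the 480–800 range is commonly implemented by randomly resizing the shorter image side at every iteration; the sketch below assumes that convention, with the 32-pixel scale step and the 1333-pixel cap on the longer side being conventional detection defaults rather than values confirmed by the paper.

```python
import random
import torch.nn.functional as F

def random_resize_shorter_side(image, short_sides=tuple(range(480, 801, 32)), max_long=1333):
    """Randomly pick a target shorter side in [480, 800] and resize the image,
    capping the longer side. `image` is a (C, H, W) float tensor."""
    _, h, w = image.shape
    short = random.choice(short_sides)
    scale = short / min(h, w)
    if max(h, w) * scale > max_long:             # keep the longer side within the cap
        scale = max_long / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    return F.interpolate(image[None], size=(new_h, new_w),
                         mode="bilinear", align_corners=False)[0]
```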

The visualized results of our method, TOOD, and DAB-DETR on the COCO test set are compared in Fig. 10. In the first column, TOOD detects only the soccer ball and DAB-DETR detects only the people, whereas our method correctly detects both (marked by the yellow circle and ellipse), showing that it is clearly effective for small and overlapping objects. This benefit comes from the improved feature extraction, which enhances pixel discrimination within object regions. The backpack in the background of the second column (marked by the yellow ellipse) is missed by both TOOD and DAB-DETR but is detected by our method, again validating its effectiveness. Compared with TOOD, the overall detection accuracy of our model is higher, clearly surpassing the original detection algorithm.

Fig. 10 The top, middle, and bottom pairs of images compare the detection results of TOOD, DAB-DETR, and GLFTNet (ours), respectively

Conclusion

In this paper, the local and global information extraction of object pixels is enhanced from micro- and macro-perspectives, respectively. The acquisition of image feature information is crucial to model performance, and models are prone to misdetections and omissions when detecting objects of different scales and occlusions. Motivated by this problem, this paper proposes the SCEFA module to enhance, from a microscopic perspective, the local extraction of multiscale channel information in high-level feature maps. AFHTrans is presented from a macro perspective: it first enhances the interaction between the feature information of each layer while reducing the computational cost, then combines self-attention with convolution to strengthen global and local feature extraction, and finally establishes long-distance dependencies between object pixels. From an overall perspective, the FE-Head is proposed, which not only further enhances the feature extraction of the detection head but also improves the acquisition of global contextual information. The resulting method accurately detects objects under complex conditions such as differing scales, overlaps, and similar colours, and it achieves the best detection results compared with other advanced methods. However, the methodology of this paper still has limitations. On the one hand, the detection accuracy for particularly small and heavily occluded objects still needs to improve. On the other hand, our model was only tested on the PASCAL VOC2007 + 2012 and COCO2017 datasets, and its performance may be difficult to sustain on highly variable datasets. Therefore, the proposed method needs to be continuously optimized and its generalization ability expanded to achieve superior results both on these experimental datasets and on additional datasets.