1 Introduction

In object detection, handling objects at multiple scales has long been a key and challenging problem. Traditional single-scale detection methods struggle with objects whose scales differ widely, which readily leads to missed detections of small objects or inaccurate localization of large ones. Accurately detecting objects at different scales requires overcoming several challenges.

First, multi-scale object detection must handle variation in object scale: objects appear in images at different sizes and proportions, so the algorithm needs good scale invariance to detect objects accurately at every scale. Second, it must exploit the semantic information and contextual relationships of objects: semantic information helps the algorithm understand an object's features and morphology, while contextual relationships provide additional cues for localization and classification. Finally, computational efficiency and model complexity must be considered: while maintaining detection accuracy, the algorithm should be fast enough for scenarios with strict real-time requirements, and model complexity should be kept within a reasonable range to avoid excessive consumption of computational resources and memory.

To address the challenge of detecting objects at various scales, Lin et al. [1] proposed the Feature Pyramid Network (FPN) in 2017, which extracts features at distinct scales for detection by constructing a top-down hierarchical structure. FPN, however, fuses features only in the top-down direction. PANet, proposed by Liu et al. [2], was the first model to add a bottom-up secondary fusion: through bottom-up path enhancement, the valuable information in each feature layer is transmitted directly to the subsequent region proposal network. Ghiasi et al. [3] then proposed NAS-FPN, a feature pyramid structure composed of combined top-down and bottom-up connections for cross-scale feature fusion. Qiao et al. [4] introduced the Recursive Feature Pyramid, which feeds supplementary feedback connections from the feature pyramid back into the bottom-up backbone layers. Liu et al. [5] proposed a novel pyramid feature fusion strategy, Adaptive Spatial Feature Fusion (ASFF), which resolves inconsistency by learning to spatially filter conflicting information, thereby enhancing the scale invariance of features. Subsequently, Tan et al. proposed EfficientDet [6], which achieves high-precision detection with fewer floating-point operations. To meet varying accuracy needs, the model is scaled from D0 to D7, and a BiFPN performs multi-layer feature fusion on feature maps drawn from different backbone levels. EfficientDet also borrows and improves the feature extraction network of EfficientNet, which achieves efficient and accurate image classification by using compound coefficients to scale the depth, width and resolution of the network. EfficientDet thus strikes a balance between precision and efficiency. However, compared with the larger members of the family, such as EfficientDet-D6 and EfficientDet-D7, the lower versions have relatively low accuracy because of their smaller network structure and parameter scale, while the higher versions, despite their higher accuracy, demand far more computing resources as the network and parameter scale grow. Consequently, for tasks requiring high accuracy under limited computing power, the existing lower versions of EfficientDet cannot meet the demand. To tackle this issue, the SpanEffiDet algorithm is proposed in this paper.

This paper makes the following key contributions:

  • (1) We propose a Channel Adaptive Frequency Filtering (CAFF) module and embed it between adjacent levels of the backbone network to refine the channel features of EfficientNet. CAFF transforms the channel information into the frequency domain via a Fourier transform, and its semantic-adaptive frequency filtering effectively extracts key features while eliminating redundant information in EfficientNet. The module also computes weights across channels and granularities to capture both fine and coarse details in the features.

  • (2) A novel feature pyramid network, the Multilevel Cross bi-directional feature pyramid network (MLC-BiFPN), is proposed for multi-layer, multi-node feature fusion learning. By transferring information across levels, it exploits the semantic relationships and spatial information of the distinct levels to the fullest extent and integrates their features more efficiently, further improving the network's target detection performance.

  • (3) Drawing on the idea of Generalized Focal Loss V2 (GFLV2), a Distribution-Guided Quality Predictor (DGQP) is introduced that predicts reliable Localization Quality Estimation (LQE) scores from the statistics of the bounding-box distribution, thus better reflecting the positional accuracy of target boundaries. GFLV2 handles class imbalance and hard samples better than the traditional Focal Loss.

  • (4) We conducted a thorough assessment of the proposed method and executed experiments on three datasets, namely MS COCO, PASCAL VOC2007 and PASCAL VOC2012. The obtained experimental outcomes offer robust evidence to substantiate the remarkable advantages of SpanEffiDet.

    The remainder of this paper is structured as follows: Sect. 2 introduces a brief summary of related work. Section 3 provides a comprehensive explanation of the Channel Adaptive Frequency Filter, Multi-level cross BiFPN, and Generalized Focal Loss V2. Section 4 carries out an examination of the experimental results. Section 5 gives the conclusions reached in this paper.

2 Related Work

2.1 EfficientDet Network Structure

EfficientDet is considered one of the cutting-edge target detection algorithms, known for its simple structure and excellent performance. EfficientDet scales the resolution, depth, and width of the model according to resource constraints, providing eight different versions (from D0 to D7) to meet the detection needs of different scenarios. As depicted in Fig. 1, the model comprises an EfficientNet backbone, a bi-directional feature pyramid network (BiFPN), and a network for box and class predictions. Through this combination, EfficientDet achieves efficient performance in target detection tasks and an equilibrium between speed and precision.

Fig. 1 The architectural framework of EfficientDet

2.2 Backbone Network

EfficientNet [7] is a classification network proposed in 2019. Increasing the network's depth, widening the feature layers, and raising the input image's resolution can all improve detection accuracy [8,9,10]. Nonetheless, they also multiply the network parameters and the computational cost. To increase the efficiency of target detection, it is therefore important to balance the three dimensions of network width, depth and resolution. EfficientNet cleverly combines these three factors in a novel model scaling strategy. First, EfficientNet widens the baseline network by employing more convolution kernels in each convolutional layer, expanding the number of feature-map channels. Second, the depth of the network is increased by adding more layers to the baseline network. Then, the input image's resolution is raised, increasing the height and width of each feature map accordingly. Finally, the width, depth and resolution of the baseline network are scaled simultaneously. Combining these scaling dimensions yields accuracy improvements at an equivalent number of FLOPs, and the experimental results indicate that jointly scaling all dimensions is the best strategy. The scaling is driven by neural architecture search [11], which finds the optimal compound coefficients. Through this strategy, EfficientNet achieves better detection performance under constrained computational resources, as shown in Eqs. (1)-(4):

$$ depth:d = \alpha^{\phi } $$
(1)
$$ width:w = \beta^{\phi } $$
(2)
$$ resolution:r = \gamma^{\phi } $$
(3)
$$ s.t.\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2 $$
(4)

where \(\alpha\) represents the depth scaling coefficient of the network, which controls the number of layers added to the baseline structure, thereby increasing the network's depth; \(\beta\) represents the width scaling coefficient, which controls the number of convolution kernels used in each convolution layer, thereby increasing the number of feature-map channels; \(\gamma\) represents the resolution scaling coefficient, which controls the resolution of the input image and thus the height and width of each feature map; \(\phi\) is a scaling factor regulating the allocation of computational resources for model scaling; and \({\text{s}}.t.\) denotes the constraint condition.

By neural architecture search (NAS), the optimal parameters for EfficientNet-B0 are \(\alpha = 1.2\), \(\beta = 1.1\) and \(\gamma = 1.15\); the scaling coefficient \(\phi = 0\) recovers the optimal basic model EfficientNet-B0. Gradually increasing \(\phi\) simultaneously increases the width, depth and resolution of the basic model, which enlarges the model and improves its capability accordingly, at the cost of more computational resources. EfficientDet adopts the design philosophy of EfficientNet together with its model scaling strategy: it applies compound scaling to EfficientNet, BiFPN, and the box/class prediction networks, yielding eight versions, EfficientDet-D0 to D7. This hybrid scaling strategy enables EfficientDet to flexibly adjust the model structure to different tasks and scenarios according to resource constraints and detection requirements, achieving a balance between performance and computational efficiency.
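To make the compound scaling rule concrete, the following minimal Python sketch (our illustration, not code from the EfficientNet release; the base values and rounding rules are assumptions) computes the depth, width, and resolution settings implied by Eqs. (1)-(4) for a given \(\phi\):

```python
# Minimal sketch of EfficientNet-style compound scaling (Eqs. (1)-(4)).
# alpha/beta/gamma are the NAS-found coefficients for EfficientNet-B0;
# the base values and rounding rules below are illustrative assumptions.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth / width / resolution

def compound_scale(phi: int, base_depth: int = 1, base_width: int = 32,
                   base_resolution: int = 224):
    d = ALPHA ** phi   # depth multiplier,       Eq. (1)
    w = BETA ** phi    # width multiplier,       Eq. (2)
    r = GAMMA ** phi   # resolution multiplier,  Eq. (3)
    return (round(base_depth * d),                 # layers per stage
            int(round(base_width * w / 8)) * 8,    # channels, kept divisible by 8
            int(round(base_resolution * r)))       # input resolution

for phi in range(4):  # rough analogues of the B0..B3 scaling steps
    print(phi, compound_scale(phi))
```

Since \(\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2\) (Eq. (4)), each increment of \(\phi\) roughly doubles the FLOPs of the scaled model.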

2.3 BiFPN Model

The traditional top-down FPN [1] is limited by its unidirectional information flow. To overcome this limitation, PANet [2] adds a bottom-up path aggregation network to FPN. Ghiasi et al. [3] proposed NAS-FPN, which finds an improved cross-scale feature network topology through neural architecture search, but this method requires considerable GPU computing time, and the resulting network structure is irregular and difficult to interpret or modify. FPN, PANet and NAS-FPN have all been widely adopted for multiscale feature fusion. Nonetheless, these methods combine feature maps directly during fusion without fully accounting for the individual contributions of features at different resolutions to the output. To address this issue, Tan et al. proposed BiFPN. As shown in Eq. (5), the importance of each input feature is determined by computing learnable weights for the different feature levels. Specifically, BiFPN combines top-down and bottom-up bi-directional connections to fuse features at each level of the pyramid and introduces learnable weights to adjust the contribution of each level. This design lets the network flexibly allocate fusion weights according to the characteristics of the input data and thus use multiscale feature information more effectively.

$$ O = \sum\nolimits_{i} \frac{\omega _{i} }{\varepsilon + \sum\nolimits_{j}\omega _{j} } \cdot I_{i}$$
(5)

Here, \(\omega_{i}\) denotes a learnable weight in the feature fusion process (with \(\omega_i \ge 0\) enforced by a ReLU activation); the simple normalization in Eq. (5) keeps each normalized weight between 0 and 1. To prevent training instability, \(\varepsilon = 0.0001\). In this way, the network can dynamically learn and adjust the importance of each feature level, effectively trading off information during feature fusion.
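As a concrete illustration of Eq. (5), a minimal PyTorch sketch of fast normalized fusion (ours, not the EfficientDet reference code) is:

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Sketch of BiFPN's fast normalized fusion (Eq. (5)): learnable
    per-input weights, kept non-negative with ReLU and normalized by
    their sum plus a small epsilon."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):  # inputs: list of same-shape feature maps
        w = torch.relu(self.weights)      # enforce w_i >= 0
        w = w / (w.sum() + self.eps)      # normalize to roughly [0, 1]
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse = FastNormalizedFusion(2); out = fuse([p6_td, p7_resized])
```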

3 The Overall Architecture of the SpanEffiDet Algorithm

In view of the limitations of the lower versions of EfficientDet, this paper proposes SpanEffiDet, whose main components are as follows. (1) A filtering module, CAFF, which transforms channel information into the frequency domain and applies semantic-adaptive frequency filtering to eliminate redundant information in EfficientNet. (2) A novel feature pyramid network, MLC-BiFPN, which fully exploits the semantic relationships and positional information between levels for feature fusion learning across levels and nodes; it strengthens the information exchange and fusion between features of different levels, integrates them more efficiently, and thereby extracts target features more effectively. (3) An improved loss function based on GFLV2. The DGQP in GFLV2 predicts reliable LQE scores from the statistics of the bounding-box distribution, which evaluates localization accuracy more precisely and copes better with imbalanced data and difficult samples; compared with the traditional Focal Loss, GFLV2 adapts better to different detection scenarios. The improved network is called SpanEffiDet, and its overall architecture is depicted in Fig. 2.

Fig. 2 Structure of SpanEffiDet. "CAFF" is the channel-adaptive frequency filtering module proposed in this paper; the different colors of CAFF represent filtering operations on channel information between different levels of the backbone network

3.1 Channel Adaptive Frequency Filter block

Several methods [12,13,14] reduce the range of token mixing in different ways, which lowers cost but also decreases the network's expressive capacity. Exploiting the associativity of matrix products [15] and low-rank approximation [16] can likewise reduce the complexity of the matrix operations in self-attention. However, these approaches degrade detection results and cannot fully exploit the expressive power of the model. Inspired by the convolution theorem in signal processing [17], we observe that a convolution can be realized as a Hadamard product after a Fourier transform of the tokens in the latent space. In the adaptive instance-mask processing of CAFF, channel tokens are first transformed into the Fourier domain, and frequency filtering then acts as a large-kernel dynamic convolution for token mixing. Thanks to the Fast Fourier Transform (FFT), the complexity of token mixing drops from \({\mathcal{C}}\left( {W^{2} } \right)\) to \({\mathcal{C}}\left( {W\log W} \right)\), enhancing computational efficiency.

CAFF transfers the latent elements (i.e., a group of tokens) within the channels from their original latent space to a frequency space through a spatial two-dimensional discrete Fourier transform. This yields a frequency representation in which spatial positions correspond to different frequency components. CAFF dynamically derives instance masks from these frequency representations and then computes the Hadamard product between the derived mask and the frequency representation, performing semantic-adaptive frequency filtering. The filtered tokens are returned to the original latent space by an inverse Fourier transform; consistent with the convolution theorem, the result can be regarded as the tokens mixed by a large dynamic convolution kernel. The specific composition of CAFF is illustrated in Fig. 3.

Fig. 3 Composition of CAFF. "Filtering" denotes the frequency-domain transformation filtering of the channel information

Conditional channel weighting in CAFF dynamically adjusts computational weights between channels to better capture potential feature correlations. Additionally, the conditional channel weighting mechanism reduces the impact of redundant information on channels when processing channel information in CAFF, further enhancing the efficiency and expressive capacity of the model.

The input of the neural network is considered a feature tensor \({\mathcal{Y}} \in {\mathbb{R}}^{H \times W \times C}\) with resolution \(H \times W\) and channel count \(C\), where each latent element is represented as \(y \in {\mathbb{R}}^{1 \times 1 \times C}\). Within a neighborhood \(U\left( {y^{m} } \right)\), the token-mixing update of \(y^{m}\) is:

$$\hat{y}^{m} = \sum\limits_{{j \in U\left( {y^{m} } \right)}} {\varphi ^{{j \to m}} \otimes \psi \left( {y^{j} } \right)}$$
(6)

Here, \(\hat{y}^{m}\) represents the updated \(y^{m}\), \(y^{j}\) signifies the potential elements in \(U\left( {y^{m} } \right)\), \(\psi \left( \cdot \right)\) is an embedded function, and \(\varphi^{j \to m}\) represents the weights for information fusion during the process of elements transitioning from being labeled as \(y^{j}\) to being updated as \(y^{m}\). The symbol \(\otimes\) denotes the Hadamard product or matrix multiplication.

To simplify the exposition, we mix tokens for the channel elements through a global convolution, denoted \({\hat{\mathbf{Y}}} = {\mathcal{O}} * {\mathbf{Y}}\). For any channel position element \({\mathbf{Y}}(h,w)\), Eq. (6) can be written as:

$$\hat{Y}(h,w) = \sum\limits_{{h^{\prime} = - \left[ {\frac{H}{2}} \right]}}^{{\left[ {\frac{H}{2}} \right]}} {\sum\limits_{{w^{\prime} = - \left[ {\frac{W}{2}} \right]}}^{{\left[ {\frac{W}{2}} \right]}} {{\mathcal{O}}(h^{\prime},w^{\prime})Y(h - h^{\prime},w - w^{\prime})} }$$
(7)

Here, \({\hat{\mathbf{Y}}}\left( {h,w} \right)\) represents the elements updated through the blending of token elements \({\mathbf{Y}}\left( {h,w} \right)\). \(H\) and \(W\) represent the height and width of the input tensor, respectively. \({\mathcal{O}}(h^{\prime},w^{\prime})\) is the weight value of the element during the mixing process, which is achieved through a global convolution kernel with a spatial size identical to that of \({\text{Y}}\).

In CAFF, token mixing is a globally semantic-adaptive process, and the mixing weights \({\mathcal{O}}\) adapt to \({\mathbf{Y}}\) with a spatial extent no smaller than that of \({\mathbf{Y}}\). Such adaptivity is typically achieved with dynamic convolution kernels [18] and large-kernel convolutions, but these incur expensive computation. We therefore introduce a more efficient alternative based on the convolution theorem [19].

The convolution theorem asserts that a convolution operation performed in one domain is mathematically equivalent to the Hadamard product in its respective Fourier domain. Building upon this theoretical foundation, we introduce the CAFF block, which purifies the channel elements of the backbone network by performing frequency domain filtering operations on EfficientNet. For elements \({\mathbf{Y}} \in {\mathbb{R}}^{H \times W \times C}\) in the latent space, we employ FFT with \({\mathbf{Y}}_{F} = {\mathcal{F}}\left( {\mathbf{Y}} \right)\) to obtain the corresponding frequency \({\mathbf{Y}}_{F}\).

$$ Y_{F} \left( {\alpha ,\beta } \right) = \sum\limits_{h = 0}^{H - 1} {\sum\limits_{w = 0}^{W - 1} {Y(h,w)\,e^{{ - 2\pi i\left( {\frac{\alpha h}{H} + \frac{\beta w}{W}} \right)}} } } $$
(8)

Here, features from diverse spatial locations in \({\mathbf{Y}}_{F}\) correspond to distinct frequency components of \({\mathbf{Y}}\), encompassing the overall information of \({\mathbf{Y}}\), employing a transformation with complexity \({\mathcal{C}}(W{\text{log}}W)\).
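For readers unfamiliar with the convolution theorem invoked here, a short numerical check (a 1-D toy example, not part of CAFF itself) confirms that circular convolution in the original domain matches the Hadamard product in the Fourier domain:

```python
import torch

# Verify the convolution theorem in 1-D: circular convolution in the
# original domain equals an elementwise (Hadamard) product in the
# Fourier domain.
x, k = torch.randn(8), torch.randn(8)
n = torch.arange(8)
# direct circular convolution: c[j] = sum_m x[m] * k[(j - m) mod 8]
direct = torch.stack([(x * k[(j - n) % 8]).sum() for j in range(8)])
# via the convolution theorem
via_fft = torch.fft.ifft(torch.fft.fft(x) * torch.fft.fft(k)).real
assert torch.allclose(direct, via_fft, atol=1e-5)
```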

After the Fourier transform, effective global token mixing of \({\mathbf{Y}}\) is achieved by filtering its frequency representation \({\mathbf{Y}}_{F}\) with adaptive instance masks. The filtered \({\mathbf{Y}}_{F}\) then undergoes an inverse Fourier transform, yielding the updated elemental features \({\hat{\mathbf{Y}}}\) in the latent space. The procedure can be written as:

$$ \hat{Y} = {\mathcal{F}}^{ - 1} \left[ {{\mathcal{N}}\left( {{\mathcal{F}}\left( Y \right)} \right) \otimes {\mathcal{F}}\left( Y \right)} \right] $$
(9)

where \({\mathcal{N}}\left( {{\mathcal{F}}\left( {\mathbf{Y}} \right)} \right)\) is the mask tensor obtained from \({\mathbf{Y}}_{F}\), with the same size as \({\mathbf{Y}}_{F}\). \({\mathcal{N}}\left( \cdot \right)\) is realized by a set of linear layers, a ReLU function, and another set of linear layers. The symbol \(\otimes\) denotes the Hadamard product, and \({\mathcal{F}}^{ - 1} \left( \cdot \right)\) is the inverse Fourier transform.
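A minimal PyTorch sketch of this pipeline follows (our simplified reading of Eq. (9)); the 1×1-convolution mask network and its reduction ratio are stand-ins for the linear-ReLU-linear mask generator \({\mathcal{N}}\left( \cdot \right)\) described above:

```python
import torch
import torch.nn as nn

class CAFFBlock(nn.Module):
    """Sketch of channel-adaptive frequency filtering (Eq. (9)):
    FFT -> adaptive instance mask -> Hadamard product -> inverse FFT.
    The mask net is a simplified stand-in for N(.) in the paper."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mask_net = nn.Sequential(   # N(.): linear -> ReLU -> linear
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:  # y: (B, C, H, W)
        y_f = torch.fft.fft2(y, dim=(-2, -1))   # to the frequency domain
        mask = self.mask_net(torch.abs(y_f))    # instance mask from the spectrum
        y_f = y_f * mask                        # semantic-adaptive filtering
        return torch.fft.ifft2(y_f, dim=(-2, -1)).real  # back to latent space

# x = torch.randn(1, 64, 32, 32); out = CAFFBlock(64)(x)  # same shape as input
```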

3.2 Construction of the Neck Network MLC-BiFPN

BiFPN is the feature pyramid network introduced in EfficientDet; it further enhances the feature fusion mechanism of PANet. PANet [2] adds a top-down and bottom-up integration process to FPN [1] to fuse multiscale features effectively, and BiFPN optimizes PANet further: nodes with only a single input edge are eliminated and cross-node connections are introduced, strengthening feature fusion. In addition, BiFPN stacks its fusion module iteratively, achieving more efficient feature fusion without adding much computational complexity. Figure 4 shows the details of the structure and design of BiFPN.

Fig. 4 Structure of BiFPN

For the feature fusion process of BiFPN in EfficientDet, Tan et al. argue that each node influences the whole feature network to a different degree, so simple additive fusion cannot express its importance. They therefore proposed "fast normalized fusion", which assigns each node a learned weight and applies these weights in the feature fusion stage. The fast normalized fusion formula is given in Eq. (5); Eqs. (10) and (11) show examples in which each node has its own weights.

$$ P_{6}^{td} = Conv\left( {\frac{{\omega_{1} \cdot P_{6}^{in} + \omega_{2} \cdot {\text{Resize}}\left( {P_{7}^{in} } \right)}}{{\omega_{1} + \omega_{2} + \varepsilon }}} \right) $$
(10)
$$ P_{7}^{out} = Conv\left( {\frac{{\omega_{1}^{\prime} \cdot P_{7}^{in} + \omega_{2}^{\prime} \cdot {\text{Resize}}\left( {P_{6}^{out} } \right)}}{{\omega_{1}^{\prime} + \omega_{2}^{\prime} + \varepsilon }}} \right) $$
(11)

Here, \(P_{6}^{td}\) is the intermediate feature of the sixth level in the top-down pathway, and \(P_{7}^{out}\) is the output feature of the seventh level in the bottom-up pathway; the remaining features are fused in the same way. In Eq. (10), \(P_{6}^{td}\) is the weighted sum of \(P_{6}^{in}\) (weight \(\omega_{1}\)) and the resized \(P_{7}^{in}\) (weight \(\omega_{2}\)), normalized by the sum of \(\omega_{1}\), \(\omega_{2}\) and \(\varepsilon\).

Based on the idea of BiFPN, this paper adds span-level information transmission, letting information at different levels interact so that semantic and geometric information is conveyed comprehensively across levels and the feature representations of upper and lower layers are transmitted effectively. In this process, high-resolution feature maps are reused, which we believe is crucial for detection performance. Specifically, AvgPool is used to obtain global contextual information, which is rich in semantics and captures the content and context of the image as a whole. A sigmoid activation then converts this global context into weights, so that low-level features benefit from the guidance of the overarching context and the expressive capability of the features is further enhanced. This combination of global contextual information and the sigmoid function effectively facilitates feature map generation.

In addition, to balance the information fed into the feature nodes, MLC-BiFPN adds both top-down and bottom-up span-type information transmission. Top-down transmission ameliorates the deficiency of semantic information at the high-level nodes, augmenting the features' semantic expressiveness; bottom-up transmission supplements the missing positional information of the underlying nodes and helps preserve the geometric details of the features. In this way, the disparity between the shallowest large-scale feature maps and the deepest small-scale feature maps is effectively reduced, enabling interaction between levels while retaining both shallow and high-level feature information. Weight distribution is also considered during cross-layer information transfer. To combine low-resolution features carrying complete semantic information with high-resolution features carrying less semantic detail, we fuse the outermost lower and higher nodes to form a feature pyramid with full-scale semantic entity information. This yields a more cohesive and enhanced feature representation, helps the detector capture target features across scales with heightened precision, and thus contributes to the overall detection precision and robustness.

Taking \(P_{6}^{td}\) and \(P_{7}^{out}\) as examples, the improved feature fusion is:

$$ P_{6}^{td} = Conv\left( {\frac{{\omega_{1} \cdot P_{6}^{in} + \omega_{2} \cdot {\text{Resize}}\left( {P_{7}^{in} } \right) + \omega_{3} \cdot \sigma \left[ {AvgPool\left( {P_{7}^{in} } \right)} \right]}}{{\omega_{1} + \omega_{2} + \omega_{3} + \varepsilon }}} \right) $$
(12)
$$ P_{7}^{out} = Conv\left( {\frac{{\omega_{1}^{\prime} \cdot P_{7}^{in} + \omega_{2}^{\prime} \cdot {\text{Resize}}\left\{ {{\text{Resize}}\left[ {{\text{Resize}}\left( {P_{4}^{td} } \right)} \right]} \right\} + \omega_{3}^{\prime} \cdot {\text{Resize}}\left( {P_{6}^{out} } \right) + \omega_{4}^{\prime} \cdot {\text{Resize}}\left[ {{\text{Resize}}\left( {P_{3}^{out} } \right)} \right]}}{{\omega_{1}^{\prime} + \omega_{2}^{\prime} + \omega_{3}^{\prime} + \omega_{4}^{\prime} + \varepsilon }}} \right) $$
(13)

where \(\sigma\) denotes the sigmoid function, AvgPool denotes global average pooling, and \({\text{Resize}}\) denotes the upsampling or downsampling operation used to match resolutions by adapting the dimensions of the feature map. \(Conv\) denotes a convolution used for feature processing, transforming the feature map along the channel dimension. The network structure of MLC-BiFPN is depicted in Fig. 5.

Fig. 5 Structure of MLC-BiFPN

3.3 Introduction of the Generalized Focal Loss V2 Loss Function

GFLV2 [20] introduces the DGQP to measure the quality of target localization: by predicting reliable LQE scores from bounding-box distribution statistics, it assesses localization accuracy more precisely and thus better reflects the positional accuracy of target boundaries. Compared with the traditional Focal Loss [21], the lightweight DGQP adds negligible computational cost in practice. GFLV1 [22] guides model training by using the LQE score as the position score of detection boxes; GFLV2 decomposes the joint feature representation of GFLV1, which alleviates the inconsistency of detection boxes between the training and testing stages.

The joint representation is shown in Eq. (14). When the predicted category is the true category, the joint representation feature \(J = IOU\); otherwise, it is 0.

$$ J_{i} = \begin{cases} {\text{IoU}}\left( {b_{pred} ,b_{gt} } \right), & {\text{if }} i = c; \\ 0, & {\text{otherwise;}} \end{cases} $$
(14)

Here, \({\text{IoU}}\left( {b_{pred} ,b_{gt} } \right)\) denotes the IoU between the predicted bounding box \(b_{pred}\) and the ground truth \(b_{gt}\).
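For reference, the IoU in Eq. (14) can be computed for axis-aligned \((x_1, y_1, x_2, y_2)\) boxes as follows (a standard implementation, independent of this paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```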

Although the joint representation in V1 addresses the inconsistency between training and testing, relying solely on the classification score to produce it has limitations. In V2, therefore, Li et al. form the joint representation from both the classification branch \((C)\) and the regression branch \(\left( I \right)\) (the output of DGQP); although \(J\) is composed of two factors, both are used in the training and testing phases, avoiding inconsistency.

$$ J = C \times I $$
(15)

Here, \(I \in \left[ {0,1} \right]\) denotes the information of the \(IOU\) feature, i.e., the quality estimate of \(IOU\).

The DGQP, a key component of GFLV2, passes the statistics of the learned distribution \(T\) to a very small subnetwork (shown in Fig. 6) to obtain the predicted IoU scalar \(I\), which yields a high-quality Classification-IoU Joint Representation (Eq. (15)). The module takes the distribution statistics as input and outputs the IoU feature \(I\). Denoting the left, right, upper and lower sides of the box as \(\left\{ {l,r,u,v} \right\}\), the discrete probability of side \(\omega\) is defined as:

$$ T^{\omega } = \left[ {T^{\omega } \left( {x_{0} } \right),T^{\omega } \left( {x_{1} } \right), \ldots ,T^{\omega } \left( {x_{n} } \right)} \right] $$
(16)

where \(\omega \in \left\{ {l,r,u,v} \right\}\).

Fig. 6 The structure of DGQP

The \(Top - k\) values and the average value of each distribution vector \(T^{\omega }\) are selected and concatenated as the fundamental statistical feature \(F \in {\mathbb{R}}^{{4\left( {k + 1} \right)}}\).

$$ F = Concat\left( {\left\{ {Topkm\left( {T^{\omega } } \right)\left| {\omega \in \left\{ {l,r,u,v} \right\}} \right.} \right\}} \right) $$
(17)

In this formula, \(Topkm\left( \cdot \right)\) is a joint operation for calculating the value of \(Top - k\) and its average value, and \(Concat\left( \cdot \right)\) denotes channel concatenation.

There are two advantages to choosing the \(Top - k\) values and their mean as input statistics. First, the sum of \(T^{\omega }\) is fixed (\(\sum\nolimits_{i = 0}^{n} {T^{\omega } \left( {x_{i} } \right)} = 1\)), so the \(Top - k\) values together with the mean characterize the shape of the distribution well, e.g., whether it is flat or sharp. Second, the \(Top - k\) values and the mean are insensitive to shifts of the distribution, which keeps them robust to the scale of the target.

After obtaining the statistical feature \(F\), a network \({\mathcal{F}}\left( \cdot \right)\) is designed to estimate the final \(IOU\) quality. The network consists of two fully connected (FC) layers interleaved with a ReLU and followed by a sigmoid, so the \(IOU\) scalar \(I\) is computed as:

$$ I = {\mathcal{F}}\left( F \right) = \sigma \left( {W_{2} \delta \left( {W_{1} F} \right)} \right) $$
(18)

where \(\delta\) and \(\sigma\) are the ReLU and sigmoid activations, respectively, \(W_{1} \in {\mathbb{R}}^{{p \times 4\left( {k + 1} \right)}}\) and \(W_{2} \in {\mathbb{R}}^{1 \times p}\), \(k\) is the \(Top - k\) parameter, and \(p\) is the channel dimension of the hidden layer.
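Putting Eqs. (16)-(18) together, a minimal PyTorch sketch of the DGQP subnetwork (the hidden width and \(k\) are chosen here for illustration) is:

```python
import torch
import torch.nn as nn

class DGQP(nn.Module):
    """Sketch of the Distribution-Guided Quality Predictor (Eqs. (16)-(18)):
    Top-k values plus the mean of each side's distribution are concatenated
    and mapped to an IoU quality scalar by two FC layers."""
    def __init__(self, k: int = 4, hidden: int = 64):
        super().__init__()
        self.k = k
        self.fc1 = nn.Linear(4 * (k + 1), hidden)  # W1 in Eq. (18)
        self.fc2 = nn.Linear(hidden, 1)            # W2 in Eq. (18)

    def forward(self, dist: torch.Tensor) -> torch.Tensor:
        # dist: (B, 4, n+1), softmaxed distributions for sides {l, r, u, v}
        topk = torch.topk(dist, self.k, dim=-1).values      # (B, 4, k)
        mean = dist.mean(dim=-1, keepdim=True)              # (B, 4, 1)
        stats = torch.cat([topk, mean], dim=-1).flatten(1)  # F, Eq. (17)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(stats))))  # I, Eq. (18)
```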

4 Experiments

This section first describes the datasets and experimental details. Multi-group ablation experiments are then performed on MS COCO 2017 to validate the effectiveness of each component of SpanEffiDet and its impact on overall performance. Next, SpanEffiDet is compared with other advanced algorithms in the target detection field, and qualitative experiments are reported. Finally, generalization experiments on PASCAL VOC2007 and PASCAL VOC2012 compare per-category results with other algorithms to further validate the superiority of SpanEffiDet.

4.1 Datasets and Evaluation Protocol

To validate the proposed approach, three datasets that are widely used in target detection tasks are used in this paper, namely, MS COCO [23], PASCAL VOC2007 and PASCAL VOC2012 [24, 25].

The MS COCO dataset is an extensively employed large-scale dataset for object detection, segmentation and keypoint detection. Provided by Microsoft, it is one of the most representative and challenging datasets in computer vision today. COCO comprises 115 k images for training, 5 k for validation, 20 k for the test-dev set, and another 20 k for the test-challenge set. The biases of MS COCO mainly concern class distribution, scale variation, and pose diversity. First, the imbalanced distribution of object categories leaves some categories with too few samples and others with a surplus, which can hurt detection performance on less common categories. Second, the scale variation and pose diversity in the dataset challenge a model's multi-scale fusion and robustness. Despite these biases, MS COCO contains a large number of images and object instances covering diverse real-world scenarios, and its high-quality annotations, with detailed object locations, category IDs, and so on, make it one of the recognized benchmark datasets in deep learning. The model proposed in this paper is therefore trained on train2017 and evaluated on val2017. Evaluation follows the COCO metrics: average precision over the ten IoU thresholds 0.50:0.05:0.95 (AP), at 0.5 (AP50) and at 0.75 (AP75), plus average precision for objects of different scales, namely APS, APM, and APL for small, medium, and large objects (area < 32², 32² < area < 96², and area > 96²).
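For reference, these metrics can be reproduced with the standard pycocotools evaluator; the file paths below are hypothetical placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical paths; detections.json holds predictions in the COCO
# result format (image_id, category_id, bbox, score).
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
# evaluator.stats[:6] -> AP, AP50, AP75, APS, APM, APL
```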

PASCAL VOC is partitioned into two editions, namely the 2007 and 2012 versions, encompassing a combined total of 20 distinct object categories. The PASCAL VOC 2007 dataset comprises 9,963 images and encompasses 24,640 target objects, while the PASCAL VOC 2012 dataset consists of 11,540 images and encompasses 27,450 target objects. The PASCAL VOC2007 and PASCAL VOC2012 datasets also exhibit some biases, primarily related to category distribution, image quality, and object scale. Firstly, the datasets suffer from an imbalance in category distribution, leading to subpar performance of models on less common categories. Secondly, there are issues with image quality in the datasets, including noise, blurriness, and lighting variations, which negatively impact the model’s generalization ability. Additionally, the problem of object scale in the datasets also affects the fusion of multi-scale features in the model.

Despite the inherent biases in the PASCAL VOC2007 and PASCAL VOC2012 datasets, they encompass a wide range of object categories and real-world scenarios, providing an intuitive reflection of the diversity and complexity in the real world. This is advantageous for evaluating the model’s generalization ability across different backgrounds. Additionally, the PASCAL VOC datasets are widely recognized as benchmark datasets in the fields of object detection and classification, enjoying high visibility and extensive usage. Conducting experiments on these datasets enhances the comparability and credibility of research results, facilitating fair comparisons and evaluations with other algorithms. Therefore, in this paper, generalizability experiments are conducted using the PASCAL VOC2007 and PASCAL VOC2012 datasets.

4.2 Implementation Details

SpanEffiDet is implemented in the PyTorch deep learning framework, with the following hardware and software configuration: an Intel® Xeon® Gold 6348 CPU @ 2.60 GHz, 32 GB of RAM, an NVIDIA GeForce RTX 3080Ti graphics card, Windows 10, Python 3.8, Anaconda 3, CUDA 11.7, cuDNN 8.0.5, PyTorch 11.7, and other Python libraries. Transfer learning is used during training: freeze training expedites the process and protects the model's weights in the initial phase. Given the laboratory hardware and the dataset scale, the batch size is set to 8 during the freeze phase and reduced to 4 after unfreezing. Training spans 50 epochs, with epochs 0-25 for freeze training and epochs 26-50 after unfreezing. Every model is trained with the Adamax optimizer at an initial learning rate of 3e-4, annealed according to the cosine decay rule. Due to hardware constraints, the SpanEffiDet model could only be trained up to version D2 in the experiments.
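The schedule described above can be sketched as follows (a simplified outline under the assumption that the model exposes a `backbone` submodule and that `run_epoch` stands in for one pass over the training data):

```python
import torch
from torch.optim import Adamax
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_schedule(model: torch.nn.Module, run_epoch):
    """Freeze-then-unfreeze schedule as described above. `model.backbone`
    and `run_epoch(model, batch_size)` are hypothetical placeholders."""
    optimizer = Adamax(model.parameters(), lr=3e-4)       # initial LR 3e-4
    scheduler = CosineAnnealingLR(optimizer, T_max=50)    # cosine decay
    for epoch in range(50):
        freeze = epoch < 25               # first half: backbone frozen
        for p in model.backbone.parameters():
            p.requires_grad = not freeze
        run_epoch(model, batch_size=8 if freeze else 4)   # smaller batch after unfreezing
        scheduler.step()
```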

4.3 Performance Analysis of the SpanEffiDet Algorithm Proposed in this Paper

In this section, to fully demonstrate the efficacy of SpanEffiDet, this paper compares the accuracy and efficiency of SpanEffiDet with other object detection methods on the PASCAL VOC and MS COCO datasets.

First, baseline experiments with EfficientDet were carried out on the publicly available MS COCO dataset; the D0, D1, D2 and D3 versions were trained. The outcomes are presented in Table 1: as the EfficientDet version grows, AP increases while FPS decreases.

Table 1 The experimental outcomes for different versions of EfficientDet

Although EfficientDet-D3 has the highest accuracy, this comes at the expense of substantial computing resources and a relatively complex training procedure. To explore improvements on a more lightweight model, this paper selects EfficientDet-D0 as the baseline and conducts ablation experiments, which allow the contribution of each module to overall performance to be analyzed in detail. In addition, we compare SpanEffiDet with the baseline from three perspectives, time complexity, algorithmic efficiency, and number of parameters, to reveal their respective strengths and weaknesses. According to the comparison results in Table 2, SpanEffiDet outperforms the baseline in time complexity, algorithmic efficiency and number of parameters, further validating its superior performance.

Table 2 Comparison between time complexity, algorithm efficiency and number of parameters

By visualizing the features learned by the network, we can better understand its localization and classification behavior during detection, as well as how well different depth levels extract and express target features. We selected shallow-, middle- and deep-level feature maps for visual comparison; regions without lines represent background and regions with lines represent targets. This analysis deepens the understanding of how the network discriminates between target and background at different depths, improving the interpretability of the model. Figure 7 shows feature-map visualizations before and after adding MLC-BiFPN. The comparison shows that with MLC-BiFPN the features extracted by the network become richer. This advantage stems from the design of MLC-BiFPN, which effectively facilitates information interaction between levels: fusing low-resolution features carrying complete semantic information with high-resolution features carrying less semantic detail allows the network to extract feature information more fully.

Fig. 7 Feature map visualization results before and after adding MLC-BiFPN

4.4 Ablation Experiment

To verify the effect of each module on the performance of the whole model, we add the modules to the original algorithm one by one and evaluate each stage with the COCO-standard metrics. The CAFF, MLC-BiFPN and GFLV2 loss functions are added in turn. As shown in Table 3, each module enhances the performance of the whole model. MLC-BiFPN brings the most notable gain: on the EfficientDet-D0 model it improves AP by 3.2%, followed by CAFF with 2.8%, and finally GFLV2 with 2.8%. In particular, AP75 and APS increase by 3.6% and 1.5%, respectively, indicating that SpanEffiDet performs better under stricter IoU thresholds and when detecting small objects.

Table 3 The outcomes of ablation studies on D0

The CAFF module proposed in this paper transforms the channel features into the frequency domain via the Fourier transform, applies semantic-adaptive frequency filtering, and then applies the inverse transform, effectively extracting key features and removing redundant information from EfficientNet. Second, building upon BiFPN, this study introduces cross-level information exchange by incorporating both top-down and bottom-up cross-path information propagation, yielding the structure named MLC-BiFPN. This allows interaction among levels, facilitates the comprehensive expression of semantic and geometric information between levels, effectively conveys feature representations between upper and lower layers, and balances the transmission of input information among feature nodes. Compared with the original EfficientDet, this design is better suited to complex scenarios and multi-level feature expression. To demonstrate the advantages of SpanEffiDet, Table 4 compares the main features of SpanEffiDet and EfficientDet, highlighting their differences and limitations and the corresponding solutions.

Table 4 Comparison of the differences between the main features of SpanEffiDet and EfficientDet

4.5 Comparison with Other Target Detection Algorithms

After the comparison with the original algorithm, SpanEffiDet is further compared in detail with other target detection algorithms. To ensure fairness, the results are compared on identical training and validation sets. The outcomes in Table 5 show how SpanEffiDet fares on different metrics against current mainstream object detection models: the proposed SpanEffiDet surpasses the other object detection algorithms. Although resource limits restricted the experiments to the D2 version, the performance of versions beyond D2 is anticipated to improve further. Notably, neither the two-stage frameworks Faster R-CNN and Mask R-CNN, the one-stage frameworks YOLOv3 and YOLOv4, nor the anchor-free detector CenterNet [26] outperformed the proposed SpanEffiDet in precision.

Table 5 Comparison of SpanEffiDet with other object detection algorithms

As shown in Table 5, SpanEffiDet-D3 achieves 49.3% AP on the MS COCO dataset. Compared to classic object detection algorithms such as CenterNet, SSD, Faster R-CNN, and Mask R-CNN, the network architecture of SpanEffiDet yields significantly superior performance across the metrics. The baseline EfficientDet has clear strengths in object detection but also limitations: it performs relatively poorly on small or dense objects, since its design focuses more on large objects, and because feature information is processed insufficiently, its detection is suboptimal in some scenarios, while redundant information burdens inference speed. In contrast, the ablation results show that the performance gains of our proposed SpanEffiDet stem from three mechanisms: CAFF, MLC-BiFPN, and the GFocalV2 loss function. The CAFF module filters the latent spatial information of the backbone network, extracting features and removing redundant information; this effectively optimizes the network structure and enhances detection performance. MLC-BiFPN facilitates interaction between levels, combining low-resolution features carrying complete semantic information with high-resolution features carrying less semantic detail, enabling the model to better detect small or dense objects. The GFocalV2 loss function introduces the DGQP structure to evaluate bounding-box accuracy, contributing to the improved precision of SpanEffiDet. Compared with the EfficientDet-D0, D1, D2, and D3 models, SpanEffiDet increases AP by 3.3%, 3.1%, 2.9%, and 2.8%, respectively.

4.5.1 Comparison with State-of-the-Art algorithms

Table 6 compares the detection results of SpanEffiDet with state-of-the-art object detection methods. (1) Comparisons with CNN-based detectors. CNN-based detectors extract image features progressively through stacked convolution and pooling operations, achieving effective identification and localization of objects. CNNs preserve spatial information efficiently, capturing features from different image regions through convolution, and the shared convolution kernels keep the models compact and efficient. In practice, however, these algorithms have drawbacks. For example, while YOLOv7 and YOLOv8 have improved in speed, they do not match the detection accuracy of more complex algorithms; YOLOF is limited by its model architecture and struggles with specific scenes or complex targets; and VarifocalNet, although its focus mechanism improves detection accuracy, increases model complexity and training cost, demanding more computational resources. SpanEffiDet integrates the advantages of CNNs: CAFF refines the latent spatial information processed by the CNN, MLC-BiFPN performs span-scale and span-path feature fusion, and the GFocalV2 loss function adaptively evaluates and adjusts the bounding-box accuracy. With the same training epochs, SpanEffiDet outperforms models such as YOLOv7, YOLOv8, and VarifocalNet [44], surpassing YOLOv7 by 9.6% AP, YOLOv8 by 11.9% AP, and VarifocalNet by 13.6% AP.

Table 6 Comparison of SpanEffiDet with advanced target detection algorithms

4.5.2 Comparisons to Transformer-Based Detectors

Transformer-based object detection models attend to all positions in the input sequence simultaneously, unlike CNNs, which capture features through local receptive fields. By handling long-range dependencies effectively, they help the model better understand contextual information and the semantic correlations between objects. However, these models have higher computational complexity and rely on positional encodings to handle positional information, posing significant challenges for training and deployment. In SpanEffiDet's backbone, CAFF significantly reduces the parameter count and computational complexity by filtering the latent network information. With the three mechanisms CAFF, MLC-BiFPN, and the GFocalV2 loss function, SpanEffiDet surpasses Conditional DETR by 4.6% AP and Anchor DETR by 5.0% AP.

Figure 8 compares the proposed algorithm with state-of-the-art algorithms in detection accuracy and speed. The graph and experimental results show a clear improvement: compared with the original EfficientDet, both detection accuracy and speed are enhanced, further validating the effectiveness of our approach.

Fig. 8 Comparison with state-of-the-art algorithms in terms of detection accuracy and detection speed

4.6 Performance Testing Based on the PASCAL VOC Dataset

To verify the universality of the SpanEffiDet model, parallel experiments are also conducted on the PASCAL VOC2007 and PASCAL VOC2012 datasets. Table 7 compares the proposed algorithm with other object detection algorithms on PASCAL VOC2007, showing its reliability. In addition, Table 8 reports, for PASCAL VOC2012 val images, the per-category mean average precision (mAP, IoU = 0.5) of the EfficientDet models reproduced in this paper and the proposed SpanEffiDet, using the COCO-standard evaluation metrics. SpanEffiDet generally achieves excellent performance across all object categories, with a highest mAP of 87.48%, 4.2% higher than the best-performing EfficientDet-D3. The experimental results demonstrate that the method meets detection requirements across different scenarios with high accuracy and performance.

Table 7 Performance comparison results on PASCAL VOC2007 val images
Table 8 Comparison of the EfficientDet reproduced in this paper and the SpanEffiDet proposed in this paper using the assessment criteria of the COCO standard and the value of mAP (IoU = 0.5) on PASCAL VOC2012 val images

4.7 Performance Analysis After the Introduction of GFLV2

In this paper, GFLV2 is also introduced as an improved loss function, and its performance is tested. Figure 9 shows the results over different epochs; the best average precision (AP) reaches 36.2%. Notably, while GFLV2's performance varies slightly across epochs, its accuracy is consistently better than that of the original Focal Loss; even in the earliest training epochs, the worst result achieved with GFLV2 exceeds the best result achieved with the original Focal Loss. This shows that GFLV2 performs better at all training stages and clearly improves accuracy over the traditional Focal Loss. Figure 10 shows the loss curves before and after introducing GFocalV2: the improved algorithm's loss decreases faster and converges to a lower final value. These results further verify the validity of GFLV2 as a loss function and provide stronger support for the algorithm in this paper.

Fig. 9 Comparison of AP of different epochs

Fig. 10 Comparison of loss curves of the algorithms before and after the introduction of GFocalV2

4.8 Qualitative Results on COCO

We visualize the experimental results of the comprehensive analysis in Fig. 11. The test scenarios cover four cases: dense close-range targets, sparse close-range targets, dense long-range targets and sparse long-range targets. In the dense close-range scene, although all methods successfully detected objects, Faster R-CNN and CenterNet still produced false positives. In the other scenes, the remaining algorithms show relatively low detection accuracy and high missed-detection rates; for example, in the sparse long-range scene, SSD misses 13 targets, RetinaNet misses 4, Faster R-CNN misdetects 1, YOLOv4 misses 5, and EfficientDet-D0 misdetects 1. In contrast, the proposed SpanEffiDet achieves higher detection accuracy and a lower missed-detection rate in all these scenarios (only one target is missed in the dense long-range scene). In particular, it performs well on tiny objects, occluded objects and objects with incomplete information. These results show that SpanEffiDet produces more realistic and accurate detection results in various complex scenarios.

Fig. 11 Visual detection outcomes from various algorithms across four distinct scenarios: a dense close-range targets; b sparse close-range targets; c dense long-range targets; and d sparse long-range targets. From the first row to the seventh row: SSD, RetinaNet, Faster R-CNN, CenterNet, YOLOv4, EfficientDet-D0, and SpanEffiDet-D0 results

To further demonstrate the effectiveness of the SpanEffiDet algorithm in small object detection, we have included visual detection comparison results on small object images, as shown in Fig. 12. In each comparison graph, the left side displays the results of the EfficientDet model, while the right side shows the results of our proposed SpanEffiDet algorithm.

Fig. 12 Visualization results of SpanEffiDet for small object detection

5 Conclusion

To resolve the tension between the limited network and parameter scale of the lower EfficientDet versions and the demand for high-precision detection, this paper integrates the proposed CAFF module into EfficientNet and proposes a bi-directional feature pyramid network, the Multi-level cross-BiFPN, which realizes multi-tier, multi-node feature fusion learning, thereby constructing the SpanEffiDet model. The experimental results demonstrate that SpanEffiDet achieves significant performance improvements on MS COCO, PASCAL VOC2007, and PASCAL VOC2012. Extensive experiments and visualization results substantiate the superiority of SpanEffiDet. We hope that the proposed SpanEffiDet can serve as a robust foundational framework for future object detection tasks.