1 Introduction

Object detection is one of the most important tasks in computer vision and has attracted increasing attention for applications such as autonomous driving [1]. With the development of deep learning, object detection has incorporated many learning methods based on convolutional neural networks (CNN) [2, 3], in which the backbone of the detector usually consists of a large number of convolutional operations to achieve better feature extraction. In pursuit of better detection results, previous object detection models grew increasingly large in parameter count and computational complexity, ignoring the real-time requirements of object detection and making deployment on low-compute devices such as mobile phones difficult. To reduce model complexity, methods such as quantization [4–7] and pruning [8–10] can effectively reduce model size and improve detection speed by pruning redundant connections in the network [11], although at the cost of some reduction in detection accuracy.

Knowledge distillation [12, 13] is an approach that can improve the accuracy of a model without changing the size of the network, by learning the behavior of a more powerful network. When knowledge distillation was first proposed, it was used more for image classification tasks and less for object detection, mainly because the teacher network's soft-label output did not directly help the student network locate targets. Later work proposed passing intermediate-layer feature information from the teacher network to the student network to provide the localization information needed for object detection. However, this learning process is inefficient, and the student network often does not absorb all the knowledge of the teacher network. We therefore address the shortcomings of traditional distillation techniques through self-distillation [14, 15]. Instead of training a large teacher network, we extract valid information from the student network itself and let the student network be its own teacher, which reduces the computational cost of training.

The framework of an object detection model usually consists of a backbone network, a neck network and a detection head. The backbone network is responsible for extracting image feature information, and the neck network usually combines feature information extracted at different scales [16, 17], for example through a feature pyramid network (FPN) [18, 19] or path aggregation network (PAN) [20], to better fuse the semantic and positional information of the backbone and neck networks. The detection head then performs detection on the output feature maps. Previous knowledge distillation methods for object detection typically distilled the output feature maps of intermediate layers in the backbone network, assigning a global weight to the information in the teacher network's feature map for the student network to learn from. However, we believe the information in the neck network is richer. Shallow feature maps have a larger scale and are better suited for detecting small targets, while deeper feature maps have a smaller scale and are better suited for detecting large targets. We therefore perform distillation learning simultaneously on three different scales of feature maps in the neck network, letting the output part of the network learn the feature information of the corresponding scale from its own middle layers. This approach is more conducive to detecting multi-scale targets. Our method differs from other multi-scale detection modules [21] in that it does not change the network structure: it performs distillation only within the original network and aims to exploit the feature information of layers at different depths.

Moreover, knowledge distillation methods commonly used in object detection distill the entire feature map, which can introduce errors when determining target positions and makes it difficult to capture target locations accurately. We therefore want the distillation process to learn more effectively from the local region of the target. To this end, we add a Gaussian mask for assisted detection [22, 23], whose main purpose is to distinguish the foreground from the background: we first encode the ground-truth region of the target with Gaussian values, set the remaining background region to 0, and then calculate the mean-square-error loss between this encoded region and the model's output feature map. The generated Gaussian mask can be matched to output feature maps at different scales through a feature adaptation layer, enabling its application in a variety of object detectors to improve detection accuracy.

In summary, the key contributions of this article are as follows.

1) Our strategy eliminates the need to train a huge teacher model; instead, a simple model is trained as its own teacher. This saves significant training time and computational cost, and facilitates real-time processing and on-device deployment.

2) We propose a multi-scale distillation scheme that distills the feature maps at different scales in the neck network, extracting their effective feature information and calculating the distillation loss between them and the output feature maps. This facilitates the detection of multi-scale targets and also improves small-target accuracy.

3) To localize targets more accurately during distillation learning, we generate a Gaussian mask that separates foreground from background in the image-processing stage and calculate a mask loss against the output of the detection stage, improving the model's detection accuracy. These methods do not change the basic structure of the model and therefore do not significantly increase the number of parameters.

2 Related works

Object detection, as one of the most important tasks in computer vision, aims to find the class and position of a given target in an image. In the last few years, object detection has evolved very rapidly. There are two main categories. One is two-stage detectors such as Faster-RCNN [24], Mask-RCNN [25] and Cascade-RCNN [26], which usually generate candidate boxes first and then fine-tune the bounding boxes. The other is single-stage detectors, represented by YOLO [27–31], SSD [32–34], FCOS [35] and RetinaNet [36, 37], which do not generate candidate regions but directly classify and localize targets. Over time, both types of algorithms have been refining their model structures to improve detection performance. Although they are now equipped with richer network structures, their computational cost and network size have gradually increased, making it difficult to meet the requirements of mobile deployment. Designing lightweight detector backbones [38, 39] has therefore become a research trend for speeding up detection. In addition, a number of studies aim to transfer knowledge from a large detector to a simple detector, which is another approach to improving the performance of small detection models.

Knowledge distillation (KD) has become one of the most effective techniques for compressing large models into smaller and faster ones; compared with pruning and quantization, it can improve a model's own detection accuracy by learning from the knowledge of a large model. The idea of knowledge distillation was first proposed by Bucila et al. [40] and popularized by Hinton et al. [41], who transferred knowledge from the teacher network to the student network through soft-label outputs. FitNets [42] showed that, in addition to the KD loss, feature information from the middle layers of both networks could also be used to guide the student. However, knowledge distillation was at the time applied mostly to image classification tasks; subsequently it was also widely used in object detection. Chen et al. [43] performed knowledge distillation on three parts: the backbone network, the neck features and the detection head. FGFI [44], on the other hand, guided the student by extracting fine-grained features only in the foreground object region, restricted to ground-truth neighborhoods. DeFeat [45] considered that the background region also contains useful information, so foreground and background were decoupled and useful knowledge was transferred to the student network through the decoupling of neck features and the decoupling of classification heads, respectively.

With the rapid development of knowledge distillation, it was found that there were limits to what a student network could learn from a teacher network, and to the efficiency of that learning process. This is why the idea of self-distillation was born: the student uses only what it has learned inside the model to guide itself, without the guidance of a large teacher model. SAD [46] is a classic self-distillation framework that lets a network use the attention maps obtained from its own middle layers as distillation targets for lower layers, without proper labeling or additional supervision, that is, performing distillation through top-down, layered attention maps within the network itself. DLB [47] is also a fast self-distillation framework, which distills the soft targets generated in the previous iteration for half of each mini-batch; this method requires neither additional runtime memory nor modification of the model structure. As a relatively advanced self-distillation framework, LGD [48] enhances the relationship with the appearance of the target through label guidance and a self-attention mechanism, thereby improving detection accuracy. FRSKD [49], in turn, is a self-distillation approach combining data augmentation (the network should produce consistent predictions for targets of the same class) with an auxiliary network (additional branches in the middle of the classifier network are guided toward similar outputs through knowledge distillation). It uses soft labels and feature maps for self-distillation, integrating and refining feature layers of different depths to guide feature maps at the same level. Moreover, we find that knowledge learned from feature maps at different scales can also help the network improve its accuracy and address the detection needs of multi-scale targets.

3 Method

3.1 Multi-scale distillation loss

In this section, we describe our proposed multi-scale distillation framework in detail. As a classic object detector, you only look once (YOLO) incorporates a multi-scale detection structure in its model. Considering model size and computational cost, we choose the YOLOv5 network as the main experimental framework. Its overall structure is shown in Fig. 1 and contains three main components: the backbone network, the neck network and the detection head.

Figure 1

Overview of the proposed multi-scale self-distillation framework, which extracts shallow and deep feature maps at different scales from the feature pyramid network (FPN) structure of the neck network and distills between same-scale feature maps taken from layers of different depths. We then generate a Gaussian mask for the input image from the ground truth and calculate the mask loss between the mask and the output feature maps of the detection head. The yellow part is the feature adaptation layer of the Gaussian mask

As CNN-based detectors continue to evolve and the need for multi-scale target detection grows, different detectors have added modules that facilitate multi-scale detection. Figure 2 shows the network structure of YOLOv5, whose neck combines FPN and PAN modules. We use the feature layers of the FPN at different scales as teachers, and the deeper PAN output layers as students. The feature map changes scale three times in the network. Since a larger-scale feature map has a smaller downsampling rate relative to the original image and a smaller receptive field, it is better suited to detecting smaller objects and is assigned smaller anchors. Therefore, we detect large targets on the small (20 × 20) feature map, medium-sized targets on the medium (40 × 40) feature map, and small targets on the large (80 × 80) feature map. We calculate the distillation loss between feature maps of the same scale. The distillation losses at the three scales are calculated separately, and to better improve detection accuracy for targets at different scales, we assign different weight coefficients to the three losses.

$$\begin{aligned}& L_{\mathrm{s}}=\frac{\gamma}{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \bigl(S_{(h,w,c)}^{\mathrm{s}} - T_{(h,w,c)}^{\mathrm{s}}\bigr)^{2} , \end{aligned}$$
(1)
$$\begin{aligned}& L_{\mathrm{m}}=\frac{1}{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \bigl(S_{(h,w,c)}^{\mathrm{m}} - T_{(h,w,c)}^{\mathrm{m}}\bigr)^{2} , \end{aligned}$$
(2)
$$\begin{aligned}& L_{\mathrm{l}}=\frac{1}{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \bigl(S_{(h,w,c)}^{\mathrm{l}} - T_{(h,w,c)}^{\mathrm{l}}\bigr)^{2} . \end{aligned}$$
(3)
Figure 2

YOLOv5 network structure diagram and distillation layer indicator diagram, where SPPF refers to the spatial pyramid pooling fast module

The above equations show the distillation-loss calculation for targets at three different scales, where \(L_{\mathrm{s}}\), \(L_{\mathrm{m}}\) and \(L_{\mathrm{l}}\) denote the distillation losses for small, medium and large targets, respectively, \(S_{(h,w,c)}^{k}\) and \(T_{(h,w,c)}^{k}\) (\(k=\mathrm{s},\mathrm{m},\mathrm{l}\)) are the student and teacher feature maps at the corresponding scale, and \({N=HWC}\) is the total number of elements. Since small-target detection has been a major difficulty, we add a weighting factor γ to the small-target loss to balance the distillation-loss scale; γ is tuned experimentally. We then average the three losses to obtain the total distillation loss \(L_{\mathrm{distill}}\) shown in Eq. (4), where \(k=3\) is the number of feature-map scales.

$$ L_{\mathrm{distill}}=\frac{1}{k} (L_{\mathrm{s}} + L_{\mathrm{m}} + L_{ \mathrm{l}}). $$
(4)
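
For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)–(4), assuming the three teacher (FPN) and student (PAN) feature maps have already been extracted as tensors. The function name, the list ordering, the default γ and the detaching of the teacher maps are illustrative assumptions, not part of the original implementation.

```python
import torch.nn.functional as F

def multi_scale_distill_loss(student_feats, teacher_feats, gamma=0.05):
    """L_distill of Eq. (4) from three same-scale feature-map pairs.

    student_feats / teacher_feats: lists of three (B, C, H, W) tensors,
    ordered [small targets (80x80), medium (40x40), large (20x20)].
    Teacher maps are detached so gradients flow only through the student.
    """
    weights = [gamma, 1.0, 1.0]  # gamma rescales the small-target loss, Eq. (1)
    losses = [w * F.mse_loss(s, t.detach())  # mean over all elements, Eqs. (1)-(3)
              for w, s, t in zip(weights, student_feats, teacher_feats)]
    return sum(losses) / len(losses)  # 1/k with k = 3 feature scales
```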

3.2 Mask assisted detection

To improve target localization efficiency and detection accuracy, we design a mask-assisted detection method that focuses on the network's output feature maps. We find that features in the central region of a target generalize better in the model, so we introduce a Gaussian mask to highlight the foreground pixel features of the target region and suppress the surrounding background when the image is input to the network. Specifically, assume the ground-truth box of the target is B, with width w and height h and center coordinates \({(x_{\mathrm{c}},y_{\mathrm{c}})}\). The Gaussian mask is defined in Eq. (5).

$$ {M_{(x,y)}} = \mathrm{e}^{- \frac{(x-x_{\mathrm{c}})^{2}}{\sigma _{x}^{2} (w/2)^{2}} - \frac{(y-y_{\mathrm{c}})^{2}}{\sigma _{y}^{2} (h/2)^{2}}} ,\quad (x,y)\in B. $$
(5)

The mask value \(M_{(x,y)}\) is set to 0 when the pixel coordinates \((x,y)\) do not fall within the ground-truth region. Here \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\) are the decay factors for the two coordinate directions; we set \(\sigma _{x}^{2} = \sigma _{y}^{2}\) for ease of calculation. The mask is thus nonzero only within the ground-truth box and equal to 0 in all other regions, so it helps the network focus more on the foreground region. The visual results of a Gaussian mask are displayed in Fig. 3.
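
The mask construction of Eq. (5) can be sketched as follows, assuming ground-truth boxes are given in pixel coordinates as (x_c, y_c, w, h); keeping the element-wise maximum where boxes overlap is our assumption, since the text does not specify overlap handling.

```python
import torch

def gaussian_mask(height, width, boxes, sigma2=2.0):
    """Gaussian mask of Eq. (5) at input-image resolution.

    boxes: iterable of (x_c, y_c, w, h) ground-truth boxes in pixels.
    sigma2: shared decay factor sigma_x^2 = sigma_y^2 (2 works best, Table 6).
    Pixels outside every box stay 0.
    """
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    mask = torch.zeros(height, width)
    for xc, yc, w, h in boxes:
        g = torch.exp(-(xs - xc) ** 2 / (sigma2 * (w / 2) ** 2)
                      - (ys - yc) ** 2 / (sigma2 * (h / 2) ** 2))
        inside = ((xs - xc).abs() <= w / 2) & ((ys - yc).abs() <= h / 2)
        mask = torch.maximum(mask, g * inside)  # keep the Gaussian only within B
    return mask
```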

Figure 3

Visual results of Gaussian mask generation on the KITTI dataset

Since the mask we generate is based on the size of the input image, while the feature map's size and number of channels change after network processing, we add a feature adaptation layer so that the mask matches the size and channel count of each feature map. The structure of this layer is simple: a convolutional transformation layer followed by a ReLU activation layer. The mask-assisted detection flowchart is demonstrated in Fig. 4.
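
A minimal sketch of such an adaptation layer is given below; the kernel size and the use of a strided convolution to match the feature map's downsampling rate are our assumptions, since the text specifies only a convolutional transformation followed by ReLU.

```python
import torch.nn as nn

class FeatureAdaptation(nn.Module):
    """Maps the 1-channel Gaussian mask to the size and channel count of
    one output feature map (a sketch under the stated assumptions)."""

    def __init__(self, out_channels, stride):
        super().__init__()
        # stride matches the feature map's downsampling rate, e.g. 8, 16 or 32
        self.conv = nn.Conv2d(1, out_channels, kernel_size=3,
                              stride=stride, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, mask):  # mask: (B, 1, H_img, W_img)
        return self.relu(self.conv(mask))
```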

Figure 4

Matching Gaussian masks generated from the input image with feature maps at different scales. The feature adaptation layer aligns the mask with each feature map so that the mask loss can be calculated

After passing through the feature adaptation layer, the Gaussian mask has the same size and number of channels as the output feature map; the mask loss \(L_{\mathrm{mask}}\) between them is then calculated and continuously minimized during training:

$$ {L_{\mathrm{mask}}}=\frac{1}{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum _{c=1}^{C} \bigl(M_{(h,w,c)}^{\mathrm{adap}} - F_{(h,w,c)}\bigr)^{2} , $$
(6)

where \(N=HWC\) is the total number of elements, \(M_{(h,w,c)}^{\mathrm{adap}}\) is the mask adjusted by the feature adaptation layer and \(F_{(h,w,c)}\) is the output feature map. Mask-assisted detection helps the output feature map better highlight target information and suppress background information, improving detection results.
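
Putting the pieces together, a hypothetical usage for one detection scale follows; the channel count, image size and the random stand-in feature map are purely illustrative.

```python
import torch
import torch.nn.functional as F

boxes = [(320.0, 240.0, 100.0, 60.0)]                  # dummy ground-truth box
adapt = FeatureAdaptation(out_channels=256, stride=8)  # one layer per scale
m = gaussian_mask(640, 640, boxes).view(1, 1, 640, 640)
feature_map = torch.randn(1, 256, 80, 80)              # stand-in for F_(h,w,c)
l_mask = F.mse_loss(adapt(m), feature_map)             # Eq. (6)
```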

Combining the multi-scale distillation loss \(L_{\mathrm{distill}}\) and mask loss \(L_{\mathrm{mask}}\) introduced above with the loss \(L_{\mathrm{gt}}\) generated by training the detector, we define the total loss L of our algorithm as the sum of the three losses, as shown in Eq. (7).

$$ {L} = L_{\mathrm{gt}} + L_{\mathrm{distill}} + L_{\mathrm{mask}}. $$
(7)

4 Experiments

4.1 Experimental setup

To validate the effectiveness of our method, we first conduct experiments on a state-of-the-art single-stage detector, YOLOv5, using the KITTI dataset. KITTI is currently the world's largest computer-vision evaluation dataset for autonomous-driving scenarios, containing real images collected from urban, rural and highway scenes. Each image can contain up to 15 vehicles and 30 pedestrians, meeting the needs of multi-scale, multi-object detection. It includes 7481 training images and 7518 test images and mainly covers three target classes: vehicles, pedestrians and cyclists. To validate the generalization of our method, we also apply it to different detectors for comparison experiments. The evaluation metrics are the mean average precision (mAP, with \(\Delta mAP\) denoting its improvement), \(AP_{50}\), \(AP_{75}\), \(AP_{\mathrm{s}}\), \(AP_{\mathrm{m}}\) and \(AP_{\mathrm{l}}\), the last three evaluating accuracy for targets of different scales.

All experiments are conducted on Windows 11 with CUDA 11.2 and an NVIDIA RTX 2080 Ti GPU, using PyTorch as the main framework. All experiments train for 200 epochs with a batch size of 16; the learning-rate decay strategy is cosine annealing, with an initial learning rate of 0.01 and a final learning-rate factor of 0.1.
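
A minimal sketch of this schedule is shown below; the optimizer type and momentum value are assumptions, since the text specifies only the learning rates and the cosine decay, and `model` and `loader` are placeholders for the detector and data loader.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=0.01 * 0.1)  # anneal to 0.1x the initial rate

for epoch in range(200):
    for images, targets in loader:     # batch size 16
        loss = model(images, targets)  # L = L_gt + L_distill + L_mask, Eq. (7)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```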

4.2 Experimental results and analysis

YOLO, as a classic object detector, has released multiple versions in recent years. To select the most suitable model for our method, we conduct comparative experiments on different versions of the YOLO model. The results are depicted in Table 1 and show that our method improves performance across YOLO versions. Among them, the YOLOv3 and YOLOv4 models have overly large structures yet achieve relatively low accuracy. Although YOLOv7 achieves high detection accuracy, it also demands substantial memory. We therefore choose the more balanced YOLOv5 model as the experimental framework of this article.

Table 1 Comparative experiments on the parameters of different versions of the YOLO model

We conduct ablation experiments on the small (s) and medium (m) versions of YOLOv5. The results show that both mask-assisted detection (MAD) and multi-scale self-distillation (MSSD) improve the model's detection performance, with MSSD improving the detection of multi-scale targets most significantly. The results of the module ablation experiments are presented in Table 2. With YOLOv5s, MSSD improves the small-target accuracy \(AP_{\mathrm{s}}\) by 3.9%, and combining the two improved methods raises the mAP of the model by 2.8%. The network training process before and after using MSSD is shown in Fig. 5.

Figure 5

Comparison of training processes before and after the improvement of different versions of YOLOv5. The left sub-figure shows the convergence comparison results of the training losses, and the right sub-figure shows the comparison results of the \(AP_{50}\) training processes

Table 2 Results of ablation experiments with different versions of YOLOv5 using the improved method. A tick in the box indicates that the method was used

The change in the number of model parameters and the amount of computation is worth noting. Apart from the feature adaptation layer, which adds a small number of parameters, our method does not increase the overall computational burden on the model: the remaining operations are computed outside the network framework and add neither parameters nor computation.

In addition, we conduct ablation experiments with our method on other single-stage detectors, adding the MAD and MSSD methods to the YOLOX-s and FCOS networks, respectively. The experimental results are demonstrated in Table 3 and show that our method effectively improves detection accuracy, with a \(\Delta mAP\) of 2.1% for YOLOX and 1.8% for FCOS.

Table 3 Comparison of ablation experiments using our method with single-stage detectors of different frameworks (YOLOX and FCOS). A tick in the box indicates that the method was used

We also compare our method with some classical knowledge distillation methods for object detection. KD, fine-grained feature imitation (FGFI) and distilling object detectors via decoupled features (DeFeat) are teacher-based methods; we use YOLOv5s as the student model with two backbones, CSPDarkNet and ConvNext, and set YOLOv5m as the teacher model. Label-guided self-distillation (LGD) is a recently proposed teacher-free distillation method. Our comparative experiment replaces the backbone of all models with CSPDarkNet. The experimental results are displayed in Table 4. The accuracy of LGD is superior to that of FGFI and DeFeat by 0.6% and 0.2%, respectively, while our method matches LGD at 39.4%. When ConvNext is chosen as the backbone, the detection accuracy outperforms CSPDarkNet, and our method outperforms LGD by 0.2%. However, ConvNext has more parameters and trains more slowly, which may not be conducive to lightweight models and real-time detection.

Table 4 Experimental results comparing the YOLOv5s model with different knowledge distillation methods using CSPDarkNet and ConvNext as the backbone, where the evaluation metric is \(mAP\) (%)

The detection of multi-scale targets has long been a research focus in this field, with small-target detection being one of the difficult areas. Our multi-scale self-distillation method helps improve the detection of multi-scale targets. As shown in Eq. (1), the γ parameter regulates the scale of the small-target loss \(L_{\mathrm{s}}\): decreasing γ rebalances this loss so that it can be optimized more effectively. Table 5 presents the parameter-comparison results. \(\gamma =1\) means the loss weights of the three scales are equal, while an overly small value such as 0.001 suppresses the small-target loss too strongly, leading to a local optimum and making effective training difficult. The small-target accuracy \(AP_{\mathrm{s}}\) improves most when γ is set to 0.05.

Table 5 Results of hyperparameter-tuning experiments, where the γ parameter in the multi-scale self-distillation (MSSD) method is adjusted

Our proposed mask-assisted detection method highlights features in the foreground region by enlarging the contrast between foreground and background. The experimental results are shown in Table 6. The Gaussian mask is generated by Eq. (5), where the parameters \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\) adjust the spread of the mask; we set \(\sigma _{x}^{2} = \sigma _{y}^{2}\) for convenience. A larger value spreads the Gaussian toward the boundary of the ground-truth box, while a smaller value concentrates it in the central region. When \(\sigma _{x}^{2} = \sigma _{y}^{2} = +\infty \) is set, the Gaussian mask degenerates into a binary mask over the ground-truth region. To verify the validity of the Gaussian mask, we experiment with different values of \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\); the results show that the mask assists best when \(\sigma _{x}^{2} = \sigma _{y}^{2} = 2\).
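To see the binary-mask limit explicitly: as the decay factors grow, the exponent in Eq. (5) vanishes for every pixel inside B,

$$ \lim_{\sigma _{x}^{2}, \sigma _{y}^{2} \to +\infty} M_{(x,y)} = \mathrm{e}^{0} = 1, \quad (x,y)\in B, $$

so the mask takes the value 1 everywhere inside the ground-truth box and 0 outside, i.e. a plain binary foreground mask.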

Table 6 Hyperparameter-tuning experiments for Gaussian masks using the experimental model YOLOv5s+MAD

To verify the generalization of our method, we apply it to different object detectors. Single-stage detectors such as YOLOX, RetinaNet and FCOS contain multi-scale detection structures, so we add MSSD to them directly. In the FPN structure of a two-stage detector, the feature-map scales change substantially and large convolutional kernels are used, giving the network a large overall parameter count and computational complexity, so generating a distillation layer would require changing the original network structure. Although this would improve detection accuracy to some extent, it would also increase the computational cost and slow down detection. The experimental results are depicted in Table 7: the accuracy of the detectors improves after adding MSSD, with a \(\Delta mAP\) of 1.1% for RetinaNet and 2.1% for FCOS. YOLOv5s shows the most obvious improvement, with a \(\Delta mAP\) of 2.8%.

Table 7 Comparison of experimental results of MSSD applied to different detectors

A comparison of our detection results on the KITTI dataset is shown in Fig. 6, focusing on car, pedestrian and cyclist targets. Generally speaking, at the same distance a car occupies a larger image region than a pedestrian or cyclist, so many car targets in the image are large while many cyclist and pedestrian targets are small or medium-sized, making them harder to detect. Our method effectively improves the detection of small and medium-sized targets, alleviating the missed and false detections caused by target occlusion in long-range views.

Figure 6

Comparison of detection results before and after applying multi-scale self-distillation (MSSD) on the KITTI dataset, with baseline detection results on the left and MSSD detection results on the right

5 Conclusion

In this paper, we propose a novel self-distillation framework, called MSSD, for knowledge distillation of multi-scale targets. It targets multi-scale detection structures such as FPN and PAN. Within the network, we use the shallow layers as teachers and the deep layers as students, extract information from feature maps of different scales, calculate the corresponding multi-scale losses, and perform distillation; the small-target loss is specially weighted to effectively improve small-target detection accuracy. In addition, we add a Gaussian mask based on the ground-truth box of the target for mask-assisted detection, which suppresses background information and highlights the target's features during detection. Our approach is computationally inexpensive, requiring no guidance from a large teacher model; it demonstrates good performance compared with other methods and can be applied to different object detectors.