SPCS: a spatial pyramid convolutional shuffle module for YOLO to detect occluded objects

In crowded scenes, heavily overlapped objects are hard to distinguish from each other, because most of their pixels are shared and the visible pixels of the occluded objects, which are used to represent their features, are limited. In this paper, a spatial pyramid convolutional shuffle (SPCS) module is proposed to extract refined information from the limited visible pixels of occluded objects and to generate distinguishable representations for heavily overlapped objects. We apply four convolutional kernels with different sizes and dilation rates at each location in the pyramid features and recombine their fused outputs into spatially adjacent positions using a pixel shuffle module. In this way, four distinguishable instance predictions, corresponding to the different convolutional kernels, can be produced for each location in the pyramid feature. In addition, multiple convolutional operations with different kernel sizes and dilation rates at the same location generate refined information for the corresponding regions, which helps to extract features of occluded objects from their limited visible pixels. Extensive experimental results demonstrate that the SPCS module effectively boosts performance in crowded human detection. The YOLO detector with the SPCS module achieves 94.11% AP, 41.75% MR and 97.75% Recall on CrowdHuman, and 93.04% AP and 98.45% Recall on WiderPerson, which are the best results compared with previous state-of-the-art models.


Haibo Luo, luohb@sia.cn

Introduction
Object detection is a basic and practical task in computer vision. In recent years, with the development of convolutional neural networks (CNNs), researchers have seen broad prospects for detection techniques in various domains, such as pedestrian and vehicle detection in automatic driving, remote object recognition [1,2] and intelligent surveillance systems [3]. Many CNN-based detectors have been proposed, such as the YOLO series [4][5][6][7], SSD [8], DSSD [9], Faster-RCNN [10], CenterNet [11] and FCOS [12], all of which have been shown to achieve state-of-the-art (SOTA) performance on general object detection benchmarks such as COCO [13] and Pascal VOC [14]. However, for all of these models, there is still room for improvement when objects appear in crowds and overlap each other heavily. There are two main challenges in that situation: (1) heavily overlapped objects are hard to distinguish in the semantic feature space, because most of their pixels are shared and the visible pixels of the occluded object, which are used to represent its particularity, are limited; (2) the traditional greedy non-maximum suppression (NMS) process mistakenly suppresses heavily overlapped prediction boxes when their degree of overlap is greater than a specific threshold. These two challenges prevent current models from reaching their full potential.
To date, some works have been proposed to improve the detection performance in crowded scenes [15][16][17][18][19][20]. Other works focus on the NMS process [21][22][23]. To the best of our knowledge, most works that specifically address the occlusion issue are based on two-stage detectors. Compared with two-stage models, one-stage detectors have many obvious advantages. The YOLO series, as representative models among SOTA detectors, strikes a good balance between precision and inference speed, and has therefore been widely deployed in industry. Many works, such as Scaled-YOLOv4 [24] and YOLOX [25], improve the YOLO detector in terms of network structure, image augmentation, training methods and so on, and they all achieve impressive progress on the COCO dataset compared with the original YOLO, which demonstrates that YOLO detectors still have the potential to perform better. However, YOLO-based detectors have another shortcoming in crowded object detection. The original YOLO separates each pyramid feature into several grids (e.g., 13 × 13, 26 × 26 and 52 × 52 grids with an input size of 416 × 416), and each grid is assigned only one ground truth box whose center point is located in it. When two or more objects overlap each other heavily and their center points are located in the same grid, only one of these objects is retained in the training process and the others are ignored, as shown in Fig. 1a.
Fig. 1 a Since each grid can extract only one object whose center point is located in it, heavily overlapped objects are easily ignored in the original YOLO. b Fine meshing places the centers of objects in different grids as far as possible, so the heavily occluded objects can be preserved in the training process. c The visible pixels of the occluded objects are limited, as shown in the yellow region, which makes the high-level semantic features of the occluded objects hard to distinguish from those of the objects that cover them
In this paper, we propose a spatial pyramid convolutional shuffle module, named SPCS, for YOLO detectors to handle crowded scenes. The YOLO-based detector with the SPCS module is named YOLOC. The SPCS module enlarges the pyramid features by fusing the outputs of four convolutional layers with different kernel sizes via a pixel shuffle module [26]. There are two steps in the SPCS module. First, for each grid in the YOLO pyramid feature, the spatial pyramid convolutional (SPC) module generates four distinguishable sub-features extracted using four convolutional kernels with different sizes and dilation rates. In this way, distinguishable representations can be generated for multiple overlapped objects that occupy almost the same regions. Then, the four output features, which have the same number of channels, are concatenated along the channel dimension, and a pixel-shuffle module is adopted to increase the resolution of the feature pyramid, i.e., to place the sub-features extracted from the same location at spatially adjacent positions. The SPCS module can not only increase the resolution of the feature pyramid, which compensates for the weakness of YOLO's positive target determination mechanism in crowded scenes, as shown in Fig. 1b, but also provide distinguishable features for heavily overlapped targets. To verify the ability of the SPCS module in occluded object detection and to exclude the influence of the NMS post-process, we adopt three NMS methods, i.e., greedy NMS, Adaptive NMS [22] and Soft NMS [27], to comprehensively show the performance of the SPCS module. It is worth noting that the Adaptive NMS algorithm requires the predicted density of objects as the Intersection over Union (IoU) threshold. Compared with extracting information from a single object, density prediction needs to extract information from multiple overlapped objects, so a strong information extraction ability and larger receptive fields are necessary [22].
Therefore, the accuracy of the predicted density in Adaptive NMS can be used as a metric to measure the ability to extract information from occluded objects. In this paper, we design a density prediction experiment to demonstrate that the proposed SPCS module can improve the information extraction ability. Unlike Adaptive NMS [22], we adopt a tiny density prediction head to predict the density of objects, so that the influence of the SPCS module stands out and an extra complex network does not mask the weakness of the original network in density prediction.
Extensive experiments are conducted to verify the effectiveness of the SPCS module. First, ablation studies are performed on the CrowdHuman [28] dataset to verify the overall effectiveness of the SPCS module in occluded object detection. Second, a density prediction experiment is conducted to demonstrate that the SPCS module can improve the information extraction ability. Third, comparative experiments are conducted on CrowdHuman and WiderPerson [29] to compare the overall performance of our model with some SOTA models. The results show that YOLOC achieves the best performance in AP and Recall and the second best performance in MR on CrowdHuman, i.e., 94.11%, 97.75% and 41.77%, respectively. On WiderPerson, YOLOC achieves 93.04% AP, 50.71% MR and 98.45% Recall. Moreover, benefitting from its one-stage structure, YOLOC achieves the fastest inference speed among all the compared SOTA models.
For clarity, the main contributions of this paper can be summarized as follows:
1. A spatial pyramid convolutional shuffle module is proposed to boost the ability to extract information from the limited visible pixels of occluded objects and to generate distinguishable representations for them.
2. A tiny density prediction head and a density loss function are proposed for the density prediction experiment, which is designed to show that the SPCS module can improve the information extraction ability.
3. Extensive comparative experiments are conducted on CrowdHuman and WiderPerson to show that models with the SPCS module achieve the best performance in heavily occluded object detection.

Related work
Generic object detection Object detection, as a fundamental computer vision task, has made great progress with the rapid development of convolutional neural networks. The mainstream detection models are usually categorized into two-stage models [10][30][31][32][33][34] and one-stage models [4][5][6][7][8][9][35]. RCNN [30] is the first to apply CNNs to object detection and proposes a two-stage framework: it first generates proposal boxes using the selective search algorithm [36], and then conducts box regression and classification on the proposal boxes obtained in the first stage to get refined predictions. To solve the problem that computation cannot be shared between different object predictions, Fast RCNN [31] proposes RoI (region of interest) pooling so that the output from each proposal has the same size, which increases the inference speed significantly. Faster-RCNN [10] lays the foundation of two-stage detectors: it proposes an RPN (region proposal network) to replace the selective search algorithm, which filters out background regions and efficiently generates precise proposal regions. Some other works, such as RoI Align [32], RoI warping pooling [37], PrRoIPooling [38] and PSRoI pooling [34], focus on the pooling process for the region of interest. Although the two-stage methods achieve impressive precision, their inference speed is often unsatisfactory. Different from the two-stage detectors, the one-stage methods replace the predicted proposal boxes with fixed anchor boxes that are densely tiled over the prediction features, and conduct regression and classification on the anchor boxes in a fully convolutional way. SSD [8] proposes a one-stage framework and utilizes multi-scale features to detect objects of different scales. DSSD [9] fuses deep and shallow features to enrich the semantic information of the high-resolution features, which is helpful for small target detection.
RetinaNet [35] proposes Focal Loss to address the extreme imbalance between positive and negative samples from which one-stage detectors suffer. YOLOv3 [6] proposes a fully convolutional network, DarkNet, as the backbone and achieves a good balance between speed and precision. Later, YOLOv4 [7], Scaled-YOLOv4 [24] and YOLOX [25] are proposed to optimize the YOLO model in terms of network structure, image augmentation and training method. To deal with scale variation, most of the aforementioned detectors are based on anchor boxes of various scales and shapes that are densely tiled on the feature maps. To eliminate the influence of the hyper-parameters introduced by anchor boxes and the speed reduction caused by non-maximum suppression post-processing, anchor-free methods have been proposed. CornerNet [39] proposes a keypoint-based method that predicts the top-left and bottom-right corners of the object box. CenterNet [11] is also a keypoint-based method, which predicts the center point of the bounding box as the keypoint. FCOS [12] proposes a fully convolutional anchor-free model and utilizes multi-scale features to resolve the ambiguity that arises when objects overlap each other. Since the computation of one-stage detectors is shared among all the targets in an image, one-stage detectors have a great advantage over two-stage methods in terms of inference speed.
Works for crowded scenes Although generic detectors have achieved great performance, crowded scenes are still challenging for them, and many works have been proposed to unlock their potential in crowded target detection. [15] proposes a novel concept in which each proposal box predicts multiple targets rather than one, to solve the problem of feature confusion between heavily overlapped objects. In addition, it proposes a set-NMS in which prediction boxes that are generated from the same proposal box are preserved, and others are suppressed. [16] follows an iterative scheme that detects a subset of objects at each iteration, with no interaction between the detection results of different iterations. This method needs to run inference more than once in one detection process, which is obviously inefficient. [17] proposes a multi-scale attention feature aggregation module that can extract deeper information, and adds an attention block to enhance the features of objects. In addition, many works focus on improving the NMS process. [20] develops a double anchor RPN to capture the body and head parts in pairs, which are used to guide the NMS process. This method uses both head and body information; however, not all instances in various datasets are labeled as head-body pairs. Different from [20], which predicts head-body box pairs, [21,23], which work in very similar ways, predict visible-full box pairs. The visible parts of objects in crowded scenes rarely overlap, and there is a correspondence between each visible box and its full box. Thus, the visible boxes that are preserved after the NMS process can be used to guide the selection of full prediction boxes. However, only datasets that are labeled in this way can be used to train such methods.
[22] finds that a fixed IoU threshold is not reasonable for crowded scenes and claims that the IoU threshold should change according to the density of the object in question, i.e., increase when objects are dense and decrease when they are sparse. To solve the occlusion problem in pedestrian detection, [40] introduces a channel-wise attention mechanism into Faster-RCNN to handle different occlusion patterns. It finds that some specific channels show strong activations at the human head, upper body and feet, respectively. Guided by the difference between the activations from different regions, the attention mechanism reweights each channel so that the occluded parts have a lower impact on the final score. [18] proposes AggLoss, which is also adopted by [22], to make the proposal boxes corresponding to the same object more compact. In addition, a new part occlusion-aware region of interest (PORoI) pooling is utilized to integrate the prior structure information of the human body with visibility prediction into the network.
The aforementioned algorithms have achieved great performance in crowded pedestrian detection. However, to the best of our knowledge, most works that specifically address the occlusion issue are based on two-stage detectors such as Faster-RCNN, which do not offer as good a balance between precision and inference speed as one-stage models. In this paper, we propose a one-stage detector that achieves SOTA performance in terms of precision and is significantly faster than current two-stage models.

Methods
In crowded scenes, many objects are heavily occluded by other objects, and the pixels that represent their specificity are limited, which makes them hard to distinguish from the objects that cover them. In this section, we introduce the SPCS module, which generates refined distinguishable features for heavily occluded objects.

SPCS module
YOLO's mechanism for determining positive anchors is not friendly to crowded targets. In the YOLO algorithm, the feature pyramids are divided spatially into several grids (e.g., 13 × 13, 26 × 26 and 52 × 52 with an input size of 416 × 416), and only the grid that contains the center point of an object is regarded as positive. However, when two or more heavily overlapping targets have their center points in the same grid, only one target is preserved in the training process, and the others are ignored. This shortcoming makes a target that heavily overlaps another one hard for YOLO to predict. Increasing the resolution of the feature pyramid is a good way to mitigate this problem. Fine meshing assigns the targets to different grids as far as possible, as shown in Fig. 1b. However, there is another problem: for an object that is heavily covered by an object in front of it, the front object occupies most of the region of its bounding box, and the visible pixels that express its particularity are limited, as shown in Fig. 1c. Therefore, for the occluded object, it is difficult to extract a distinguishable representation that is far away from that of the front object in the feature space.
Fig. 2 The spatial pyramid convolutional shuffle module consists of 4 different convolutional kernels. The convolutional kernels with different sizes and dilation rates cover different scopes, and their outputs are shuffled to generate distinguishable representations for heavily overlapped objects. Except for the final convolutional layers in the prediction head, every Conv module consists of one convolutional layer, one BatchNorm layer and one Mish [41] activation layer
In this paper, we propose a spatial pyramid convolutional shuffle module that increases the resolution of the pyramid feature and generates refined distinguishable representations for heavily overlapped targets at the same time. As shown in Fig. 2, the SPCS module takes pyramid features as input. Inspired by spatial pyramid pooling (SPP) [42], we apply four convolutional kernels with different sizes and dilation rates to each pyramid feature, and concatenate the outputs of these four convolutional layers along the channel dimension. Specifically, we adopt four kinds of convolutional kernels, i.e., kernel size 1 × 1, kernel size 3 × 3 with dilation rate 1, kernel size 4 × 4 with dilation rate 2 and kernel size 5 × 5 with dilation rate 2. Different kernels cover different spatial scopes. Compared with a single 3 × 3 convolutional layer, we look at each location four times through four different kernels. This hierarchical structure extracts information from different scopes to form refined features that contain not only detailed but also relatively global information about the current region. Then, a pixel-shuffle module is utilized to recombine the features and increase the resolution. In this way, 4 distinguishable sub-features, which correspond to the 4 different convolutional kernels, can be generated from the same grid of the original feature pyramid.
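To make "different spatial scopes" concrete, the following minimal sketch computes the side length of the input window that each listed kernel spans, using the standard relation extent = dilation × (kernel size − 1) + 1 for dilated convolutions. The function name and the `SPC_BRANCHES` list are illustrative; a dilation rate of 1 for the 1 × 1 kernel is an assumption, since the text does not state one.

```python
def effective_extent(kernel_size: int, dilation: int) -> int:
    """Side length, in input pixels, spanned by a dilated square kernel."""
    return dilation * (kernel_size - 1) + 1


# (kernel size, dilation rate) pairs as listed in the text. The 1 x 1 kernel
# has no stated dilation rate, so a rate of 1 is assumed here.
SPC_BRANCHES = [(1, 1), (3, 1), (4, 2), (5, 2)]

extents = [effective_extent(k, d) for k, d in SPC_BRANCHES]
print(extents)  # [1, 3, 7, 9]
```

Under these values the four branches look at windows of 1, 3, 7 and 9 pixels per side around the same location, which is one way to read the claim that the sub-features mix detailed and relatively global information.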
As shown in Fig. 3, we call each 2 × 2 group of grids in the enlarged feature a cell, so a feature with resolution 2W × 2H contains W × H cells. Given an input feature map of size 4C × W × H, where W is the width, H is the height and 4C is the number of channels, the pixel-shuffle module divides the input feature evenly into four parts along the channel dimension, each of size C × W × H. The grids in the first part (the green feature in Fig. 3) are placed in the top-left grid of each cell in the output feature of the pixel-shuffle module, the grids in the second part (the blue feature map in Fig. 3) are placed in the top-right grid of each cell, those in the third part (the red feature map in Fig. 3) in the bottom-left grid and those in the fourth part (the yellow feature map in Fig. 3) in the bottom-right grid. This mechanism gives the grids in every cell a fixed relationship to the preceding convolutional kernel sizes, or in other words, receptive fields. However, this fixed relationship may not be the best choice, e.g., the top-left grids may need larger receptive fields than the other grids in the same cell, or the bottom-right grid may need to focus on a small region. Therefore, before the pixel-shuffle module we add a 1 × 1 convolutional layer, which does not change the number of channels, to fuse the results of the four convolutional layers along the channel dimension. Following the baseline, we use Mish [41] as the activation function in all network structures.
Fig. 3 Pixel shuffle module. The input feature map is divided into four parts along the channel dimension, shown as the yellow, red, green and blue sub-features, and each part is placed adjacently to form an enlarged feature map. The location of each sub-feature in the enlarged feature is fixed
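The rearrangement described above can be illustrated with a small pure-Python sketch. It follows the channel grouping stated in the text (first quarter of the channels to the top-left grid of each cell, then top-right, bottom-left, bottom-right); library implementations of pixel shuffle may order channels differently, so this is an illustration of the description rather than of any particular framework. The function name `pixel_shuffle_2x` and the nested-list layout `[channel][row][column]` are assumptions made for the example.

```python
def pixel_shuffle_2x(feature):
    """Rearrange a [4C][H][W] nested list into [C][2H][2W].

    Channel group 0 fills the top-left grid of every 2 x 2 cell, group 1 the
    top-right, group 2 the bottom-left and group 3 the bottom-right.
    """
    channels = len(feature)
    if channels % 4 != 0:
        raise ValueError("channel count must be a multiple of 4")
    c = channels // 4
    h, w = len(feature[0]), len(feature[0][0])
    out = [[[0] * (2 * w) for _ in range(2 * h)] for _ in range(c)]
    for group in range(4):
        dy, dx = divmod(group, 2)  # row and column offset inside a cell
        for ch in range(c):
            src = feature[group * c + ch]
            for y in range(h):
                for x in range(w):
                    out[ch][2 * y + dy][2 * x + dx] = src[y][x]
    return out


# Four single-channel 2 x 2 parts, each filled with its own group index.
parts = [[[g, g], [g, g]] for g in range(4)]
shuffled = pixel_shuffle_2x(parts)
for row in shuffled[0]:
    print(row)
# [0, 1, 0, 1]
# [2, 3, 2, 3]
# [0, 1, 0, 1]
# [2, 3, 2, 3]
```

Each 2 × 2 cell of the output holds one value from each of the four parts, which is the fixed cell layout that the added 1 × 1 fusion layer is meant to relax.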
In the enlarged feature map, as shown in Fig. 3, the sub-features in the four grids of the same cell are obtained from the same grid of the previous feature map, which is similar to the idea of [15], which predicts multiple boxes from one single proposal box. The SPCS module predicts 4 instances from the same location in the original pyramid feature, but they differ from each other because of the difference between their corresponding convolutional kernel sizes. Multiple convolutional kernels with different receptive fields applied at the same location can provide multilevel information.
In addition, to further enhance the dissimilarity between adjacent sub-features, we add a skip branch that directly transmits detail information to the enlarged features from the low-level features with the same resolution in the backbone, as shown in Fig. 2. The differences between overlapped objects lie in the details, which means that the low-level features, which contain more detail information, can further enhance the difference between adjacent sub-features. Following the YOLO style, the low-level features are concatenated with the output features of the SPCS module to form the new feature pyramid.
Benefitting from its structure, the SPCS module can provide four distinguishable representations for each location of the original pyramid, and the increased resolution covers the shortcoming of the positive anchor determination mechanism of YOLO in crowded scenes.

NMS process
There are two main challenges in heavily overlapped object detection, i.e., the distinguishable feature extraction issue and the NMS process. Even though the heavily overlapped objects are predicted correctly, the post-processing NMS may suppress some of them by mistake.
The original NMS, which is adopted by most SOTA algorithms and achieves great performance on general object datasets such as COCO, cannot easily cope with scenes where targets are crowded. The traditional greedy NMS adopts a fixed IoU threshold and directly deletes the boxes whose IoU with the proposal box is greater than the threshold, which is not reasonable for crowded objects. Soft NMS [27] improves the strategy for pruning the truly redundant prediction boxes. It penalizes the confidence scores of the candidate boxes according to their IoU with the proposal box, i.e., the larger the IoU with the proposal box, the smaller the confidence score becomes, and then suppresses all candidate boxes whose confidence scores are less than the threshold. Adaptive NMS [22] provides the reasonable idea that the IoU threshold should adapt to the density of the targets, i.e., the IoU threshold should increase when targets are dense (heavily overlapping each other) and decrease when targets are sparse (not touching or only mildly overlapping).
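The difference between greedy NMS and Soft NMS can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the experiments: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the Soft NMS variant shown is the linear penalty `score *= 1 - IoU`, and the threshold values and example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thr=0.6):
    """Keep the best box, delete every box overlapping it above iou_thr, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        order = [i for i in order if iou(boxes[m], boxes[i]) <= iou_thr]
    return keep


def soft_nms_linear(boxes, scores, iou_thr=0.6, score_thr=0.3):
    """Decay the scores of overlapping boxes instead of deleting them outright."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])
        remaining.remove(m)
        if scores[m] < score_thr:
            break  # every remaining box scores even lower
        keep.append(m)
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if overlap > iou_thr:
                scores[i] *= 1.0 - overlap
    return keep


# Two heavily overlapping boxes (IoU = 2/3) and one isolated box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (30, 30, 40, 40)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))                      # [0, 2]
print(soft_nms_linear(boxes, scores, score_thr=0.2))  # [0, 2, 1]
```

In this toy case greedy NMS discards the second box outright, whereas Soft NMS only lowers its score to about 0.27, so whether it survives depends on the final score threshold.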
As mentioned above, there are two main challenges in heavily overlapped object detection, i.e., the distinguishable feature extraction issue and the NMS process. These two challenges influence the performance of detectors simultaneously. Since our SPCS module targets the first challenge, in this paper we adopt each of these three NMS methods in turn, aiming to reduce the influence of NMS and to comprehensively show the performance of the SPCS module. The related discussion is presented in the experimental section.

Experiments
To verify that the SPCS module can improve the performance in heavily overlapped object detection and boost the information extraction ability, we evaluate our model on two public datasets, i.e., CrowdHuman and WiderPerson. First, we implement ablation studies on the CrowdHuman dataset to verify the effectiveness of the SPCS module on occluded object detection. Second, a density prediction experiment is conducted to verify that the SPCS module can improve the information extraction ability. Finally, we perform comparative experiments on the CrowdHuman and WiderPerson datasets to compare the performance of YOLOC with some SOTA methods on crowded target detection.

Dataset
CrowdHuman CrowdHuman is a public human detection dataset that contains 15000 images in the training set, 4370 images in the validation set and 5000 images in the test set. There are approximately 470K instances in the training and validation sets, and each image contains 23 instances on average. Each instance has three labels, i.e., a full box that surrounds the whole pedestrian including the occluded parts, a visible box and a head box. In our method, we only use the full box labels. Compared with other pedestrian datasets, instances in CrowdHuman are denser: on average, each image contains 2.40 instances whose IoU with another instance is greater than 0.5. Results evaluated on CrowdHuman are therefore more convincing for verifying the ability of crowded target detection, so we perform most of the ablations and comparisons on CrowdHuman. All results are reported on the validation set.
WiderPerson WiderPerson is a crowded human detection dataset. It contains 8000, 1000 and 4382 images in training set, validation set and test set, respectively, and each image contains 28.87 instances on average. The objects in this dataset are annotated for 5 classes: pedestrian, rider, partially visible persons, crowd and ignored region. Following the protocol of the official evaluation code, we only use annotations of the first category, i.e., pedestrians, for training and testing, and ignore all annotations of other categories. All results are reported on the validation set.

Evaluation metric
Average precision (AP) AP is the mainstream evaluation metric for object detection, which takes into account not only precision but also the recall ratio of the detection results. Larger AP means better performance.
MR MR is a metric commonly adopted in pedestrian detection. It is short for the log-average miss rate over false positives per image (FPPI) in the range $[10^{-2}, 10^{0}]$, which is the same as the official metric of Caltech [43]. MR is very sensitive to the false-positive rate. Lower MR means better performance.
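Recall Recall is short for the maximum recall over all detection boxes. This metric reflects the proportion of ground truth objects that are predicted as true positives, i.e., how many ground truth objects can be predicted properly. It can be calculated as $\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, where TP is the number of true positives, FN is the number of missed ground truth objects, and TP + FN equals the number of ground truth objects. Larger Recall means better performance.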

Training settings
We train all the models using the SGD optimizer with momentum 0.937 and 3 warm-up epochs. All training images are resized to 864, and Mosaic and MixUp [7] are used for image augmentation. The larger edges of the testing images are resized to 896 without any image augmentation. Multi-scale training and testing are not adopted. The cosine learning rate scheduling strategy [44] is adopted, in which $lr_t$ is the learning rate at epoch $t$, $lr_{init}$ is the initial learning rate, $T$ is the total number of training epochs, and $1-\eta$ controls the lower limit of the learning rate. Similar to the original YOLO, we use the K-means algorithm to cluster 9 anchor boxes from the corresponding dataset. In the training process, anchors are determined as positive or negative by comparing $w_a$ and $h_a$, the width and height of the current anchor box, with $w_t$ and $h_t$, the width and height of each target bounding box. For the sake of fairness, all of our experiments use the same hyperparameters and image augmentation method. The results are obtained with PyTorch 1.7.1 on 8 NVIDIA RTX 3090 GPUs.
Table 1 shows the ablation experiments of the proposed SPCS module. The baseline is Scaled-YOLOv4 with three different NMS methods. For the original NMS, we set the IoU threshold to 0.6. For Adaptive NMS, for the sake of fairness, we use the ground truth density, calculated from the annotations of the validation dataset, as the IoU threshold.
We set the batch size to 32, the initial learning rate $lr_{init}$ of the cosine scheduling strategy to 0.005, η to 0.12 and T to 300. The backbone and neck of the models are initialized from Scaled-YOLOv4 pre-trained on COCO, and the rest are initialized with a Gaussian distribution with a mean of 0 and a variance of 0.2.
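As a rough illustration of a cosine learning rate schedule with a lower limit, the sketch below decays the rate from `lr_init` at epoch 0 to a fixed fraction of `lr_init` at epoch `T`. The exact formula used in the experiments is not reproduced in this text, so the form shown here is an assumption based on common practice. In particular, `floor_ratio` is a hypothetical parameter: how it maps onto the paper's η is not specified here, and the value 0.1 is chosen only for the example.

```python
import math


def cosine_lr(epoch, total_epochs, lr_init, floor_ratio):
    """Cosine decay from lr_init (epoch 0) to floor_ratio * lr_init (last epoch).

    floor_ratio is a hypothetical name for the lower-limit fraction; it is an
    assumption of this sketch, not a quantity defined in the text.
    """
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_init * (floor_ratio + (1.0 - floor_ratio) * cosine)


# lr_init = 0.005 and T = 300 as stated in the text; floor_ratio = 0.1 is illustrative.
for epoch in (0, 150, 300):
    print(epoch, round(cosine_lr(epoch, 300, 0.005, 0.1), 6))
```

With these illustrative values the rate starts at 0.005, passes through 0.00275 at the midpoint and ends at 0.0005.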

Ablation studies on CrowdHuman
We can see that, compared with the original Scaled-YOLOv4, the SPCS module significantly improves the performance of crowded object detection. With the original NMS, the AP, MR and Recall improve by 0.97%, 1.44% and 1.08%, respectively; with Adaptive NMS using ground truth density, the AP, MR and Recall improve by 0.75%, 1.42% and 0.85%, respectively; with Soft NMS, the AP, MR and Recall improve by 0.62%, 1.43% and 0.63%, respectively. Because the gains hold under all three NMS methods, these results exclude the influence of the NMS method and demonstrate that the SPCS module improves the performance in crowded scenes.
The increased resolution of the pyramid features ensures that as many targets as possible are preserved in the training process, and the distinguishable multilevel semantic information provided by the SPCS module yields refined semantic information for occluded objects. These two properties are reflected directly in the recall rate. Table 1 shows only the maximum recall rate. However, since the recall rate varies with the confidence score threshold, we sample recall values at intervals of 0.1 over the confidence score threshold range [0, 1], as shown in Fig. 4. Regardless of the NMS process, the recall rate of the model with the SPCS module is higher than that of the baseline. This benefits from the two properties mentioned above. In addition, the results generated with Soft NMS achieve the best Recall of 99.01%, which means that only very few objects are missed. The comparison between the original NMS and Soft NMS also demonstrates that many objects are suppressed by the NMS process rather than missed by the detector.

Ablation studies on information extraction ability
In this section, we use Adaptive NMS to design a density prediction experiment to demonstrate that our SPCS module can enhance the information extraction ability for crowded objects. Adaptive NMS utilizes density information that is predicted by the network. Density prediction needs to consider information from multiple overlapping objects, and the limited visible pixels of the occluded objects play an important role in precise density prediction. If two objects overlap extremely heavily and only very few pixels of the occluded object are visible, their real density, i.e., the IoU between their bounding boxes, will tend to 1. However, if these few visible pixels are ignored by the detector, which means that the detector perceives only one object there, the predicted density will tend to 0, which is the opposite of the truth and will produce a large error in Adaptive NMS. Therefore, the accuracy of the predicted density in Adaptive NMS can be used to demonstrate the information extraction ability for occluded objects indirectly. The performance gap between using the predicted density and using the ground truth density in the Adaptive NMS algorithm can be used as a metric of the information extraction ability.
Prediction head In our method, to make the influence of the SPCS module prominent and to prevent an extra complex structure from covering up the shortcomings of the original network in density prediction, a tiny density prediction branch is adopted. Our density prediction subnet contains only two convolutional layers, i.e., a 3 × 3 convolutional layer and a 5 × 5 convolutional layer, which is much simpler than the head used in [22], as shown in Fig. 5b.

Fig. 5 a The density prediction head adopted in [22]; b the tiny density prediction head we adopted in our model. To make the information extraction ability of the SPCS module prominent, we use a simple structure for density prediction so that no extra complex structure covers up the shortcomings in the information extraction ability of the original model. Except for the final convolutional layers, which output the predictions, every Conv module consists of one convolutional layer, one BatchNorm layer and one Mish activation layer

Density loss In an image with multiple objects, each object may overlap with more than one other object, and we choose the maximum IoU value as its density label. The object density is defined as follows:

$$td_i = \max_{b_j \in \psi,\ j \neq i} \mathrm{iou}(b_i, b_j)$$

where $td_i$ is the density label of the box $b_i$, defined as the maximum bounding box IoU with all other ground truth boxes in the set $\psi$, and iou(x, y) computes the IoU value of the two input boxes x and y.
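As a rough illustration of the density label described above (the maximum IoU of a box with every other ground-truth box), the following sketch computes the labels for a small set of boxes. The function names `iou` and `density_labels` and the corner format (x1, y1, x2, y2) are assumptions made for this example, not the authors' code.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); format is an assumption."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def density_labels(gt_boxes):
    """Density label of each box: its maximum IoU with any other box."""
    labels = []
    for i, box_i in enumerate(gt_boxes):
        overlaps = [iou(box_i, box_j)
                    for j, box_j in enumerate(gt_boxes) if j != i]
        # An image with a single object has no neighbours: density 0.
        labels.append(max(overlaps, default=0.0))
    return labels


# Two overlapping boxes and one isolated box (toy coordinates).
labels = density_labels([(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)])
```

In this toy case the two overlapping boxes share half of their area, so each gets a density of 1/3, while the isolated box gets 0.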
Table footnotes: 1 Prediction density generated by the density prediction head shown in Fig. 5(b); 2 The ground truth density is calculated using annotation information. Bold values indicate better results than other methods under the current index

In the NMS process, a candidate box $b_i$ should be suppressed if

$$\mathrm{iou}(M, b_i) > t_M$$

where M is the current proposal box and $t_M$ is the adaptive IoU threshold of M. Following the Adaptive NMS, $t_M$ is defined as

$$t_M = \max(d_t, d_M)$$

where $d_t$ is the lower bound of the adaptive threshold, set manually to 0.6 in our method, and $d_M$ is the predicted density value of the predicted box M. The NMS process is summarized as follows: for the current proposal box M and a candidate box B:

if $d_M \le d_t$, which means box M is located in a sparse region, the NMS process follows the traditional process, i.e., the boxes whose IoU values with M are greater than the fixed threshold $d_t$ are suppressed and the others are preserved.

if $d_t < d_M$, which means box M is located in a crowded region, the boxes whose IoU values with M are greater than M's density $d_M$ are suppressed. This adaptive threshold preserves the predicted boxes belonging to different objects even though they heavily overlap with each other.
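The two cases above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes the adaptive threshold takes the form max(d_t, d_M), which matches the sparse and crowded cases described above, and the names `adaptive_nms` and `iou` and the (x1, y1, x2, y2) box format are introduced here for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); format is an assumption."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def adaptive_nms(boxes, scores, densities, d_t=0.6):
    """Greedy NMS whose IoU threshold adapts to the density of the kept box.

    densities[i] is the density value attached to box i (predicted or
    ground truth). Returns the indices of the kept boxes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)                  # current proposal box M
        keep.append(m)
        t_m = max(d_t, densities[m])      # assumed form of the threshold
        # Suppress only the candidates that overlap M by more than t_m.
        order = [i for i in order if iou(boxes[m], boxes[i]) <= t_m]
    return keep


# Two boxes with IoU = 80 / 120, roughly 0.667 (toy coordinates).
boxes = [(0, 0, 10, 10), (2, 0, 12, 10)]
scores = [0.9, 0.8]
sparse = adaptive_nms(boxes, scores, densities=[0.1, 0.1])
crowded = adaptive_nms(boxes, scores, densities=[0.7, 0.7])
```

With low densities the threshold stays at 0.6 and the second box is suppressed; with densities of 0.7 the threshold rises above the overlap and both boxes survive, which is the behavior the crowded case is meant to provide.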
Different from [22], which uses the smooth L1 loss for density prediction, our method uses the focal loss $L_d$ to train the density prediction, as defined in equation (6), where K is the width or height of the output feature of the SPCS, N is the number of anchor boxes at each grid, and $\mathbb{1}^{obj}_{ij}$ means that the loss function penalizes the corresponding density prediction error only if an object occurs in the grid whose index is (i, j). The γ is set to 0.2 in this paper. We train the Scaled-YOLOv4 with the density prediction head in Fig. 5(b) using the loss function $L_d$ in equation (6) and use the results obtained through the Adaptive NMS as the baseline. To measure the quality of the predicted density, we test the model through Adaptive NMS using the ground truth density and the predicted density, respectively. The difference between the test results obtained with the predicted density and with the ground truth density serves as a metric of the quality of the predicted density. The results closer to the ground truth density indicate better performance.

As shown in Table 2, the results obtained with the predicted density are compared with those obtained with the ground truth density as the IoU threshold. The deviations are 0.2%, 0.31% and 0.09%, respectively. The density predicted after using the SPCS module is closer to the ground truth density, which shows that the SPCS module is helpful in density prediction. As mentioned above, density prediction needs refined information about several overlapping objects, so the experimental results indirectly show that the SPCS module is helpful in refined information extraction.
To visually demonstrate the changes brought by the SPCS module, we visualize the pyramid features. The details are shown in Appendix A.

Discussion on parameters and inference speed
We also study the parameter increases and the time costs brought by the SPCS module. Since the testing results reported in Table 1 are all based on the YOLOC that is integrated with our tiny density prediction head as shown in Fig. 5(b), we only report the inference speed when the density prediction head is adopted, as shown in Table 3.
Table footnotes: 1 The inference speed of CrowdDet is tested using its official code on the same platform as our YOLOC; 2 The results generated by the adaptive NMS method take the predicted density as the IoU threshold. Bold values indicate better results than other methods under the current index

The tiny prediction head adds about 4.05 M parameters when the SPCS is not integrated and 4.43 M parameters when the SPCS module is involved. The difference arises because the number of channels of the features input to the density prediction head changes when the SPCS module is adopted. In addition, the SPCS module adds about 21 M parameters. The difference between the cases with and without the density prediction head is also caused by the change in the input feature channels. The increase in parameters is mainly brought by the four dilated convolutions. We concatenate their output features channel-wise rather than adding or multiplying them element-wise, which makes the number of channels of the intermediate features of the SPCS four times as large as before. However, this mechanism preserves as much information as possible, as shown in Table 1. In addition, the added parameters have little effect on the inference speed, and our YOLOC can still achieve real-time performance.
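To make the link between channel-wise concatenation and parameter count concrete, the sketch below counts the weights of a convolution that consumes the fused feature. The channel width of 256, the 3 × 3 kernel, and the presence of a convolution directly after the fusion are all assumptions chosen for illustration; they are not taken from the paper, and the helper `conv_params` is hypothetical.

```python
def conv_params(kernel, c_in, c_out, bias=False):
    """Number of learnable parameters in a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


c = 256          # hypothetical channel width of each of the four branches
branches = 4

# Element-wise addition keeps c channels; concatenation yields 4 * c.
after_add = conv_params(3, c, c)
after_concat = conv_params(3, branches * c, c)
ratio = after_concat // after_add
```

Because a convolution's weight count is linear in its input channels, any layer that reads the concatenated feature needs four times the weights it would need after element-wise fusion, which is consistent with the observation that the parameter counts shift whenever the input channel number changes.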

Comparative experiment
In this section, we compare the performance of YOLOC with some current SOTA algorithms in terms of precision and inference speed on CrowdHuman and WiderPerson, respectively.
CrowdHuman CrowdHuman is one of the most convincing datasets for testing a model's ability to detect occluded objects. We compare our YOLOC with the newest SOTA methods, and the results are shown in Table 4. Our method achieves the best performance in AP and Recall, and its MR is the second best among all previous SOTA models. Among all one-stage models, YOLOC leads by a large margin in detection performance. Figure 9 shows the PR curves of YOLOC and CrowdDet, which is currently the best detector on CrowdHuman. Other results in Fig. 9 are also generated by the official codes of [15]. Regardless of which NMS method is used, our YOLOC achieves better performance.

Fig. 7 The red boxes are prediction boxes generated by our YOLOC, the green boxes are the ground truth used in the evaluation codes, and the yellow boxes are the heavily occluded objects which are predicted by YOLOC properly but will be seen as false-positive ones during the evaluation process

Discussion about the bad performance in MR on WiderPerson As mentioned above, MR is extremely sensitive to the false-positive rate. By observing the ground truth annotations used by the official evaluation codes, we find that a possible reason why YOLOC performs poorly in MR is that YOLOC detects many heavily occluded objects that are treated as negatives by the official evaluation codes. According to our observation, this phenomenon is very common in our testing on WiderPerson, as shown in Fig. 6. In WiderPerson's annotations referenced by the evaluation code, many heavily occluded objects are ignored, which means that if these occluded objects are detected as positive, they will be counted as false positives, and the MR will increase. Our YOLOC has a strong ability to detect not only the fully visible objects but also the occluded objects with limited visible pixels, which is why we achieve the best performance in Recall. However, many occluded objects detected by YOLOC are counted as false positives during the official evaluation process, which is very harmful to the MR metric. Unlike WiderPerson, the CrowdHuman dataset annotates, as far as possible, all objects that occur in the image, regardless of how many of their pixels are visible, as shown in Fig. 7. Therefore, YOLOC can achieve SOTA performance in MR on the CrowdHuman dataset.

Fig. 8 Qualitative results on CrowdHuman. The red boxes are prediction boxes generated by our YOLOC, the blue boxes are the prediction results of baseline, and the green boxes are the boxes missed by the baseline but detected by the YOLOC
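The effect of missing annotations on the false-positive count can be illustrated with a simplified matching routine. This is not the official WiderPerson or CrowdHuman evaluation code: the greedy matching, the IoU threshold of 0.5, the (x1, y1, x2, y2) box format and the names `iou` and `count_tp_fp` are assumptions for this sketch.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); format is an assumption."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def count_tp_fp(detections, ground_truths, iou_thr=0.5):
    """Greedily match detections (sorted by score) to unused annotations."""
    used = set()
    tp = fp = 0
    for det in detections:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(ground_truths):
            overlap = iou(det, gt)
            if j not in used and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            used.add(best_j)
            tp += 1
        else:
            fp += 1      # no free annotation: counted as a false positive
    return tp, fp


# A visible person and a heavily occluded person, both detected correctly.
detections = [(0, 0, 10, 20), (4, 0, 14, 20)]
fully_annotated = count_tp_fp(detections, [(0, 0, 10, 20), (4, 0, 14, 20)])
occluded_ignored = count_tp_fp(detections, [(0, 0, 10, 20)])
```

The detector output is identical in both calls; only the annotation set changes. When the occluded person is annotated, both detections are true positives, and when that annotation is absent, the same correct detection is scored as a false positive.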

Conclusion
In this paper, we propose a spatial pyramid convolutional shuffle module named SPCS for occluded object detection. Since it is difficult to extract distinguishable representations for heavily occluded objects, at each location of the pyramid features we adopt multiple convolutional kernels with different receptive fields, and the output features are recombined spatially using a pixel-shuffle module to increase the resolution. In this way, four instance predictions can be generated for each location of the pyramid feature, and each of them is distinguishable since they correspond to four different convolutional kernels, respectively. Moreover, the multiple convolutional kernels with different receptive fields can extract refined information for each region, which is helpful for the detection of occluded objects whose visible pixels are limited. Extensive experimental results demonstrate the effectiveness of the SPCS module on occluded object detection.
Funding The authors did not receive support from any organization for the submitted work.

Fig. 9 Visualized results of the ablation studies. From the left column to the right column are the input images labeled with the target bounding boxes, the visualized pyramid features generated by the SPCS module, and the visualized pyramid features from the original Scaled-YOLOv4. The results in three rows correspond to the pyramid features at three scales, respectively

Declarations
where $F_m \in \mathbb{R}^{H \times W}$ is the mean of the feature F along the channel dimension, and max(x) and min(x) are the maximum and minimum values of the input x, respectively. As Fig. A1 shows, the SPCS module causes two changes. First, the adjacent sub-features in the middle-column features are distinguishable. In detail, each 2 × 2 grid (red frame A) corresponds to a 1 × 1 grid (red frame B), and the mean values of the four sub-features in frame A are unequal, which means that two overlapped objects can be distinguished if their center points fall in different grids in frame A, whereas they cannot be distinguished if their centers are both located in frame B. Second, the features generated by the SPCS module have higher overall contrast, and, compared with the background regions, the regions where the objects are located are more prominent, which means that the information extraction ability is boosted by the SPCS module. This phenomenon also explains why the performance of the object density prediction is enhanced after the SPCS module is adopted, as reported in Section 4.5.
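The visualization step can be sketched as follows. The exact formula is not shown in this excerpt, so the sketch assumes the common min-max normalization of the channel-wise mean feature, which is consistent with the symbols defined above; the function names and the nested-list layout (C channels of H × W values) are assumptions for illustration.

```python
def channel_mean(feature):
    """Mean over channels of a feature stored as C nested H x W lists."""
    c = len(feature)
    h = len(feature[0])
    w = len(feature[0][0])
    return [[sum(feature[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


def min_max_normalize(f_m):
    """Rescale a 2-D map to [0, 1] using its own minimum and maximum."""
    values = [v for row in f_m for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant map carries no contrast; return zeros to avoid 0 / 0.
        return [[0.0 for _ in row] for row in f_m]
    return [[(v - lo) / (hi - lo) for v in row] for row in f_m]


# Toy feature with C = 2 channels of size 2 x 2.
feature = [[[0, 2], [4, 6]],
           [[2, 4], [6, 8]]]
f_m = channel_mean(feature)          # [[1.0, 3.0], [5.0, 7.0]]
visual = min_max_normalize(f_m)
```

Normalizing each map by its own range makes maps from different pyramid levels comparable when rendered as images, although it also means that absolute activation magnitudes cannot be read off the visualization.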