Introduction

Object detection is crucial in autonomous driving [1, 2], object tracking [3] and other computer vision tasks [4,5,6]. With the advancement of deep learning, object detection has progressed significantly [7, 8].

Over the past 20 years, anchor-based methods [9,10,11] have achieved good accuracy in object detection tasks [12]. However, anchor-based methods face some challenges. Anchors are boxes of various sizes and aspect ratios that serve as candidate detections [13,14,15]. The shapes of objects [16, 17] in a digital image vary unpredictably, so manually designed anchors cannot fit objects of every shape well. Anchor-based object detection methods also suffer from a mismatch between positive and negative samples [12, 13]. To alleviate these challenges, key points based methods have been proposed [13]. CornerNet [13] replaces the traditional prediction of anchor boxes with the prediction of the top-left and bottom-right corners of a box. However, CornerNet [13] is sensitive to the boundaries of objects and is not aware of which pairs of keypoints should be grouped into objects [12, 13]. CenterNet [18], a center point based object detection method, was proposed to solve this problem.

The detection accuracy of small-scale objects in digital images is significantly lower than that of medium-scale and large-scale objects. This is called the imbalance problem in object detection. To alleviate the imbalance problem in detecting different scale objects, many methods have been proposed, such as enhancing object feature information, improving training strategies and data augmentation [19]. These methods achieve better detection results [20,21,22]. To address the sparsity of fault samples in real industrial environments, a few-sample fault diagnosis method based on parameter optimization and feature metrics has been proposed [23]. A method [24] that enhances cross-scale detection and focuses on valuable areas has also been proposed, and it has significant advantages in improving the detection of small-scale objects. These methods [23,24,25] inspire our work. In this paper, we propose a new multiple space based cascaded center point network (MSCCPNet) to alleviate the missed-object problem of the single-center-point based network. Object detection experiments show that MSCCPNet is effective and outperforms the single-center-point based network. The main contributions of this paper are as follows:

  1. We propose a novel multiple space based cascaded center point network (MSCCPNet) for object detection. The network extracts features in different spaces and channels, and improves detection through cascaded center points.

  2. We propose a novel structure, named the SCAN structure, that expands the number of channels, reduces the width and height of features, and performs a pooling operation. The SCAN structure can scan more objects in different scale spaces, so it can alleviate the imbalance problem in detecting different scale objects.

  3. We propose a cascaded center point structure. It predicts on the feature maps of two different successive spaces and integrates the results of the two centers by keeping high-confidence predictions and discarding low-confidence ones. It outperforms the original single-center-point based network.

  4. Experimental results show that our proposed MSCCPNet performs satisfactorily compared with other classical object detection algorithms. MSCCPNet has the highest detection accuracy in multiple object categories.

The rest of this paper is organized as follows. The following section presents the related work. The subsequent section introduces the proposed method. Experimental results are presented in “Experiments”. Finally, a succinct conclusion is provided in “Conclusion”.

Related work

Deep learning-based object detection algorithms can be divided into anchor-based methods and key points based methods.

R-CNN [11] is an early method in the anchor-based line of work. It combines region proposals with convolutional neural networks, which extract features very effectively. However, its detection speed leaves room for improvement. Fast R-CNN [7] was proposed subsequently and detects faster than R-CNN [11], but its speed can still be improved. Faster R-CNN [10] was then proposed and is faster again than Fast R-CNN [7]. However, as the real-time requirements of object detection tasks [1] grow, the demands on detection speed keep rising.

To increase the detection speed further, SSD [26] was proposed. SSD [26] draws inspiration from Faster R-CNN [10] and YOLO [27] and uses multi-layer feature information to classify and locate objects. It is simple to implement and its detection speed is competitive. Object detection is playing an increasingly important role [17, 28] in computer vision tasks. At the same time, to meet the requirements of different detection scenarios, the performance of object detection algorithms needs to improve. An algorithm must detect objects of various sizes [16, 28, 29], so it needs to be robust to scale. In most object detection tasks, small objects account for a large proportion of the scene [30]. Small objects are common in daily life and have broad application prospects [31]. Objects are often occluded or blurred, which seriously affects detection performance [32]. In digital images, large-scale objects are easy to detect because of their large area and rich features, whereas small objects occupy very little of the image and carry very little useful feature information [15, 16, 30].

A number of effective methods have been proposed to alleviate the problem of small object detection [9, 19, 33]. Feature Pyramid Networks (FPN) [9] fuse shallow features with deep features to increase the feature information of small objects. The feature information of small objects decreases as the convolution depth increases [34,35,36], so the shallow convolution layers retain relatively more of it. FPN [9] upsamples the deep feature information and then adds the upsampled result, element by element, to the shallow feature information of the convolution layer [9]. This operation builds a feature pyramid with different shapes and scales. However, FPN [9] upsamples deep features, and some feature information has already been lost before upsampling. Deepening the convolution layers enlarges the receptive field. If the receptive field is enlarged while the loss of feature information is minimized, the feature information of small objects can also be increased. DetNet [33] was proposed for this purpose. It improves the detection accuracy of small objects using dilated convolution.

Anchor-based methods usually produce many invalid candidate boxes, so their performance is sometimes unsatisfactory. Non-maximum suppression (NMS) [37] is used to remove overlapping candidate boxes. Although such techniques mitigate the problem, they are not ideal. Key points based methods were introduced to alleviate it. CornerNet [13] detects objects using the top-left and bottom-right corners, replacing the anchor mode with a key points mode and thereby avoiding a large number of invalid prediction boxes. However, CornerNet [13] needs to match two corners per object, which leads to a long matching time. Therefore, CenterNet [18] was proposed. CenterNet [18] transforms the object detection task into a key point estimation problem and predicts the category and location of an object by predicting its key point. Although some methods have improved the detection of small objects to a certain extent [38], there is still room for improvement. The imbalance problem in detecting different scale objects stems mainly from the difficulty of small object detection, so improving the accuracy of small object detection can alleviate it. At the same time, some objects may be missed when a single-center-point based network is used for object detection. To alleviate these problems, we propose a new multiple space based cascaded center point network (MSCCPNet) for object detection.

Our proposed method

We describe our proposed multiple space based cascaded center point network (MSCCPNet) for object detection in this section. The proposed MSCCPNet uses center points in two different spaces to conduct object detection. In the subsections that follow, to alleviate the imbalance problem in detecting different scale objects, we first construct a SCAN structure. We then introduce a central point selection method for obtaining a better prediction result. Finally, we describe the implementation details of our proposed MSCCPNet.

Fig. 1 Multiple space based cascaded center point network (MSCCPNet) for object detection architecture. The SCAN structure can alleviate the imbalance problem in detecting different scale objects by scanning more objects in different scale spaces. The Cascaded Center Point refers to the cascaded center point structure to predict the features of different spaces. Location Prediction comprises predictions for width and height as well as center point deviation

Multiple space based cascaded center point network (MSCCPNet)

To alleviate the problem that some objects are missed when a single-center-point based network is used for object detection, we propose a multiple space based cascaded center point network (MSCCPNet). It predicts the category and confidence of each object from the heat maps of two different successive spaces. The architecture of MSCCPNet is shown in Fig. 1. The input digital image is processed at a resolution of 512 \(\times \) 512 by the backbone network, which is ResNet-50 in this paper. ResNet-50 produces a feature map of 16 \(\times \) 16 \(\times \) 2048. This feature map is then upsampled by the Convolution structure in Fig. 1, which contains three ConvTranspose2d functions, yielding a feature map of 128 \(\times \) 128 \(\times \) 64. Next, the 128 \(\times \) 128 \(\times \) 64 feature map is processed by two structures: our proposed SCAN structure and our proposed cascaded center point structure. In the SCAN structure, the output of the Convolution structure passes through three branches, and a 128 \(\times \) 128 \(\times \) 64 feature map is obtained through channel expansion, width and height reduction, pooling and convolution. The feature maps produced by the Convolution structure and by the SCAN structure are each passed to the cascaded center point structure, which predicts the categories and confidence of objects in the digital image. Finally, we perform center point deviation prediction, as well as width and height prediction, to determine the location of objects in the digital image.
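The spatial sizes quoted above can be checked with simple arithmetic. The sketch below assumes that ResNet-50 downsamples by a total stride of 32 and that each of the three ConvTranspose2d layers doubles the width and height. The doubling factor is an assumption that is consistent with the reported sizes, not a setting stated in the text, and both helper functions are hypothetical.

```python
def backbone_output_size(input_size: int, total_stride: int = 32) -> int:
    """Spatial size after a backbone that downsamples by total_stride."""
    return input_size // total_stride


def upsampled_size(size: int, num_layers: int = 3, factor: int = 2) -> int:
    """Spatial size after num_layers upsampling layers, each scaling by factor.

    factor=2 per layer is an assumption, chosen to match the reported sizes.
    """
    for _ in range(num_layers):
        size *= factor
    return size


backbone_size = backbone_output_size(512)   # 512 / 32
head_size = upsampled_size(backbone_size)   # 16 -> 32 -> 64 -> 128
output_stride = 512 // head_size            # input pixels per heat map cell
print(backbone_size, head_size, output_stride)
```

Under these assumptions each heat map cell covers a 4 \(\times \) 4 region of the input image.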

SCAN structure

Fig. 2 Different contours of a given object in shallow and deep layers. With the deepening of the network depth, the characteristic information of the object is reduced

Objects of different scales are detected with unequal accuracy, and different tasks require detecting objects of different scales. We feed a digital image into ResNet-50 [39] and visualize the shallow and deep feature information, as shown in Fig. 2. In the shallow features the contours of objects are clearly visible, whereas in the deep features only some spots remain. Figure 2 also shows that as the convolutional neural network deepens, the feature information of objects becomes increasingly abstract. The feature information of some small-scale objects gradually disappears, which makes objects of different scales harder to detect. To alleviate this problem, we propose the SCAN structure.

As shown in Fig. 1, our proposed SCAN structure processes the input features in the following steps. First, the number of channels is quadrupled and the width and height of the input feature are halved [40]. In this first branch, a feature value is sampled at every other pixel, and the resulting sub-sampled feature maps are stacked along the channel dimension. This compresses the width and height of the feature map while expanding both the number of channels and the receptive field. Second, we pool the input features. Third, we fuse the pooling results with the input features and reduce the width and height so that the result can be stacked with the output of the first step. The resulting SCAN structure can be used like a convolution operation, and it reduces the loss of features to some extent.
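The first SCAN step, which quadruples the channels while halving the width and height, resembles a space-to-depth rearrangement. The following is a minimal sketch of that rearrangement on nested Python lists in channel, height, width order. It assumes a sampling interval of two pixels and covers only the first branch. The pooling and fusion branches are not sketched because their exact settings are not given here. The function name `space_to_depth` is hypothetical.

```python
def space_to_depth(feature, block=2):
    """Rearrange a C x H x W nested list into (C*block*block) x (H/block) x (W/block).

    Every block-th pixel is sampled, once for each of the block*block starting
    offsets, and the sub-sampled maps are stacked as new channels.
    """
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    stacked = []
    for dy in range(block):
        for dx in range(block):
            for c in range(channels):
                stacked.append([
                    [feature[c][y][x] for x in range(dx, width, block)]
                    for y in range(dy, height, block)
                ])
    return stacked


# One channel, 4 x 4 pixels.
feature = [[[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]]
stacked = space_to_depth(feature)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # channels, height, width
```

No value is discarded by this rearrangement: the 16 input values reappear as four 2 \(\times \) 2 channels, which is why it loses less information than strided sampling of a single map.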

Cascaded center point structure

Fig. 3 Heat map representation of cascaded center point selection

To improve the detection performance of the original single-center-point based network [18], we propose a cascaded center point structure. Its procedure is as follows. We first generate a feature map using the original single-center-point method. To alleviate the imbalance problem in detecting different scale objects, we then use our proposed SCAN structure, together with the original single-center-point method, to generate a second feature map. Finally, we combine the relevant heat maps from the two feature maps to predict the categories and confidence of objects.

Suppose we feed an image from the PASCAL VOC datasets into MSCCPNet. The PASCAL VOC datasets have 20 categories for object detection, so when MSCCPNet predicts the object category, the feature map has 20 channels, one per category. Similarly, when MSCCPNet predicts objects on the COCO datasets, the feature map has 80 channels, because the COCO datasets contain 80 object categories for detection. The categories and confidence of objects are predicted from the heat maps. Figure 3 shows heat maps of cascaded center point selection. It can be seen from Fig. 3 that the cascaded center point integrates the results of the single center points to obtain better results. In the generated diagram in the lower left corner of Fig. 3, there are light spots in the 4th row and 2nd column, which indicate that objects are present there. Because Fig. 3 illustrates the processing of a PASCAL VOC image, and the PASCAL VOC datasets have 20 object categories, the 2nd column in the 4th row corresponds to the sheep category. We retain the high-confidence predictions in the cascaded center point structure, so the proposed cascaded center point structure outperforms the original single-center-point based network to a certain extent.
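The correspondence between a grid cell in Fig. 3 and a category can be illustrated as follows. The sketch assumes that the 20 channels are laid out row by row in a grid of five columns and that the categories follow the usual alphabetical PASCAL VOC order. Neither assumption is stated in the text, but together they reproduce the sheep example. The helper `grid_cell_to_class` is hypothetical.

```python
# Usual alphabetical ordering of the 20 PASCAL VOC detection categories.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def grid_cell_to_class(row, col, columns=5):
    """Map a 1-based (row, col) grid cell to a category name.

    columns=5 is an assumed layout (4 rows x 5 columns for 20 channels).
    """
    channel = (row - 1) * columns + (col - 1)
    return VOC_CLASSES[channel]


print(grid_cell_to_class(4, 2))
```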

Note that the cascaded center point structure predicts on the feature maps of two different successive spaces. We integrate the results of the two centers by keeping the high-confidence prediction and discarding the low-confidence one, which gives the category and confidence output of the whole network.
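One simple way to realize "keep the high confidence, discard the low confidence" is an element-wise maximum over the two heat maps. The sketch below shows this for a single category channel. The element-wise maximum is an assumed fusion rule used for illustration and may differ from the exact rule used in MSCCPNet. The name `fuse_heatmaps` is hypothetical.

```python
def fuse_heatmaps(heat_a, heat_b):
    """Element-wise maximum of two equally shaped 2-D confidence maps."""
    if len(heat_a) != len(heat_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(heat_a, heat_b)
    ):
        raise ValueError("heat maps must have the same shape")
    return [
        [max(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(heat_a, heat_b)
    ]


# Heat map from the plain branch and from the SCAN branch (one category).
first = [[0.10, 0.80], [0.05, 0.20]]
second = [[0.30, 0.60], [0.02, 0.70]]
fused = fuse_heatmaps(first, second)
print(fused)
```

In this toy example the object at the bottom right is weak in the first map (0.20) but strong in the second (0.70), so it survives fusion. This is the kind of missed object the cascaded structure is meant to recover.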

Implementation details

We use the PyTorch [41] framework to implement MSCCPNet, with the residual neural network [39] as the backbone network. MSCCPNet is run on a GeForce RTX 2080Ti GPU.

The training settings on the PASCAL VOC datasets are as follows. Adam [42] is the optimizer. The input digital image is processed at a resolution of 512 \(\times \) 512. MSCCPNet is trained for 150 epochs. The weight decay is 5e−4 throughout, and the learning rate is 1e−3 in epochs 1–50, 1e−4 in epochs 51–100 and 1e−5 in epochs 101–150. ResNet-50 is the backbone and the batch size is 4.
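The PASCAL VOC learning rate schedule is a three-stage step schedule. It can be written as a small lookup, shown here as a plain Python sketch that is independent of any training framework. The function name `learning_rate` and the constant `WEIGHT_DECAY` are hypothetical.

```python
WEIGHT_DECAY = 5e-4  # constant over all 150 epochs


def learning_rate(epoch: int) -> float:
    """Learning rate for a 1-based epoch index under the three-stage schedule."""
    if not 1 <= epoch <= 150:
        raise ValueError("epoch must be in 1..150")
    if epoch <= 50:
        return 1e-3
    if epoch <= 100:
        return 1e-4
    return 1e-5


for epoch in (1, 50, 51, 100, 101, 150):
    print(epoch, learning_rate(epoch))
```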

The training settings on the COCO datasets are as follows. MSCCPNet is trained for 900,000 iterations with the SGD optimizer. The learning rate is 1.25e−4 and the weight decay is 1e−4. The input digital image is processed at a resolution of 512 \(\times \) 512. ResNet-50 serves as the backbone. Gamma is set to 0.9 and the batch size is set to 4.

The total loss function of the object detection algorithm, as used in [12, 13, 18], is:

$$\begin{aligned} L_{\det }&= L_{\text {k}} + \lambda _\textrm{size}L_\textrm{size} + \lambda _\textrm{off}L_\textrm{off} \end{aligned}$$
(1)

The loss of the key point, denoted by the symbol \(L_{\text {k}}\), is here expressed as focal loss [43]. \(L_\textrm{size}\) represents the predicted loss in width and height. The predicted loss of center point deviation is \(L_\textrm{off}\). Weights called \(\lambda _\textrm{size}\) and \(\lambda _\textrm{off}\) are used to balance out each part’s losses.
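Equation (1) is a weighted sum of three terms, which the following sketch makes explicit. The default weights are placeholder values chosen for illustration and are not taken from this paper. The three loss values are assumed to be already computed scalars, and the function name `detection_loss` is hypothetical.

```python
def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Total loss L_det = L_k + lambda_size * L_size + lambda_off * L_off.

    lambda_size and lambda_off are illustrative defaults, not reported values.
    """
    return l_k + lambda_size * l_size + lambda_off * l_off


# Toy values for the key point, size and offset losses.
total = detection_loss(l_k=1.0, l_size=2.0, l_off=0.5)
print(total)
```

A small \(\lambda _\textrm{size}\) keeps the width and height term, which is measured in pixels and can be large, from dominating the key point term.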

Prediction

At prediction time, we first feed the digital image into the backbone, which is ResNet-50 in this paper, for forward propagation. We then use the SCAN structure to process the output of the Convolution structure and obtain high-quality feature information. Next, we obtain the categories and confidence of the objects using the cascaded center point structure, and we locate the objects using the center point deviation prediction and the width and height prediction. With the locations and categories of the predicted objects, we can draw them on the digital image, which completes the object detection task.
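The last two prediction steps can be sketched as follows: center points are taken as local maxima of a heat map, and each center is combined with its predicted deviation and size to form a box. This is a minimal sketch in the style of center point detectors. It assumes a 3 \(\times \) 3 neighbourhood for the peak test, a confidence threshold of 0.3, sizes expressed in heat map cells, and an output stride of 4 (512 / 128). All of these are illustrative assumptions, and the function names are hypothetical.

```python
def find_peaks(heat, threshold=0.3):
    """Return (score, y, x) for cells that are maxima of their 3x3 neighbourhood."""
    height, width = len(heat), len(heat[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heat[y][x]
            if score < threshold:
                continue
            neighbours = [
                heat[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if score >= max(neighbours):
                peaks.append((score, y, x))
    return sorted(peaks, reverse=True)


def decode_box(y, x, offset, size, stride=4):
    """Convert a center cell, its (dx, dy) deviation and (w, h) size to an image-space box."""
    dx, dy = offset
    box_w, box_h = size
    center_x = (x + dx) * stride
    center_y = (y + dy) * stride
    half_w = box_w * stride / 2
    half_h = box_h * stride / 2
    return (center_x - half_w, center_y - half_h, center_x + half_w, center_y + half_h)


heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.4, 0.1]]
peaks = find_peaks(heat)
score, y, x = peaks[0]
box = decode_box(y, x, offset=(0.5, 0.25), size=(2.0, 1.0))
print(peaks, box)
```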

Experiments

In this section, we evaluate our proposed MSCCPNet on the PASCAL VOC [44] and COCO [45] datasets and compare it against seven comparable object detection methods. We first introduce the datasets and evaluation indicators. We then conduct the ablation experiments. Finally, we compare our proposed MSCCPNet with seven object detection algorithms.

Datasets

In the field of object detection, datasets such as COCO [45] and PASCAL VOC [44] are frequently employed. The PASCAL VOC and COCO datasets can be used in computer vision tasks such as human pose estimation [46, 47], object segmentation [48, 49] and object detection. The PASCAL VOC 2007 datasets contain 9963 images and 24,640 annotated objects. The PASCAL VOC 2012 trainval sets contain 11,530 images and 27,450 annotated objects. The PASCAL VOC datasets include 20 object categories such as bus, chair, table and boat. The PASCAL VOC datasets [44] are well suited to performance analysis of algorithms because they have good overall image quality and comparatively thorough annotation. They also provide a standard dataset format with three important folders, namely JPEGImages, Annotations and ImageSets. The JPEGImages folder contains the images for training and testing. The Annotations folder contains the annotation data in XML format, with one XML file for each image in the JPEGImages folder. The ImageSets folder contains the txt files that list the image names.

Table 1 The ablation experiment is trained on COCO2017 train datasets and tested on COCO2017 validation datasets (COCO minival set)

Compared with PASCAL VOC [44], detection on the COCO datasets [45] is more difficult. The COCO datasets [45] have more than 300,000 images and more than 2,000,000 instances. In the field of object detection, we generally use 80 categories of the COCO datasets [45]. Both the range of object sizes and the number of small objects in the COCO datasets [45] are large.

Evaluation indicators

Evaluating the quality of an object detection algorithm requires a standard rule, and the evaluation is complex. For a specific object, the detection is generally judged by the degree of overlap between the prediction box and the ground truth box. The Intersection over Union (IOU) is generally used to quantify this overlap [11, 26, 50]. IOU is the ratio of the intersection to the union of the prediction box and the ground truth box. We often set a threshold on the IOU to judge whether a prediction box is correct, and the threshold is usually 0.5. When the IOU value is greater than 0.5, the prediction box is considered correct. When the IOU value is less than 0.5, the prediction box is considered wrong, that is, the detection is invalid.
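The IOU computation can be sketched as follows. The sketch assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates, which is one common convention and is not specified in the text. The function name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
close_prediction = (2, 0, 12, 10)   # intersection 80, union 120
loose_prediction = (5, 0, 15, 10)   # intersection 50, union 150
for prediction in (close_prediction, loose_prediction):
    value = iou(prediction, ground_truth)
    print(round(value, 3), "correct" if value > 0.5 else "wrong")
```

With the 0.5 threshold, the first prediction counts as correct and the second as wrong, even though the second still covers half of the ground truth box.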

Because each prediction box is judged right or wrong, four types of samples arise when evaluating an object detection algorithm: True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN). After examining each prediction box, we know whether it is a TP or an FP. We can then calculate the Recall rate and Precision rate from TP, FP and FN. Recall, Precision and AP are defined as follows:

$$\begin{aligned} R_\mathrm{(recall)}&= \frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}} \end{aligned}$$
(2)
$$\begin{aligned} P_\mathrm{(precision)}&= \frac{{\text {TP}}}{{\text {TP}}+{\text {FP}}} \end{aligned}$$
(3)
$$\begin{aligned} {\text {AP}}&= \int _0^1P{\text {d}}R \end{aligned}$$
(4)

As the prediction boxes are traversed, each one yields a corresponding pair of P and R values, which defines a point. Connecting all the points gives the PR curve. AP is the area under the PR curve.
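The traversal described above can be sketched as follows. Detections are assumed to be sorted by descending confidence and already labelled as TP or FP, for example with the IOU threshold. The area is approximated by a plain rectangle sum without interpolation, which is the simplest reading of Eq. (4). The official PASCAL VOC and COCO protocols use interpolated variants, so this sketch will not reproduce their exact numbers. The function name `average_precision` is hypothetical.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the PR curve for detections sorted by descending confidence.

    is_true_positive: list of booleans, True for TP and False for FP.
    num_ground_truth: number of ground truth boxes, equal to TP + FN.
    """
    true_positives = 0
    false_positives = 0
    area = 0.0
    previous_recall = 0.0
    for hit in is_true_positive:
        if hit:
            true_positives += 1
        else:
            false_positives += 1
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / num_ground_truth
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area


# Three detections for a category with two ground truth boxes.
ap = average_precision([True, False, True], num_ground_truth=2)
print(round(ap, 4))
```

In this toy case the PR points are (R, P) = (0.5, 1), (0.5, 0.5) and (1, 2/3), so the sum is 0.5 + 0 + 1/3.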

Table 2 The ablation experimental results on PASCAL VOC 2007+2012 datasets
Table 3 Experimental results on PASCAL VOC 2007+2012 datasets

Ablation analysis

To show the usefulness of MSCCPNet, we conduct ablation experiments on the SCAN structure and the cascaded center point structure. The basic algorithm model, which has neither the SCAN structure nor the cascaded center point structure, is named BaseNet.

Fig. 4 The test results of BaseNet and MSCCPNet in 80 categories. The blue bar chart represents the results of BaseNet, and the red part represents the improvement of MSCCPNet (ours) compared with BaseNet

The ablation experiment of SCAN structure

To demonstrate the effectiveness of our proposed SCAN structure, we carry out verification on the PASCAL VOC and COCO2017 datasets. We consider four cases: (1) BaseNet, (2) SCAN+BaseNet, (3) Cascaded Center Point+BaseNet and (4) our proposed MSCCPNet. Note that MSCCPNet is BaseNet equipped with both the cascaded center point structure and the SCAN structure. Table 1 reports the ablation results after training on the COCO2017 train datasets and testing on the COCO2017 validation datasets (COCO minival set). SCAN+BaseNet has better detection performance than BaseNet. Its \({\text {AP}}_{{\text {S}}}\), \({\text {AP}}_{{\text {M}}}\) and \({\text {AP}}_{{\text {L}}}\) are 14.7%, 39.2% and 49.4% respectively, and its \({\text {AR}}_{{\text {S}}}\), \({\text {AR}}_{{\text {M}}}\) and \({\text {AR}}_{{\text {L}}}\) are 23.9%, 53.7% and 70.0% respectively, all higher than those obtained by BaseNet. In particular, the \({\text {AP}}_{{\text {S}}}\) obtained by SCAN+BaseNet is 0.9% higher than that obtained by BaseNet. These results show that the SCAN structure improves the detection accuracy of small objects, and that the detection performance on large-scale and medium-scale objects also improves. The AP obtained by SCAN+BaseNet is 33.2%, 0.7% higher than that obtained by BaseNet. Table 2 reports the ablation results after training on the PASCAL VOC 2007+2012 train datasets and testing on the PASCAL VOC 2007 test datasets. For the majority of object categories, SCAN+BaseNet achieves good detection accuracy. Tables 1 and 2 together indicate that the SCAN structure can alleviate the imbalance problem in detecting different scale objects.

The ablation experiment of cascaded center point structure

To demonstrate the effectiveness of our proposed cascaded center point structure, we carry out verification on the PASCAL VOC and COCO2017 datasets. Table 1 shows that the AP, \({\text {AP}}_{50}\) and \({\text {AP}}_{75}\) obtained by Cascaded Center Point+BaseNet are 33.3%, 51.8% and 35.4% respectively, and that its \({\text {AR}}_{1}\), \({\text {AR}}_{10}\) and \({\text {AR}}_{100}\) are 29.3%, 46.4% and 48.5% respectively, all higher than those obtained by BaseNet. The AP of 33.3% is 0.8% higher than that obtained by BaseNet. AP and \({\text {AR}}_{1}\) better reflect the accuracy of sample detection. The \({\text {AR}}_{1}\) obtained by Cascaded Center Point+BaseNet is 0.6% higher than that obtained by BaseNet, indicating that the cascaded center point structure outperforms the original single-center-point based network.

Table 4 Experimental results on PASCAL VOC 2007 datasets

Figure 4 shows the test results of BaseNet and MSCCPNet in 80 categories. MSCCPNet has higher detection accuracy than BaseNet in most object categories. Figure 5 shows the qualitative results of the above four methods, among which MSCCPNet has the highest detection accuracy. For example, the sofa in the 1st row of Fig. 5 is detected correctly by MSCCPNet and incorrectly by BaseNet. Figure 5 also shows that the detection accuracy of MSCCPNet has improved on small-scale objects (5th row), on medium-scale objects (2nd row) and on large-scale objects (rows 3, 4, 5 and 6). MSCCPNet's detection performance is satisfactory. Therefore, the SCAN structure and the cascaded center point structure play a positive and effective role in object detection when they are integrated into our proposed method. We also take some images from daily life, and Fig. 6 shows the detection results on them. These results indicate that our proposed method is also competitive in practical detection applications.

Quantitative comparisons

To further highlight the efficiency of our proposed MSCCPNet, we compare it to other standard object detection algorithms.

Table 3 shows the results when the six methods are trained on the PASCAL VOC 2007+2012 train datasets and tested on the PASCAL VOC 2007 test datasets. MSCCPNet has high detection accuracy on 13 object categories, including bike, cat, bird, bus, bottle, cow and dog. Table 4 shows the results when the six methods are trained on the PASCAL VOC 2007 train datasets and tested on the PASCAL VOC 2007 test datasets. Here MSCCPNet has high detection accuracy on 17 object categories, including bike, bird, horse, bus, cat, cow and dog. The results on the PASCAL VOC datasets demonstrate the validity of MSCCPNet and its competitiveness compared with other methods.

Fig. 5 Qualitative results of ablation experiments. a BaseNet, b SCAN+BaseNet, c Cascaded Center Point+BaseNet, d MSCCPNet

Fig. 6 Test results of images taken in our daily life. a BaseNet, b SCAN+BaseNet, c Cascaded Center Point+BaseNet, d MSCCPNet

Table 5 shows the results when the six methods are trained on the COCO2017 train datasets and tested on the COCO2017 validation datasets (COCO minival set). The \({\text {AP}}_{\text {S}}\), \({\text {AP}}_{\text {M}}\) and \({\text {AP}}_{\text {L}}\) obtained by MSCCPNet are 14.4%, 39.2% and 49.2% respectively. These values are 1%, 2.2% and 3.6% higher than those obtained by EfficientDet-D1 [51], 0.6%, 0.6% and 2.4% higher than those obtained by CenterNet [18], and 0.4%, 5.7% and 5.7% higher than those obtained by RetinaNet [43]. They are also higher than those of most other methods, which shows that MSCCPNet can play a positive role in alleviating the imbalance problem in detecting different scale objects. The \({\text {AR}}_{1}\), \({\text {AR}}_{10}\) and \({\text {AR}}_{100}\) obtained by MSCCPNet are 29.5%, 46.5% and 48.5% respectively, which are likewise higher than those of most methods.

Table 5 also shows that MSCCPNet is competitive in both detection speed and detection accuracy. The AP of MSCCPNet is 1% higher than that of CenterNet [18] at a similar detection speed. MSCCPNet runs at 60 frames per second (FPS), compared with 39 FPS for EfficientDet-D0 [51] and 30 FPS for EfficientDet-D2 [51], while its AP is 5.6% higher than that of EfficientDet-D0 [51] and 1.0% higher than that of EfficientDet-D2 [51]. This shows that MSCCPNet improves not only the detection speed but also the detection accuracy.

Table 6 shows the results when the methods are trained with different input dimensions on the COCO2017 train datasets and tested on the COCO2017 validation datasets (COCO minival set). MSCCPNet obtains different results at different input dimensions, and its results improve as the dimensions increase. This indicates that the larger the input image, the richer the features the model extracts and the better the detection performance. Table 7 shows the results when the methods are trained with different backbones on the COCO2017 train datasets and tested on the COCO2017 validation datasets (COCO minival set). MSCCPNet also achieves better results with different backbone networks. Therefore, our proposed MSCCPNet has scientific research value.

Table 5 Experimental results on COCO2017 datasets
Table 6 The performance of different dimensions on COCO2017 datasets
Table 7 The performance of different backbone networks on COCO2017 datasets

Conclusion

In this paper, we propose a new multiple space based cascaded center point network (MSCCPNet) for object detection. It is based on the SCAN structure and the cascaded center point structure, and it obtains the final detection result through cascaded heat map prediction, center point deviation prediction, and width and height prediction. Given appropriate training strategies, our proposed SCAN structure can alleviate the imbalance problem in detecting different scale objects, and the proposed cascaded center point structure outperforms the original single-center-point based network. Experimental results demonstrate that, under a unified training and testing environment, MSCCPNet is effective and competitive when compared with several representative object detection algorithms. Our proposed method improves detection accuracy while maintaining detection speed. However, it still occupies considerable storage space, which limits its performance in industrial applications. To address this challenge, we will construct a new method for object detection tasks based on multi-resolution technology in the future.