Multiple space based cascaded center point network for object detection

Jiang, Zhiqiang; Dong, Yongsheng; Pei, Yuanhua; Zheng, Lintao; Tao, Fazhan; Fu, Zhumu

doi:10.1007/s40747-023-01102-7

Original Article
Open access
Published: 23 June 2023

Volume 9, pages 7213–7225, (2023)
Complex & Intelligent Systems

Zhiqiang Jiang¹,
Yongsheng Dong ORCID: orcid.org/0000-0002-6281-9658¹,
Yuanhua Pei¹,
Lintao Zheng¹,
Fazhan Tao¹ &
…
Zhumu Fu¹

Abstract

Numerous key points based methods have been proposed for object detection. To alleviate the problem that some objects may be missed when a single-center-point based network is used for object detection, we propose a brand-new multiple space based cascaded center point network (MSCCPNet). In particular, we first build a novel structure that alleviates the imbalance problem in detecting different scale objects by scanning more objects in different scale spaces. We then propose a cascaded center point structure that predicts the category and confidence of each object by integrating the results of the two centers, keeping high-confidence predictions and discarding low-confidence ones. Finally, we determine the object’s location by predicting the center point deviation as well as the width and height of the object. Experiments on the PASCAL VOC and COCO datasets, run on a GeForce RTX 2080Ti, show that MSCCPNet achieves competitive accuracy compared with many classical object detection algorithms.

Introduction

Object detection is crucial in automatic driving [1, 2], object tracking [3] and other computer vision tasks [4,5,6]. With the advancement of deep learning technologies, object detection has advanced significantly [7, 8].

Over the previous 20 years, anchor-based methods [9,10,11] have achieved good detection accuracy in object detection tasks [12]. However, anchor-based methods face some challenges. Anchors are boxes of various sizes and aspect ratios that serve as detection candidates [13,14,15]. The shapes of objects [16, 17] in a digital image vary unpredictably, so hand-designed anchors cannot fit objects of every shape well. Anchor-based object detection methods also have the problem of mismatched positive and negative samples [12, 13]. To alleviate these challenges, key points based methods have been proposed [13]. CornerNet [13] replaces the traditional prediction of anchor boxes with the prediction of the top-left and bottom-right corners of a box. However, CornerNet [13] is sensitive to object boundaries while being unaware of which pairs of keypoints should be grouped into objects [12, 13]. CenterNet [18], a center point based object detection method, was proposed to solve this problem.

The detection accuracy of small-scale objects in digital images is generally significantly lower than that of medium-scale and large-scale objects. This is called an imbalance problem in object detection. To alleviate the imbalance problem in detecting different scale objects, many methods have been proposed, such as enhancing object feature information, improving training strategies and data expansion [19]. These methods achieve better detection results [20,21,22]. To address the sparsity of fault samples in actual industrial environments, a few-sample fault diagnosis method based on parameter optimization and feature metrics has been proposed [23]. A method [24] that enhances cross-scale detection and focuses on valuable areas has also been proposed, and it has significant advantages in improving the detection performance of small-scale objects. These methods [23,24,25] have high research value and inspire our work. In this paper, we propose a brand-new multiple space based cascaded center point network (MSCCPNet) to alleviate the object missing problem of the single-center-point based network. Object detection experiments show that MSCCPNet is effective and outperforms the single-center-point based network. The main contributions of this paper are as follows:

1. We propose a novel multiple space based cascaded center point network (MSCCPNet) for object detection. The network extracts features in different spaces and channels, and enhances object detection efficiency through the cascaded center point.
2. We propose a novel structure, named the SCAN structure, that expands the number of channels, reduces the width and height of features, and performs a pooling operation. The SCAN structure can scan more objects in different scale spaces, so it can alleviate the imbalance problem in detecting different scale objects.
3. We propose a cascaded center point structure. It predicts the feature maps of two different sequent spaces and integrates the results of the two centers by choosing the high confidence and discarding the low confidence. It outperforms the original single-center-point based network.
4. Experimental results show that our proposed MSCCPNet has satisfactory performance compared with other classical object detection algorithms. MSCCPNet has the highest detection accuracy among multiple object categories.

The rest of this paper is organized as follows. The following section presents the related work. The subsequent section introduces the proposed method. Experimental results are presented in “Experiments”. Finally, a succinct conclusion is provided in “Conclusion”.

Related work

Deep learning-based object detection algorithms can be divided into anchor-based methods and key points based methods.

R-CNN [11] is an earlier method that deals with object detection tasks using anchors. It combines region proposals with convolutional neural networks, which makes it effective at feature extraction. However, its detection speed leaves room for improvement. Fast R-CNN [7] was proposed later and improves on the detection speed of R-CNN [11], but it still leaves room for improvement. Faster R-CNN [10] was then proposed and improves the detection speed further. However, as the real-time requirements of object detection tasks [1] grow, the demands on detection speed keep rising.

To increase the detection speed even more, the SSD [26] method was proposed. SSD [26] draws inspiration from Faster R-CNN [10] and YOLO [27] and uses multi-layer feature information to classify and locate objects. It is simple to implement, and its detection speed is competitive. As science and technology develop further, object detection plays an increasingly important role [17, 28] in computer vision tasks. At the same time, the detection performance of object detection algorithms needs to improve to meet the requirements of different detection scenarios. An object detection algorithm must detect objects at different scales, which requires it to be robust to scale. In digital images, we typically want to recognize objects of various sizes [16, 28, 29]. In most object detection tasks, small objects account for a large proportion of the scene [30]. Small objects are common in daily life and have broad application prospects [31]. Objects are often occluded or blurred, which seriously affects detection performance [32]. In digital images, large-scale objects are easy to detect because of their large area and rich features. Small objects occupy very little space in digital images and provide very little useful feature information [15, 16, 30].

A number of effective methods have been proposed to alleviate the problem of small object detection [9, 19, 33]. Based on the idea of fusing shallow features with deep features, Feature Pyramid Networks (FPN) [9] were proposed to increase the feature information of small objects. The feature information of small objects decreases as convolution depth increases [34,35,36], so shallow convolution layers retain relatively more of it. The FPN [9] method upsamples the deep feature information and then adds the upsampled result to the shallow feature information element by element [9]. This operation constructs a feature pyramid structure with different shapes and scales. However, FPN [9] upsamples the deep features of the convolution layers, and some features have already been lost before upsampling. Convolution layers are deepened to enlarge the receptive field. If the receptive field can be enlarged while the loss of feature information is minimized, the feature information of small objects can also be increased. To this end, DetNet [33] was proposed. DetNet [33] improves the detection accuracy of small objects by using dilated convolution.

Anchor-based methods usually produce many invalid candidate boxes, and thus their performance is sometimes unsatisfactory. Non-maximum suppression (NMS) [37] is used to remove overlapping candidate boxes. Although there are some ways to alleviate this problem, they are not ideal. Key points based methods were proposed to alleviate it. The CornerNet [13] method uses the upper-left corner and the lower-right corner to detect objects, replacing the previous anchor mode with a key points mode for object detection. This avoids generating a large number of invalid prediction boxes. However, CornerNet [13] needs to match two corners for each object, which leads to a long matching time. Therefore, CenterNet [18] was proposed. CenterNet [18] transforms the object detection task into a key point estimation problem and predicts the category and location of objects by predicting the key point. Although some methods have improved the detection performance of small objects to a certain extent [38], there is still room for improvement. The imbalance problem of detecting different scale objects stems mainly from the difficulty of small object detection, so improving the accuracy of small object detection can alleviate it. At the same time, some objects may be missed when a single-center-point based network is used for object detection. To alleviate the aforementioned problems, we propose a brand-new multiple space based cascaded center point network (MSCCPNet) for object detection.

Our proposed method

We describe our proposed multiple space based cascaded center point network (MSCCPNet) for object detection in this section. The proposed MSCCPNet uses center points in two different spaces to conduct object detection. In the subsections that follow, to alleviate the imbalance problem in detecting different scale objects, we first construct a SCAN structure. We then introduce a central point selection method for obtaining a better prediction result. Finally, we describe the implementation details of our proposed MSCCPNet.

Multiple space based cascaded center point network (MSCCPNet)

To alleviate the problem that some objects are missed when a single-center-point based network is used for object detection, we propose a multiple space based cascaded center point network (MSCCPNet). It predicts the category and confidence of each object through the heat maps of two different sequent spaces. For clarity, we plot the architecture of MSCCPNet in Fig. 1. As Fig. 1 shows, the backbone network, which is ResNet-50 in this paper, processes the input digital image. The input digital image is processed at 512 $\times $ 512 dimensions. After ResNet-50 extracts features from the input image, we get a feature map of 16 $\times $ 16 $\times $ 2048. The output feature map of ResNet-50 is then processed by the transposed convolutions of the Convolution structure in Fig. 1 (the Convolution structure contains three ConvTranspose2d functions), yielding a feature map of 128 $\times $ 128 $\times $ 64. Next, this 128 $\times $ 128 $\times $ 64 feature map is processed by two structures: our proposed SCAN structure and our proposed cascaded center point structure. The three branches of the SCAN structure process the output of the Convolution structure, and a 128 $\times $ 128 $\times $ 64 feature map is obtained through channel expansion, width and height reduction, pooling and convolution operations. The feature maps obtained from the Convolution structure and from the SCAN structure are each passed to the cascaded center point structure, which predicts the categories and confidence of objects in the digital image. Finally, we perform center point deviation prediction, as well as width and height prediction, to determine the location of objects in the digital image.
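The shape bookkeeping in this description can be checked with a small sketch. The text only fixes the endpoints (a 512 × 512 input, a 16 × 16 × 2048 backbone output, three ConvTranspose2d layers and a 128 × 128 × 64 result). The kernel size 4, stride 2, padding 1 and the intermediate channel widths 256 and 128 below are assumptions chosen so that each layer doubles the resolution; they are not stated in the paper.

```python
def conv_transpose2d_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_shapes(input_size=512, backbone_stride=32, backbone_channels=2048,
                 up_channels=(256, 128, 64)):
    """Return (stage name, (height, width, channels)) for each stage.

    backbone_stride=32 reproduces the 512 -> 16 reduction described for
    ResNet-50; up_channels holds assumed channel widths, only the last
    value (64) comes from the text.
    """
    side = input_size // backbone_stride
    shapes = [("backbone", (side, side, backbone_channels))]
    for i, channels in enumerate(up_channels, start=1):
        side = conv_transpose2d_out(side)  # each layer doubles height and width
        shapes.append((f"upsample_{i}", (side, side, channels)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```

Under these assumptions the spatial side grows 16 → 32 → 64 → 128, which matches the 128 × 128 × 64 map that feeds the SCAN structure and the cascaded center point structure.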

SCAN structure

As we all know, there is an imbalance problem in detecting different scale objects. Because of different task requirements, it is necessary to detect objects with different scales. We transfer the digital image into ResNet-50 [39], and then graphically display the shallow feature information and deep feature information. Figure 2 shows the shallow feature information and the deep feature information. From Fig. 2, we can clearly see the contour of objects in the shallow feature information and some spots of objects in the deep feature information. We can also see from Fig. 2 that with the deepening of the convolutional neural network, the feature information of objects is gradually abstracted. Some small-scale objects’ feature information will gradually disappear. The detection difficulty of objects with different scales will increase. To alleviate this problem, we propose a SCAN structure.

As we can see from Fig. 1, our proposed SCAN structure processes the input features in the following steps. The first step quadruples the number of channels and halves the width and height of the input feature [40]. In this first branch, a feature value is selected at every other pixel, and the resulting feature maps are stacked. This compresses the width and height of the feature map, and it expands both the number of channels and the receptive field. Second, we pool the input features. Third, we fuse the pooling results with the input features and reduce the width and height so that the result can be stacked with the output of the first step. Like a convolution operation, the resulting SCAN structure can reduce the loss of features to some extent.
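As a rough illustration of the first two steps, the sketch below slices a feature map at every second pixel and stacks the four sub-grids as channels (C × H × W becomes 4C × H/2 × W/2), and separately applies a 2 × 2 pooling. It uses plain nested lists instead of tensors. The function names are hypothetical, and the choice of max pooling with a 2 × 2 window is an assumption, since the text does not say which pooling is used; the fusion and convolution steps are omitted.

```python
def space_to_depth(feature):
    """Rearrange a C x H x W nested list into 4C x H/2 x W/2.

    Each of the four sub-grids keeps every second pixel, starting from a
    different (row, column) offset; the sub-grids are stacked as channels.
    """
    out = []
    for channel in feature:
        for row_off in (0, 1):
            for col_off in (0, 1):
                out.append([row[col_off::2] for row in channel[row_off::2]])
    return out


def max_pool_2x2(feature):
    """2 x 2 max pooling with stride 2 on a C x H x W nested list (assumed pooling type)."""
    pooled = []
    for channel in feature:
        rows = []
        for r in range(0, len(channel) - 1, 2):
            rows.append([
                max(channel[r][c], channel[r][c + 1],
                    channel[r + 1][c], channel[r + 1][c + 1])
                for c in range(0, len(channel[r]) - 1, 2)
            ])
        pooled.append(rows)
    return pooled


# One channel with a 4 x 4 spatial grid.
x = [[[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]]

sliced = space_to_depth(x)   # 4 channels, each 2 x 2
pooled = max_pool_2x2(x)     # 1 channel, 2 x 2
print(len(sliced), sliced[0], pooled[0])
```

No value is discarded by the slicing step: all 16 input values survive in the four stacked channels, which is one way to read the claim that the structure reduces feature loss while shrinking width and height.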

Cascaded center point structure

To improve the detection performance of the original single-center-point based network [18], we propose a cascaded center point structure, which works as follows. We first generate a feature map using the original single-center-point method. To alleviate the imbalance problem in detecting different scale objects, we then use our proposed SCAN structure together with the original single-center-point method to generate a second feature map. Finally, we combine the relevant heat maps from the two feature maps to predict the categories and confidence of objects.

Suppose we pass an image from the PASCAL VOC datasets to MSCCPNet. The PASCAL VOC datasets have 20 categories for object detection tasks, so when MSCCPNet predicts the object category, the feature map has 20 channels, one for each object category. Similarly, when MSCCPNet predicts objects on the COCO datasets, we get a feature map of 80 channels (the COCO datasets contain 80 object categories for object detection), one for each object category. The categories and confidence of objects are predicted from the heat map. Figure 3 shows heat maps of cascaded center point selection. It can be seen from Fig. 3 that the cascaded center point integrates the single-center-point results to obtain better results. In the generated diagram in the lower-left corner of Fig. 3, there are light spots in the 4-th row and 2-nd column, which mean that there are objects here. Because Fig. 3 illustrates how the cascaded center point structure processes an image from the PASCAL VOC datasets, which have 20 object categories, the 2-nd column in the 4-th row corresponds to the sheep category. Because we retain the high-confidence predictions in the cascaded center point structure, it outperforms the original single-center-point based network to a certain extent.

Note that the cascaded center point structure predicts the feature maps of two different sequent spaces. We integrate the results of the two centers with the idea of choosing the high confidence and discarding the low confidence, and thus, we can get the results of the whole network about the category and confidence of the object.
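One simple reading of "choosing the high confidence and discarding the low confidence" is an element-wise maximum over the two heat maps. The sketch below implements that reading on nested lists; the paper does not spell out the exact fusion rule, so the element-wise maximum, the function names and the toy confidence values are all assumptions made for illustration.

```python
def fuse_heatmaps(heat_a, heat_b):
    """Element-wise maximum of two C x H x W confidence maps (nested lists)."""
    return [
        [[max(a, b) for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(heat_a, heat_b)
    ]


def top_center(heat):
    """Return (confidence, category, row, column) of the strongest response."""
    return max(
        (value, c, r, col)
        for c, channel in enumerate(heat)
        for r, row in enumerate(channel)
        for col, value in enumerate(row)
    )


# Two categories on a 2 x 2 grid; values are made-up confidences.
heat_base = [[[0.1, 0.7], [0.2, 0.1]], [[0.0, 0.1], [0.3, 0.2]]]
heat_scan = [[[0.2, 0.4], [0.1, 0.1]], [[0.0, 0.1], [0.9, 0.1]]]

fused = fuse_heatmaps(heat_base, heat_scan)
print(top_center(fused))
```

In this toy case the first map alone would report category 0 as its strongest center, while the fused map also keeps the stronger category 1 response that only the second map found, which is the kind of missed object the cascaded structure is meant to recover.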

Implementation details

We use the PyTorch [41] framework to implement MSCCPNet. We use the residual neural network [39] as a backbone network. GeForce RTX 2080Ti is used to implement our proposed MSCCPNet.

The training details on the PASCAL VOC datasets are as follows. The optimizer is Adam [42]. The input digital image is processed at 512 $\times $ 512 dimensions. MSCCPNet is trained for 150 epochs. The learning rate is 1e−3 in epochs 1–50, 1e−4 in epochs 51–100 and 1e−5 in epochs 101–150, and the weight decay is 5e−4 throughout. The backbone is ResNet-50, and the batch size is 4.
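The PASCAL VOC schedule is a three-stage step schedule and can be written as a small lookup. The function and constant names are hypothetical; the values are the ones listed in the training details.

```python
WEIGHT_DECAY = 5e-4  # constant across all 150 epochs
BATCH_SIZE = 4


def voc_learning_rate(epoch):
    """Learning rate for a 1-based epoch index in the 150-epoch VOC schedule."""
    if not 1 <= epoch <= 150:
        raise ValueError("epoch must be in 1..150")
    if epoch <= 50:
        return 1e-3
    if epoch <= 100:
        return 1e-4
    return 1e-5


for epoch in (1, 50, 51, 100, 101, 150):
    print(epoch, voc_learning_rate(epoch))
```

The rate drops by a factor of ten at the start of epochs 51 and 101.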

The training details on the COCO datasets are as follows. MSCCPNet is trained for 900,000 iterations with a learning rate of 1.25e−4 and a weight decay of 1e−4. The optimizer is SGD. The input digital image is processed at 512 $\times $ 512 dimensions. ResNet-50 serves as the backbone. Gamma is set to 0.9 and the batch size is set to 4.

The total loss function of the object detection algorithm, as used in [12, 13, 18], is as follows:

$$\begin{aligned} L_{\det }&= L_{\text {k}} + \lambda _\textrm{size}L_\textrm{size} + \lambda _\textrm{off}L_\textrm{off} \end{aligned}$$

(1)

The loss of the key point, denoted by the symbol $L_{\text {k}}$, is here expressed as focal loss [43]. $L_\textrm{size}$ represents the predicted loss in width and height. The predicted loss of center point deviation is $L_\textrm{off}$. Weights called $\lambda _\textrm{size}$ and $\lambda _\textrm{off}$ are used to balance out each part’s losses.
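A minimal sketch of how the three terms combine is given below. The default weights 0.1 and 1.0 follow the values commonly used with CenterNet and are assumptions here, because the text does not state them. The focal term is the standard binary focal loss with the usual defaults alpha = 0.25 and gamma = 2; key points based detectors often use a modified variant, so this is an illustration rather than the exact $L_{\text {k}}$ of the paper.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p in (0, 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Total loss of Eq. (1): L_det = L_k + lambda_size * L_size + lambda_off * L_off."""
    return l_k + lambda_size * l_size + lambda_off * l_off


# A confident correct prediction contributes far less than an uncertain one.
print(focal_loss(0.9, 1), focal_loss(0.6, 1))
print(detection_loss(l_k=1.0, l_size=2.0, l_off=0.5))
```

With these assumed weights the size term is scaled down by a factor of ten relative to the other two terms; size errors are measured in pixels and tend to be numerically larger than the key point and offset terms.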

Prediction

When predicting a digital image, we first transfer the digital image into the backbone for forward propagation processing. The backbone in this paper is ResNet-50. Then we use SCAN structure to process the result of Convolution structure and obtain high-quality feature information. Next, we acquire the categories and confidence of the objects using the cascaded center point structure, and we locate the objects using the center point deviation prediction, width and height prediction. After getting the location and categories of the predicted objects, we can display the location and categories on digital images, so as to complete the object detection task.
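The final localization step can be sketched as follows. A heat map peak at integer position (col, row) is refined by the predicted center point deviation and combined with the predicted width and height to give a box. The stride of 4 follows from the 512 × 512 input and the 128 × 128 feature map; treating the deviation and the size as feature-map units is an assumption, and the function name and input values are hypothetical.

```python
def decode_box(col, row, offset, size, stride=4):
    """Map a heat-map peak to an (x1, y1, x2, y2) box in input-image pixels.

    offset = (dx, dy) is the predicted center point deviation and
    size = (w, h) the predicted width and height, both assumed to be
    expressed in feature-map units.
    """
    cx = (col + offset[0]) * stride
    cy = (row + offset[1]) * stride
    w = size[0] * stride
    h = size[1] * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = decode_box(col=64, row=32, offset=(0.5, 0.25), size=(10.0, 20.0))
print(box)
```

The deviation term recovers the sub-cell position that is lost when the center is quantized to the 128 × 128 grid; without it, every box center would snap to a multiple of 4 pixels.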

Experiments

In this section, we evaluate our proposed MSCCPNet on the PASCAL VOC [44] and COCO [45] datasets and compare it against seven comparable object detection methods to assess its effectiveness. We first introduce the datasets and evaluation indicators in the following subsections. We then conduct the ablation experiments. Finally, we compare our proposed MSCCPNet with seven object detection algorithms.

Datasets

In the field of object detection, datasets like COCO [45] and PASCAL VOC [44] are frequently employed. PASCAL VOC datasets and COCO datasets can be used in computer vision tasks such as human posture estimation [46, 47], object segmentation [48, 49], object detection, etc. PASCAL VOC 2007 datasets contain 9963 images and 24,640 annotated objects. PASCAL VOC 2012 trainval sets contain 11,530 images and 27,450 annotated objects. The PASCAL VOC datasets include 20 object categories such as bus, chair, table and boat. The PASCAL VOC datasets [44] are better suited for performance analysis of model algorithms since they have good overall image quality and a comparatively thorough annotation. The PASCAL VOC datasets [44] provide a standard dataset format. The PASCAL VOC datasets [44] have three important folders, namely JPEGImages, Annotations and ImageSets. The JPEGImages folder contains images for training and testing. The Annotations folder contains tag data in XML format. Each XML tag data corresponds to each image in the JPEGImages folder. The ImageSets folder contains the txt file containing the image name.
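To make the Annotations format concrete, the sketch below parses one XML annotation with the Python standard library. The tag names (`filename`, `object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) follow the usual PASCAL VOC layout; the sample file content is made up for illustration and is not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the usual VOC layout (values are illustrative).
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <size><width>353</width><height>500</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return (image file name, [(category, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.find("name").text, coords))
    return root.findtext("filename"), objects


filename, objects = parse_voc_annotation(SAMPLE_XML)
print(filename, objects)
```

Each returned box is a ground-truth box in image pixels, paired with the image in JPEGImages through the file name.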

Table 1 The ablation experiment is trained on COCO2017 train datasets and tested on COCO2017 validation datasets (COCO minival set)


Compared with PASCAL VOC [44], detection on the COCO datasets [45] is more difficult. The COCO datasets have a large number of images. In the field of object detection, we generally use the 80 categories of the COCO datasets [45]. The COCO datasets [45] have more than 300,000 images and more than 2,000,000 instances. Both the span of object sizes and the number of small objects in the COCO datasets [45] are quite large.

Evaluation indicators

The quality of an object detection algorithm must be evaluated against a standard rule, and this evaluation is complex. For a specific object, the detection algorithm is generally judged by the degree of overlap between the prediction box and the ground-truth box. The Intersection over Union (IOU) is generally used to quantify this overlap [11, 26, 50]. IOU is the ratio of the intersection to the union of the prediction box and the ground-truth box. We usually set a threshold of 0.5 on the IOU to judge whether the prediction box is correct. When the IOU value is greater than 0.5, we consider the prediction box correct. When the IOU value is less than 0.5, we consider the prediction box wrong, that is, the detection is invalid.
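The IOU computation for two axis-aligned boxes can be sketched as follows, with boxes given as (x1, y1, x2, y2) corner coordinates. The function name and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(prediction, ground_truth)
print(score, score > 0.5)
```

Here the two boxes share half of their area, yet the IOU is only 1/3, so this prediction would be counted as wrong at the 0.5 threshold.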

Because each prediction box is judged right or wrong, four kinds of samples arise when evaluating an object detection algorithm: True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN). After examining each prediction box, we know whether it is a TP or an FP. We can then calculate the Recall rate and the Precision rate from these counts. Recall, Precision and AP are defined as follows:

$$\begin{aligned} R_\mathrm{(recall)}&= \frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}} \end{aligned}$$

(2)

$$\begin{aligned} P_\mathrm{(precision)}&= \frac{{\text {TP}}}{{\text {TP}}+{\text {FP}}} \end{aligned}$$

(3)

$$\begin{aligned} {\text {AP}}&= \int _0^1P{\text {d}}R \end{aligned}$$

(4)

Traversing the prediction boxes one by one yields a corresponding pair of P and R at each step. Each pair defines a point, and connecting all the points gives the P–R curve. AP is the area under the P–R curve.
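Equations (2) to (4) can be traced with a short sketch. Predictions are assumed to be sorted by descending confidence and already labeled as TP or FP, and the number of ground-truth boxes equals TP + FN. The area is approximated with a plain rectangle rule over the recall steps; the official PASCAL VOC and COCO evaluations use interpolated precision, so this is a simplification. Function names and the example flags are illustrative.

```python
def precision_recall_points(is_tp, num_gt):
    """Return one (recall, precision) point per prediction.

    is_tp holds TP/FP flags for predictions sorted by descending confidence;
    num_gt is the number of ground-truth boxes (TP + FN).
    """
    points, tp, fp = [], 0, 0
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    return points


def average_precision(points):
    """Area under the P-R curve, summed as rectangles over recall increments."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


flags = [True, False, True, True]   # four predictions, three of them correct
points = precision_recall_points(flags, num_gt=4)
ap = average_precision(points)
print(points, ap)
```

A false positive leaves recall unchanged, so it adds no area itself; it lowers the precision at which later true positives are counted.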

Table 2 The ablation experimental results on PASCAL VOC 2007+2012 datasets


Table 3 Experimental results on PASCAL VOC 2007+2012 datasets


Ablation analysis

To show the usefulness of MSCCPNet, we conduct ablation experiments on the SCAN structure and the cascaded center point structure. The basic algorithm model contains neither the SCAN structure nor the cascaded center point structure, and we name it BaseNet.

The ablation experiment of SCAN structure

To demonstrate the effectiveness of our proposed SCAN structure, we carry out verification on the PASCAL VOC and COCO2017 datasets. For clarity, we consider four cases: (1) BaseNet. (2) SCAN+BaseNet. (3) Cascaded Center Point+BaseNet. (4) our proposed MSCCPNet. Note that our proposed MSCCPNet is BaseNet extended with both the cascaded center point structure and the SCAN structure. Table 1 displays the outcomes of the ablation experiments after training on the COCO2017 train datasets and testing on the COCO2017 validation datasets (COCO minival set). It is evident from Table 1 that SCAN+BaseNet has better detection performance than BaseNet. The ${\text {AP}}_{{\text {S}}}$, ${\text {AP}}_{{\text {M}}}$ and ${\text {AP}}_{{\text {L}}}$ obtained by SCAN+BaseNet are 14.7%, 39.2% and 49.4% respectively, and the ${\text {AR}}_{{\text {S}}}$, ${\text {AR}}_{{\text {M}}}$ and ${\text {AR}}_{{\text {L}}}$ are 23.9%, 53.7% and 70.0% respectively, all higher than those obtained by BaseNet. In particular, the ${\text {AP}}_{{\text {S}}}$ of 14.7% is 0.9% higher than that obtained by BaseNet. These results show that the SCAN structure improves the detection accuracy of small objects, and that the detection performance on large-scale and medium-scale objects also improves. The AP obtained by SCAN+BaseNet is 33.2%, 0.7% higher than that obtained by BaseNet. Table 2 shows the results of the ablation experiments after training on the PASCAL VOC 2007+2012 train datasets and testing on the PASCAL VOC 2007 test datasets. It can be seen from Table 2 that SCAN+BaseNet achieves good detection accuracy for the majority of object categories. Together, Tables 1 and 2 show that the SCAN structure can alleviate the imbalance problem in detecting different scale objects.

The ablation experiment of cascaded center point structure

To demonstrate the effectiveness of our proposed cascaded center point structure, we carry out verification on the PASCAL VOC datasets and COCO2017 datasets. It is evident from Table 1 that AP, ${\text {AP}}_{50}$ and ${\text {AP}}_{75}$ obtained by Cascaded Center Point+BaseNet are 33.3%, 51.8% and 35.4% respectively, which are higher than those obtained by BaseNet. It is evident from Table 1 that ${\text {AR}}_{1}$, ${\text {AR}}_{10}$ and ${\text {AR}}_{100}$ obtained by Cascaded Center Point+BaseNet are 29.3%, 46.4% and 48.5% respectively, which are higher than those obtained by BaseNet. In Table 1, the AP obtained by Cascaded Center Point+BaseNet is 33.3%, 0.8% higher than that obtained by BaseNet. AP and ${\text {AR}}_{1}$ can better reflect the accuracy of sample detection. The ${\text {AR}}_{1}$ obtained by Cascaded Center Point+BaseNet is 0.6% higher than that obtained by BaseNet, indicating that the cascaded center point structure outperforms original single-center-point based network.

Table 4 Experimental results on PASCAL VOC 2007 datasets


Figure 4 shows the test results of BaseNet and MSCCPNet in 80 categories. It is evident from Fig. 4 that MSCCPNet has higher detection accuracy than BaseNet in most object categories. Figure 5 shows the qualitative results of the above four methods. It is evident from Fig. 5 that MSCCPNet has higher detection accuracy than others. It is evident from Fig. 5 that sofa is detected correctly on the MSCCPNet (sofa in the 1-st row in Fig. 5) and sofa is detected incorrectly on the BaseNet. At the same time, we can see in Fig. 5 that the detection accuracy of MSCCPNet on small-scale objects has been improved (objects in the 5-th row in Fig. 5). In Fig. 5, we can see that the detection accuracy of medium-scale objects has been improved (objects in the 2-nd row in Fig. 5). The detection accuracy of MSCCPNet on large-scale objects in rows 3, 4, 5 and 6 in Fig. 5 has also been improved. MSCCPNet’s detection performance is satisfactory. Therefore, the SCAN structure and cascaded center point structure play a positive and effective role in object detection when they are integrated into our proposed method. We take some images in our daily life. Figure 6 shows the detection results of images in our daily life. It is evident from Fig. 6 that our proposed method is also competitive in detection application.

Quantitative comparisons

To further highlight the efficiency of our proposed MSCCPNet, we compare it to other standard object detection algorithms.

When the six methods are trained on the PASCAL VOC 2007+2012 train datasets and tested on the PASCAL VOC 2007 test datasets, the results are shown in Table 3. It is evident from Table 3 that MSCCPNet has high detection accuracy on 13 object categories including bike, cat, bird, bus, bottle, cow and dog. When the six methods are trained on the PASCAL VOC 2007 train datasets and tested on the PASCAL VOC 2007 test datasets, the results are shown in Table 4. It is evident from Table 4 that MSCCPNet has high detection accuracy on 17 object categories including bike, bird, horse, bus, cat, cow and dog. The results on the PASCAL VOC datasets demonstrate the validity of MSCCPNet and its competitiveness compared with other methods.

Table 5 reports the results when the six methods are trained on the COCO2017 training set and tested on the COCO2017 validation set (COCO minival set). MSCCPNet obtains ${\text {AP}}_{\text {S}}$, ${\text {AP}}_{\text {M}}$ and ${\text {AP}}_{\text {L}}$ of 14.4%, 39.2% and 49.2%, respectively. These values are 1%, 2.2% and 3.6% higher than those of EfficientDet-D1 [51], 0.6%, 0.6% and 2.4% higher than those of CenterNet [18], and 0.4%, 5.7% and 5.7% higher than those of RetinaNet [43], and they are higher than those of most of the compared methods. This shows that MSCCPNet helps to alleviate the imbalance problem in detecting objects of different scales. MSCCPNet also obtains ${\text {AR}}_{1}$, ${\text {AR}}_{10}$ and ${\text {AR}}_{100}$ of 29.5%, 46.5% and 48.5%, respectively, which are higher than those of most methods. Table 5 further shows that MSCCPNet is competitive in both detection speed and detection accuracy. Its AP is 1% higher than that of CenterNet [18] at a similar detection speed. EfficientDet-D0 [51] runs at 39 frames per second (FPS) and EfficientDet-D2 [51] at 30 FPS, whereas MSCCPNet runs at 60 FPS, and the AP of MSCCPNet is 5.6% higher than that of EfficientDet-D0 [51] and 1.0% higher than that of EfficientDet-D2 [51]. Compared with these two models, MSCCPNet therefore improves both detection speed and detection accuracy.
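To make the scale-specific metrics concrete, the following minimal sketch shows how objects can be grouped into the small, medium and large buckets that underlie ${\text {AP}}_{\text {S}}$, ${\text {AP}}_{\text {M}}$ and ${\text {AP}}_{\text {L}}$. It assumes the size convention of the standard COCO evaluation protocol (areas below 32² pixels are small, areas from 32² up to 96² are medium, larger areas are large). The function names are hypothetical, box area is used as a stand-in for the annotation area read by the official evaluation, and boundary values are assigned to the upper bucket for simplicity.

```python
# Size thresholds of the standard COCO protocol, in squared pixels.
SMALL_MAX_AREA = 32 ** 2    # areas below this count toward AP_S
MEDIUM_MAX_AREA = 96 ** 2   # areas from 32^2 up to this count toward AP_M


def size_bucket(width, height):
    """Return 'S', 'M' or 'L' for a box of the given width and height.

    Assumption: box area approximates the annotation area that the
    official evaluation reads from the ground-truth file.
    """
    area = width * height
    if area < SMALL_MAX_AREA:
        return "S"
    if area < MEDIUM_MAX_AREA:
        return "M"
    return "L"


def count_by_bucket(boxes):
    """Count (width, height) pairs per size bucket."""
    counts = {"S": 0, "M": 0, "L": 0}
    for width, height in boxes:
        counts[size_bucket(width, height)] += 1
    return counts


if __name__ == "__main__":
    # Three illustrative boxes, one per bucket.
    print(count_by_bucket([(10, 10), (40, 40), (200, 50)]))
```

Counting ground-truth boxes per bucket in this way is one simple check of how unevenly the scales are represented in a dataset, which is the kind of imbalance the scale-specific metrics are meant to expose.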

Table 6 reports the results when MSCCPNet is trained on the COCO2017 training set and tested on the COCO2017 validation set (COCO minival set) with different input dimensions. MSCCPNet achieves better results as the input dimensions increase. This indicates that the larger the input image being processed, the richer the features extracted by the model, and the better the detection performance. Table 7 reports the results when MSCCPNet is trained and tested on the same datasets with different backbone networks. MSCCPNet also achieves competitive results with different backbone networks. Therefore, our proposed MSCCPNet has certain scientific research value.

Table 5 Experimental results on COCO2017 datasets

Table 6 The performance of different dimensions on COCO2017 datasets

Table 7 The performance of different backbone networks on COCO2017 datasets

Conclusion

In this paper, we propose a new multiple space based cascaded center point network (MSCCPNet) for object detection. It is based on the SCAN structure and the cascaded center point structure, and it obtains the final detection result through cascaded heat map prediction, center point deviation prediction, and width and height prediction. Given appropriate training strategies, the proposed SCAN structure can alleviate the imbalance problem in detecting objects of different scales, and the proposed cascaded center point structure outperforms the original single-center-point based network. Experimental results demonstrate that, under a unified training and testing environment, MSCCPNet is effective and competitive when compared with several representative object detection algorithms, and that it improves detection accuracy while maintaining detection speed. However, its storage footprint remains a challenge, which affects its performance in industrial applications. To address this challenge, we will construct a new object detection method based on multi-resolution technology in the future.
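The three prediction steps summarized above can be illustrated with a minimal decoding sketch. It is not the authors' implementation. The element-wise maximum used to combine the two center heat maps is only one assumed reading of "choosing the high confidence and discarding the low confidence", and the function names, the score threshold of 0.3, the output stride of 4, and the convention that offsets and sizes are expressed in heat map cells are all assumptions made for illustration.

```python
def fuse_heatmaps(first, second):
    """Keep the higher score of two center heat maps, cell by cell.

    Both inputs are nested lists indexed as heat[class][y][x].
    Assumption: element-wise maximum stands in for the cascaded fusion.
    """
    return [
        [[max(a, b) for a, b in zip(row1, row2)]
         for row1, row2 in zip(plane1, plane2)]
        for plane1, plane2 in zip(first, second)
    ]


def is_local_peak(plane, y, x):
    """True if no cell in the 3x3 neighbourhood scores higher than plane[y][x]."""
    rows, cols = len(plane), len(plane[0])
    for ny in range(max(0, y - 1), min(rows, y + 2)):
        for nx in range(max(0, x - 1), min(cols, x + 2)):
            if plane[ny][nx] > plane[y][x]:
                return False
    return True


def decode(heat, offset, size, threshold=0.3, stride=4):
    """Turn heat map peaks into (class, score, x1, y1, x2, y2) tuples.

    offset[y][x] = (dx, dy) is the predicted center point deviation and
    size[y][x] = (w, h) the predicted width and height, both assumed to be
    in heat map cells and scaled to image pixels by the assumed stride.
    """
    detections = []
    for cls, plane in enumerate(heat):
        for y, row in enumerate(plane):
            for x, score in enumerate(row):
                if score < threshold or not is_local_peak(plane, y, x):
                    continue
                dx, dy = offset[y][x]
                w, h = size[y][x]
                cx, cy = (x + dx) * stride, (y + dy) * stride
                w, h = w * stride, h * stride
                detections.append(
                    (cls, score, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                )
    return detections


if __name__ == "__main__":
    # One class on a 2x2 grid, with two hypothetical center heat maps.
    heat_a = [[[0.1, 0.2], [0.0, 0.9]]]
    heat_b = [[[0.6, 0.1], [0.0, 0.5]]]
    offset = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.5, 0.25)]]
    size = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (2.0, 1.0)]]
    print(decode(fuse_heatmaps(heat_a, heat_b), offset, size))
```

In this toy input, the cell scoring 0.6 passes the threshold but is suppressed because a neighbouring cell scores higher, so only the strongest center is turned into a box.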

Data availability

The PASCAL VOC dataset is publicly available: http://host.robots.ox.ac.uk:8080/pascal/VOC/. The COCO dataset is publicly available: https://cocodataset.org/.

References

Chen X, Yu J, Kong S, Wu Z, Wen L (2021) Joint anchor-feature refinement for real-time accurate object detection in images and videos. IEEE Trans Circuits Syst Video Technol 31(2):594–607. https://doi.org/10.1109/TCSVT.2020.2980876
Wang H, Jiang L, Zhao Q, Li H, Yan K, Yang Y, Li S, Zhang Y, Qiao L, Fu C, Yin H, Hu Y, Yu H (2021) Progressive structure network-based multiscale feature fusion for object detection in real-time application. Eng Appl Artif Intell 106:104486. https://doi.org/10.1016/j.engappai.2021.104486
Li Z, Lang C, Liang L, Zhao J, Feng S, Hou Q, Feng J (2021) Dense attentive feature enhancement for salient object detection. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/TCSVT.2021.3102944
Bosquet B, Mucientes M, Brea VM (2020) STDnet: exploiting high resolution feature maps for small object detection. Eng Appl Artif Intell 91:103615. https://doi.org/10.1016/j.engappai.2020.103615
Han X, He T, Ong Y-S, Zhong Y (2020) Precise object detection using adversarially augmented local/global feature fusion. Eng Appl Artif Intell 94:103710. https://doi.org/10.1016/j.engappai.2020.103710
Dong Y, Tan W, Tao D, Zheng L, Li X (2021) Cartoonlossgan: learning surface and coloring of images for cartoonization. IEEE Trans Image Process 31:485–498
Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision (ICCV), Santiago, Chile, pp 1440–1448
Tan J (2020) Complex object detection using deep proposal mechanism. Eng Appl Artif Intell 87:103234. https://doi.org/10.1016/j.engappai.2019.09.003
Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, USA, pp 936–944. https://doi.org/10.1109/CVPR.2017.106
Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Cai Z, Vasconcelos N (2018) Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake City, USA, pp 6154–6162. https://doi.org/10.1109/CVPR.2018.00644
Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q (2019) Centernet: Keypoint triplets for object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), Seoul, South Korea, pp 6568–6577. https://doi.org/10.1109/ICCV.2019.00667
Law H, Deng J (2018) Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany
Fu C-Y, Liu W, Ranga A, Tyagi A, Berg AC (2017) Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659
Zhou C, Yuan J (2020) Occlusion pattern discovery for object detection and occlusion reasoning. IEEE Trans Circuits Syst Video Technol 30(7):2067–2080. https://doi.org/10.1109/TCSVT.2019.2909982
Duan K, Du D, Qi H, Huang Q (2020) Detecting small objects using a channel-aware deconvolutional network. IEEE Trans Circuits Syst Video Technol 30(6):1639–1652. https://doi.org/10.1109/TCSVT.2019.2906246
Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y (2021) Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans Pattern Anal Mach Intell 43(1):172–186. https://doi.org/10.1109/TPAMI.2019.2929257
Zhou X, Wang D, Krähenbühl P (2019) Objects as points. arXiv preprint arXiv:1904.07850
Tong K, Wu Y (2022) Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis Comput 123:104471. https://doi.org/10.1016/j.imavis.2022.104471
Dong Y, Shen L, Pei Y, Yang H, Li X (2023) Field-matching attention network for object detection. Neurocomputing 535:123–133
Dong Y, Zhao K, Zheng L, Yang H, Liu Q, Pei Y (2023) Refinement co-supervision network for real-time semantic segmentation. IET Comput Vis
Liu Q, Dong Y, Li X (2023) Multi-stage context refinement network for semantic segmentation. Neurocomputing 535:53–63
Tao H, Cheng L, Qiu J, Stojanovic V (2022) Few shot cross equipment fault diagnosis method based on parameter optimization and feature metric. Meas Sci Technol 33(11):115005
Shen L, Tao H, Ni Y, Wang Y, Vladimir S (2023) Improved yolov3 model with feature map cropping for multi-scale road object detection. Meas Sci Technol
Zhuang Z, Tao H, Chen Y, Stojanovic V, Paszke W (2022) An optimal iterative learning control approach for linear systems with nonuniform trial lengths under input constraints. IEEE Trans Syst Man Cybern Syst
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: Single shot multibox detector. In: Proceedings of the European conference on computer vision (ECCV), Amsterdam, Netherlands, pp 21–37
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, USA, pp 6517–6525. https://doi.org/10.1109/CVPR.2017.690
Liang X, Zhang J, Zhuo L, Li Y, Tian Q (2020) Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans Circuits Syst Video Technol 30(6):1758–1770. https://doi.org/10.1109/TCSVT.2019.2905881
Novoselov A, Dyakov O, Kostromin I, Pogibelskiy D (2019) Cascade multi-scale object detection on high-resolution images. In: 2019 International conference on engineering and telecommunication (EnT), pp 1–4. https://doi.org/10.1109/EnT47717.2019.9030548
Li J, Liang X, Wei Y, Xu T, Feng J, Yan S (2017) Perceptual generative adversarial networks for small object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Hawaii, USA, pp 1222–1230
Liu Z, Fang W, Sun J (2021) Ssd small object detection algorithm based on feature enhancement and sample selection. In: International symposium on distributed computing and applications for business engineering and science (DCABES), pp 96–99. https://doi.org/10.1109/DCABES52998.2021.00031
Liang X, Zhang J, Zhuo L, Li Y, Tian Q (2020) Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans Circuits Syst Video Technol 30(6):1758–1770. https://doi.org/10.1109/TCSVT.2019.2905881
Li Z, Peng C, Yu G, Zhang X, Deng Y, Sun J (2018) Detnet: design backbone for object detection. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany
Boroumand M, Chen M, Fridrich J (2019) Deep residual network for steganalysis of digital images. IEEE Trans Inf Forens Secur 14(5):1181–1193. https://doi.org/10.1109/TIFS.2018.2871749
Costilla-Reyes O, Vera-Rodriguez R, Scully P, Ozanyan KB (2019) Analysis of spatio-temporal representations for robust footstep recognition with deep residual neural networks. IEEE Trans Pattern Anal Mach Intell 41(2):285–296. https://doi.org/10.1109/TPAMI.2018.279984
Wu Z, Shen C, Van Den Hengel A (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recogn 90:119–133
Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS-improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision (ICCV), Venice, Italy
Dong Y, Jiang Z, Tao F, Fu Z (2022) Multiple spatial residual network for object detection. Complex Intell Syst:1–16
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, USA, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Ge Z, Liu S, Wang F, Li Z, Sun J (2021) Yolox: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision (ICCV), Venice, Italy, pp 2999–3007. https://doi.org/10.1109/ICCV.2017.324
Everingham M, Gool LV, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int J Comput Vis 88(2):303–338
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T (eds) Proceedings of the European conference on computer vision (ECCV), Zurich, Switzerland, pp 740–755
Sun X, Xiao B, Wei F, Liang S, Wei Y (2018) Integral human pose regression. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, pp 529–545
Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, pp 466–481
Chen L-C, Hermans A, Papandreou G, Schroff F, Wang P, Adam H (2018) Masklab: Instance segmentation by refining object detection with semantic and direction features. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Salt Lake City, USA, pp 4013–4022. https://doi.org/10.1109/CVPR.2018.00422
Chen K, Pang J, Wang J, Xiong Y, Li X, Sun S, Feng W, Liu Z, Shi J, Ouyang W et al (2019) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Long Beach, USA, pp 4974–4983
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171
Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, USA, pp 10778–10787. https://doi.org/10.1109/CVPR42600.2020.01079

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62071171, and in part by the Natural Science Foundation of Henan under Grant 232300421023.

Author information

Authors and Affiliations

School of Information Engineering, Henan University of Science and Technology, Luoyang, 471023, China
Zhiqiang Jiang, Yongsheng Dong, Yuanhua Pei, Lintao Zheng, Fazhan Tao & Zhumu Fu

Corresponding author

Correspondence to Yongsheng Dong.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jiang, Z., Dong, Y., Pei, Y. et al. Multiple space based cascaded center point network for object detection. Complex Intell. Syst. 9, 7213–7225 (2023). https://doi.org/10.1007/s40747-023-01102-7

Received: 11 February 2023
Accepted: 01 May 2023
Published: 23 June 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s40747-023-01102-7
