SiamCPN: Visual tracking with the Siamese center-prediction network

Object detection is widely used in object tracking; anchor-free object tracking provides an end-to-end single-object-tracking approach. In this study, we propose a new anchor-free network, the Siamese center-prediction network (SiamCPN). Given the features of the referenced object in the initial frame, we directly predict the center point and size of the object in subsequent frames with a Siamese-structure network, without per-frame post-processing operations. Unlike other anchor-free tracking approaches that are based on semantic segmentation and achieve anchor-free tracking through pixel-level prediction, SiamCPN directly obtains all of the information required for tracking, greatly simplifying the model. A center-prediction sub-network is applied to multiple stages of the backbone to adaptively learn from the experience of different branches of the Siamese network. The model can accurately predict object location, implement appropriate corrections, and regress the size of the target bounding box. Compared to other leading Siamese networks, SiamCPN is simpler, faster, and more efficient, as it uses fewer hyperparameters. Experiments demonstrate that our method outperforms other leading Siamese networks on the GOT-10K and UAV123 benchmarks, and is comparable to other excellent trackers on LaSOT, VOT2016, and OTB-100, while improving inference speed by 1.5 to 2 times.


Introduction
Single-object tracking is a fundamental problem in visual media processing. It is widely used in applications requiring the location and appearance characteristics (shape, color, etc.) of targets, such as interactive visual media editing, intelligent monitoring, human-computer interaction, and augmented reality. In general, single-object tracking aims to find a target, marked in the first frame, in subsequent frames of a video or image sequence. By modeling the appearance and movement of the target, the tracker can predict its motion to estimate its position. In particular, such a tracker can track any object without specifying its category, by learning essential information related to the target such as its appearance and spatial extent. However, widespread interfering factors, such as strong illumination changes, severe deformation of non-rigid objects, similar backgrounds, and occlusion, pose considerable challenges to this task.
Despite these difficulties, many excellent visual object tracking algorithms [1][2][3][4] have emerged. Among them, tracking by Siamese networks has attracted much attention in recent years [5][6][7][8]. A Siamese network with shared parameters receives two inputs for feature extraction: one branch marks out the template target region, while the other branch is used for search. After feature extraction through the deep backbone network, the task of finding the target object becomes one of calculating the similarity of the two output feature maps, usually by cross correlation. The cross-correlated features generate a fixed-size response map whose peak is regarded as the position of the target object. SINT [8] and SiamFC [5] first used this approach to solve the single-object tracking problem. SiamRPN [7] improved the performance of SiamFC [5] by introducing a region proposal network [9]. Using the Siamese network structure, foreground-background classification and bounding box regression can be performed on the proposed regions, which effectively improves the accuracy of the predicted bounding box, avoids the multiscale test in SiamFC [5], and achieves state-of-the-art performance on multiple benchmarks. In later research, SiamRPN++ [6], DaSiamRPN [10], and SiamDW [11] improved tracking performance via the backbone network structure, residual block structure, sampling strategy, and in other ways. However, all of these approaches rely on a predefined configuration of anchors. RPN-based models use multi-channel response maps to detect and regress region proposals, where the number of channels in the output response map depends on the preconfigured anchors.
Furthermore, the existence of anchors generates a large number of redundant prediction boxes, requiring additional post-processing procedures such as non-maximum suppression to eliminate candidate boxes and obtain the final result, which further increases computation. On the basis of semantic segmentation theory, some researchers have recently addressed these defects via pixel-level prediction, performing object tracking in an anchor-free manner [12][13][14]. FCAF [14] suggested using an anchor-free proposal network (AFPN) to replace the region proposal network. The AFPN consists of a correlation section and a supervised section with two branches, one for classification and the other for regression. To suppress the prediction of low-quality bounding boxes, a centerness branch was added, similar to that in SiamCAR [12]. However, as SiamCAR performs classification at the pixel level, mapping the predicted position back to the original image may cause deviations that result in jitter during tracking. Therefore, after obtaining the prediction results of multiple adjacent pixels in the target area and upsampling the response map, the predictions of multiple adjacent points are weighted and averaged to give the final target box. This post-processing procedure increases the computational burden during tracking. Moreover, although anchor-free approaches can simplify the region proposal module used in anchor-based trackers, post-processing is still needed because the network outputs take a semantic-segmentation form.
As an alternative to the above methods, we propose a Siamese center-prediction network (SiamCPN) based on keypoint detection, predicting the position and size of the target region in a truly end-to-end manner. It uses a multi-channel heatmap in which one channel predicts the target position while additional channels adjust the center offset and regress the object size. In this manner, all of the information required for tracking can be obtained directly, without any post-processing, greatly simplifying the model. A center-prediction sub-network (CPN) is applied to multiple stages of the backbone as a means of adaptively correlating the feature maps from the Siamese network. The outputs of SiamCPN are the directly predicted objects; no post-processing procedure is needed. Our main contributions are as follows:
• SiamCPN, a network for single-object tracking that can be implemented in a simple, truly end-to-end manner. A few channels of the response maps are learned to directly predict the center and size of the target region, achieving anchor-free tracking.
• A CPN to adaptively correlate multistage outputs from the backbone.
• A demonstration that SiamCPN has superior performance on multiple datasets and is competitive in inference speed with the other methods selected in this work.

Related work
This section mainly focuses on tracking approaches based on Siamese networks. Tao et al. [8] proposed SINT and pointed out that the object-tracking problem could be converted into a matching process between a template frame and other frames. Using a Siamese network that accepts two inputs at the same time, SINT learnt a matching function between different regions in the two input frames. After obtaining target information from the first frame, all following frames could be fed into the network to calculate their similarity with the target in the first frame. However, this method required generating several region proposals in the image before passing data through the network, which was time-consuming. Bertinetto et al. [5] proposed SiamFC, further defining the tracking problem as a similarity learning problem, thereby obtaining a single-channel score map for object detection. SiamFC [5] quickly gained researchers' attention owing to its simple architecture, high accuracy, and high speed; it requires only offline training, without online fine-tuning.
Following these initial approaches, functional modules from related research have been applied to visual tracking by Siamese networks [6,7,[10][11][12][15][16][17]. Li et al. [7], who proposed SiamRPN, combined the RPN network from Faster R-CNN [9] with the Siamese network. SiamRPN replaced multiscale detection in SiamFC with bounding box regression, improving inference speed and accuracy. SiamRPN also adopted the idea of one-shot learning: during tracking, the template patch in the first frame could be fed into the template branch as the detection kernel and then used to perform a cross-correlation operation with the features of the search region in subsequent frames. Wang et al. [15] proposed the SiamMask network, which can simultaneously perform object tracking and segmentation based on a Siamese network by adding a mask branch for heatmap prediction to achieve object segmentation. Zhu et al. [10] argued that methods based on a Siamese network can only distinguish the target from non-semantic background; when similar-looking backgrounds and objects occur, such trackers usually do not work well. Furthermore, a tracker based on a Siamese network cannot update its model online during the tracking stage, which can lead to accuracy loss. In addition, certain trackers cannot deal with the challenges of occlusion and targets disappearing from the scene during long-term tracking. In response to these three problems, Zhu et al. [10] introduced DaSiamRPN, trained with high-quality training data. Existing object detection datasets were used to enrich positive samples and hard negative samples, improving the generalization and discrimination abilities of the tracker, respectively. A distractor-aware module was also introduced to improve the criterion for choosing the optimal bounding box.
When researchers replaced AlexNet [18] with a deeper convolution network for feature extraction based on the Siamese network structure, they discovered the problem of location bias [6,11], suggesting that the earlier works like SiamFC and SiamRPN could only use shallow networks for feature extraction. Zhang and Peng [11] analyzed the three factors of stride size, padding, and receptive field in convolutional networks. After several experiments, they found that the existence of padding in a deep network would cause tracking position deviation, and thus, that stride should be made as small as possible (8 is recommended). Furthermore, the size of the receptive field and the output stride should be considered at the same time.
On the basis of such observations, Zhang and Peng proposed SiamDW and adopted a new residual module to reduce the impact of padding. Li et al. [6] also explored the abovementioned problems and argued that a Siamese network could not use a deep network structure because of its lack of strict translational invariance; moreover, padding could destroy translation invariance. The sampling strategy was improved by transforming the original fixed position sampling to uniform sampling near the center. They trained a Siamese network tracker using ResNet [19] as the backbone network. Compared with previous work, the performance of the tracker was notably improved. Apart from maintaining real-time performance (35 frames per second), SiamRPN++ [6] achieved excellent scores in terms of expected average overlap rate, robustness, and accuracy.
However, anchor-based methods not only require several experiments to determine suitable hyperparameters, but also need tedious post-processing operations.
Recently, some works based on semantic segmentation have achieved pixel-by-pixel object tracking in an anchor-free and proposal-free manner [12][13][14]. In contrast, our approach builds on keypoint detection theory [20][21][22][23][24]: only the center point of the bounding box and associated attributes are used to predict and correct the position and size of the target. This allows our SiamCPN to operate faster and perform better while using the same feature extraction strategy as Refs. [6,12]. Furthermore, our proposed method offers a more concise and convenient object-tracking solution than the other methods considered.

Overview
The overall structure of our SiamCPN is shown in Fig. 1. Features are extracted by the Siamese fully convolutional backbone. Multiple CPNs are used to measure the similarity of the outputs from different stages of the Siamese feature extraction backbone, and the final result is obtained by averaging the weighted outputs of these CPN modules. In this section, we discuss the overall structure of the proposed SiamCPN (Section 3.2) and then describe the CPN (Section 3.3) and the loss functions used to train SiamCPN (Section 3.4).

Siamese center prediction network
In SiamCPN, a modified ResNet-50 is used as the backbone to build a fully convolutional network for feature extraction. The stride of the network is reduced and its receptive field is increased simultaneously via dilated convolution to ensure the spatial consistency of conv4 and conv5.
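For concreteness, the following sketch shows one way to obtain such a backbone from torchvision's stock ResNet-50: replacing the strides of the last two stages with dilation keeps conv4 and conv5 at the same spatial resolution (total stride 8). This is a minimal sketch; the authors' exact layer surgery and pretrained weights may differ.

```python
# Minimal sketch of a stride-8, dilated ResNet-50 feature extractor.
import torch
import torchvision

backbone = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, True, True]  # dilate conv4, conv5
)

x = torch.randn(1, 3, 255, 255)  # search region
f = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
f2 = backbone.layer1(f)    # stride 4
f3 = backbone.layer2(f2)   # stride 8
f4 = backbone.layer3(f3)   # still stride 8 (dilated instead of strided)
f5 = backbone.layer4(f4)   # still stride 8 (dilated instead of strided)
print(f3.shape, f4.shape, f5.shape)  # spatially consistent feature maps
```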
Tracking algorithms based on a Siamese network usually take input from two branches, called the template branch and the search branch. As shown in Fig. 1, the branch receiving the template Z is the template branch, and the branch receiving the other input X is the search branch. The template branch takes a specified template patch Z in the first frame as input, whereas the search branch takes the search region X as input. These two inputs are fed into a shared-parameter CNN to generate output feature maps ϕ(Z) and ϕ(X). Then, the similarity response of the two output feature maps ϕ(Z) and ϕ(X) is calculated by cross-correlation. Finally, the output response map passes through the CPN head to generate multiple response maps:

$$H = \mathrm{CPN}\big(F_b^*(Z),\ F_b^*(X)\big)$$

where CPN denotes the center-prediction sub-network. The CPN calculates the cross correlation between the channel-aligned features $F_b^*(X)$ and $F_b^*(Z)$, which come from block b of the backbone network, and adaptively generates a single-channel or multichannel heatmap H.
Low-level features better represent visual attributes (such as edges, corners, colors, and shapes) that are essential for predicting object positions, whereas high-level features better represent the semantic attributes essential for discrimination. Therefore, we also consider the use of multistage features for tracking. Here, we use features extracted from the last three residual blocks of the backbone, denoted $F_3(*)$, $F_4(*)$, $F_5(*)$, where $*$ represents the template patch Z or search region X. Before cross-correlation, the channel sizes of $F_3(*)$, $F_4(*)$, $F_5(*)$ must be unified (to 256 in our experiments); thus, a convolutional layer with kernel size 1 × 1 is appended to each of these three blocks for adjustment. As shown in Fig. 1, the unified features $F_3^*(*)$, $F_4^*(*)$, $F_5^*(*)$, generated by block3, block4, and block5, respectively, are adopted as the inputs to the multiple-CPN module.
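A minimal sketch of this 1 × 1 channel-alignment step is shown below; the module and parameter names are our own, chosen for illustration.

```python
# Reduce the 512/1024/2048-channel outputs of the last three ResNet-50
# blocks to a common 256 channels before cross-correlation.
import torch.nn as nn

class ChannelAdjust(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.adjust = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, out_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channels),
            )
            for c in in_channels
        )

    def forward(self, f3, f4, f5):
        # Returns the unified features F*_3, F*_4, F*_5.
        return [adj(f) for adj, f in zip(self.adjust, (f3, f4, f5))]
```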
The main output of our approach is a heatmap $\hat{Y} \in [0, 1]^{(w/r) \times (h/r) \times 1}$, where w and h are the width and height of X, and r is the output stride; we set w = h = 255. When $\hat{Y}_{x,y} = 1$, the corresponding position (x, y) is regarded as the detected center point; otherwise, it is background. In addition, to correct the positional deviation caused by the stride of the network during learning, we predict a center offset to regress the position more accurately.

Center prediction sub-network
Given the unified feature maps F * b (X) and F * b (Z) from the two branches, the CPN adaptively calculates the cross correlation and outputs a heatmap of the center, corresponding offset, and size of the object. A self-adapted block, depth-wise cross correlation, and a prediction head are used in the proposed CPN.

Self-adapted block
To effectively fuse features from the two branches for the final prediction, we propose a self-adapted block whose parameters are not shared, so that each prediction branch can adapt to its own task. In particular, features from the template and search branches are first passed through separate convolutional layers. The center region is then cropped from the template-branch feature to reduce the computational burden of the cross-correlation operation; the cropped center size is set to 7 × 7 to preserve accurate information about the object. The template branch is then passed through a group convolutional layer for computational efficiency. Unlike the template branch, the search branch only needs to append another group convolutional layer. In general, the self-adapted block allows the modules in each branch to acquire enough meaningful knowledge during training to improve prediction. Figure 1 (bottom) shows details of the self-adapted block. We show only part of one prediction branch (location, offset, or size) for a given CPN module; three similar parts are used to obtain the different CPN outputs.
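The following PyTorch sketch captures our reading of the self-adapted block: unshared convolutions per branch, a 7 × 7 center crop on the template side, and a grouped convolution on both sides. Layer counts, kernel sizes, and the group count are assumptions, not the authors' released configuration.

```python
# One self-adapted branch; instantiate two (unshared) per prediction head.
import torch.nn as nn

class SelfAdaptedBranch(nn.Module):
    def __init__(self, channels=256, crop=None, groups=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.group_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1,
                      groups=groups, bias=False),  # grouped convolution
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.crop = crop  # center-crop size; template branch only

    def forward(self, x):
        x = self.conv(x)
        if self.crop is not None:
            c, half = x.size(-1) // 2, self.crop // 2  # keep 7x7 center
            x = x[..., c - half:c + half + 1, c - half:c + half + 1]
        return self.group_conv(x)

template_branch = SelfAdaptedBranch(crop=7)  # parameters not shared
search_branch = SelfAdaptedBranch()
```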

Depth-wise correlation
Cross-correlation is the core operation during tracking; its goal is to determine the most similar patches from the search region in the semantic embedding space:

$$R = F_b^*(Z) \star F_b^*(X)$$

where $\star$ denotes depth-wise correlation, which generates the multichannel response map R. To efficiently associate information from the two branches, we use depth-wise cross correlation, performing the calculation in a channel-by-channel manner. Each channel of R represents different meaningful semantic information, which can then be used to predict target-related attributes. The CPN head passes R through a convolutional layer with normalization and outputs three 25 × 25 heatmaps, with one, two, and two channels, respectively.
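Depth-wise cross correlation is commonly implemented as a grouped convolution in which the template feature serves as a per-channel kernel; a sketch under that assumption:

```python
# The template acts as a depth-wise convolution kernel over the search
# feature (groups equal the channel count), one channel at a time.
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    # search: (B, C, Hs, Ws); template: (B, C, Ht, Wt)
    b, c, h, w = search.shape
    search = search.reshape(1, b * c, h, w)
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    out = F.conv2d(search, kernel, groups=b * c)  # channel-by-channel
    return out.view(b, c, out.size(-2), out.size(-1))

r = depthwise_xcorr(torch.randn(2, 256, 31, 31), torch.randn(2, 256, 7, 7))
print(r.shape)  # (2, 256, 25, 25)
```

With a 31 × 31 search feature and a 7 × 7 cropped template, this produces exactly the 25 × 25 response maps mentioned above.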

Objective
As the desired output is in the form of a heatmap, the ground truth is built in the same format. First, we compute the center coordinates $p = ((x_1 + x_2)/2,\ (y_1 + y_2)/2)$ of the labeled bounding box in the original image. Then, we obtain the corresponding center coordinates $\tilde{p} = \lfloor p/r \rfloor$ in the downsampled feature map. Finally, the keypoints in the feature map are distributed in the form of Gaussian kernels around the labeled bounding box center:

$$Y_{xy} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$$

where $\sigma_p$ is a standard deviation related to target size, a scheme similar to that in Ref. [24]. By overlaying the Gaussian distributions on the heatmap, Gaussian keypoints can be added continuously. The training objective is a penalty-reduced pixel-wise logistic regression with focal loss [25]:

$$L_{\mathrm{cen}} = -\frac{1}{N}\sum_{xy}\begin{cases}(1 - \hat{Y}_{xy})^{\alpha}\log(\hat{Y}_{xy}), & \text{if } Y_{xy} = 1\\(1 - Y_{xy})^{\beta}\,(\hat{Y}_{xy})^{\alpha}\log(1 - \hat{Y}_{xy}), & \text{otherwise}\end{cases}$$

where N is the number of keypoints in the search region. Following the scheme in Ref. [24], we set α = 2 and β = 4 in the experiments.
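As an illustration of this ground-truth construction, a short NumPy sketch follows; the size-adaptive choice of σ_p follows Ref. [24] in spirit, and the fixed value used here is only a placeholder.

```python
# Overlay a Gaussian keypoint onto a ground-truth heatmap, keeping the
# element-wise maximum where Gaussians overlap.
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
               / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # overlay, preserving peaks
    return heatmap

Y = np.zeros((25, 25), dtype=np.float32)
p_tilde = (12, 14)                    # center in feature-map coordinates
draw_gaussian(Y, p_tilde, sigma=2.0)  # sigma_p is size-adaptive in the paper
```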
To reduce the impact of the position shift caused by downsampling, an offset branch is added to predict the deviation of the center point; we use L1 loss for training:

$$L_{\mathrm{off}} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{r} - \tilde{p}\right)\right|$$

which acts only at the center-point location predicted by the heatmap; all other locations are ignored. The output $\hat{O} \in \mathbb{R}^{(w/r) \times (h/r) \times 2}$ contains two channels, for offsets in the w and h directions, respectively. Predicting attributes of the target center alone is insufficient for tracking; target size information must also be obtained. After estimating the location of the center using the heatmap, we directly regress the width and height of the object with an L1 loss at the center:

$$L_{\mathrm{size}} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|$$

where $\hat{S} \in \mathbb{R}^{(w/r) \times (h/r) \times 2}$ contains two channels, for the width and height of the object, and $s_k$ is the ground-truth size. The overall training objective is

$$L = L_{\mathrm{cen}} + \lambda_{\mathrm{off}} L_{\mathrm{off}} + \lambda_{\mathrm{wh}} L_{\mathrm{size}}$$

where the constants $\lambda_{\mathrm{off}}$ and $\lambda_{\mathrm{wh}}$ weight the offset loss and size loss, respectively. During training, we set $\lambda_{\mathrm{off}} = 1$ and $\lambda_{\mathrm{wh}} = 0.1$. Finally, the average of the outputs of the three CPN modules from multiple stages is taken as the overall prediction. Thus, SiamCPN decomposes the tracking problem into three subproblems: locating the object center, predicting the center offset, and regressing the object size. Combining these multilevel features enhances the capability of the CPN module and yields good predictions.
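To make the objective concrete, here is a minimal PyTorch sketch of the combined loss under the shapes stated above; the tensor layouts and masking scheme are our assumptions, not the authors' released code.

```python
# Penalty-reduced focal loss for the center heatmap plus L1 offset and
# size losses, weighted by lambda_off = 1 and lambda_wh = 0.1.
import torch

def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) \
        * torch.log(1 - pred + eps) * neg
    n = pos.sum().clamp(min=1)  # N: number of keypoints
    return -(pos_term + neg_term).sum() / n

def total_loss(heat, gt_heat, off, gt_off, size, gt_size, mask,
               lam_off=1.0, lam_wh=0.1):
    # `mask` (broadcastable to off/size) selects center-point locations;
    # offset and size losses are evaluated there only, as stated above.
    n = mask.sum().clamp(min=1)
    l_off = (torch.abs(off - gt_off) * mask).sum() / n
    l_wh = (torch.abs(size - gt_size) * mask).sum() / n
    return focal_loss(heat, gt_heat) + lam_off * l_off + lam_wh * l_wh
```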

Implementation
SiamCPN was implemented in Python with PyTorch and trained on 4 TITAN X GPUs. To enable a fair comparison, the input sizes of the template patch and search regions were set in the same manner as in Refs. [6] and [7], i.e., 127 × 127 and 255 × 255 pixels. The backbone was pretrained on ImageNet [26].

Training
We conducted training using six large datasets: GOT-10K [27], LaSOT [28], COCO [29], DET [26], VID [26], and YouTube-BB [30]. During training, we set the batch size to 32 and used stochastic gradient descent for optimization. In total, SiamCPN was trained for 20 epochs. The first 10 epochs were used for preliminary training with the last three blocks of the backbone network frozen; these blocks of ResNet-50 were then unfrozen for training during the remaining epochs. To ensure a fair comparison, training and evaluation on GOT-10K [27] and LaSOT [28] were conducted separately, in accordance with SiamCAR [12], whereas training on the other four datasets was conducted for the evaluations on OTB [31,32], VOT2016 [33], and UAV123 [34].
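A schematic of this two-stage schedule is sketched below; `model.backbone`, the learning rate, and `train_one_epoch` are hypothetical stand-ins for the authors' training code.

```python
# Two-stage schedule: backbone blocks frozen for 10 epochs, then unfrozen.
import torch

def train_schedule(model, loader, train_one_epoch):
    # `train_one_epoch` is a hypothetical helper standing in for the
    # usual PyTorch loop (forward, total loss, backward, SGD step).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for epoch in range(20):
        frozen = epoch < 10  # first 10 epochs: preliminary training
        for name, p in model.backbone.named_parameters():
            # "last three blocks" of ResNet-50 (assumed naming).
            if any(k in name for k in ("layer2", "layer3", "layer4")):
                p.requires_grad = not frozen
        train_one_epoch(model, loader, optimizer)  # batch size 32, SGD
```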

Testing
We implemented an offline tracking strategy for testing. The template branch is computed only once, in the first frame, and then fixed over the whole tracking period. The object in the first frame is adopted as the template patch for tracking, and the current frame is adopted as the search region fed into the backbone network. The purpose of the inference process is to extract the required bounding box from the generated heatmap. Therefore, after the inputs of the two branches pass through SiamCPN, the position of the peak in the heatmap is taken as the location of the object. Then, we adjust the position of the center point using the predicted offset and determine the final box from the center point and the predicted object size. For evaluation on the different datasets, we compare against the results reported by the methods' authors where available. As the availability of results varies between methods, different sets of state-of-the-art methods are compared on different benchmarks.
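This decoding step can be summarized in a few lines; the following sketch assumes the size channels are regressed in image pixels, an implementation detail the text does not fix.

```python
# Decode the final box: take the heatmap peak, correct it with the
# predicted offset, and read the size channels at that location.
import torch

def decode_box(heat, offset, size, r=8):
    # heat: (1, H, W); offset, size: (2, H, W); r: output stride.
    w = heat.shape[-1]
    idx = heat.flatten().argmax()
    cy, cx = divmod(idx.item(), w)                  # peak = object center
    ox, oy = offset[0, cy, cx].item(), offset[1, cy, cx].item()
    bw, bh = size[0, cy, cx].item(), size[1, cy, cx].item()
    cx_img, cy_img = (cx + ox) * r, (cy + oy) * r   # back to image coords
    return cx_img - bw / 2, cy_img - bh / 2, bw, bh  # (x, y, w, h)
```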

Comparison with state-of-the-art
In our experiments, we found the proposed method to be faster than competing methods and easier to train, test, and deploy; it can be adapted without introducing additional hyperparameters. After training, the model was tested directly on the different benchmarks. SiamCPN outperforms existing methods on the relevant benchmarks while maintaining its speed advantage, and it needs only simple test conditions.

Assessment using GOT-10K
The GOT-10K [27] dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labeled bounding boxes. The dataset is organized using the WordNet [45] hierarchy and covers a majority of the more than 560 classes of real-world moving objects and more than 80 classes of motion patterns. The test set contains 84 object classes and 32 motion classes across 180 video segments, allowing efficient evaluation. For a fair comparison, the protocol for deep trackers was used, so that all approaches used the same training data provided by the dataset. The primary evaluation indicators for GOT-10K are the average overlap (AO) and success rate (SR). AO is the average overlap between all estimated boxes and ground-truth boxes. SR includes SR0.5 and SR0.75, which represent the rates of successfully tracked frames whose overlap exceeds 0.5 and 0.75, respectively.
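For reference, both metrics can be computed directly from per-frame IoUs; the small sketch below is our own illustration, not the official GOT-10K toolkit.

```python
# AO is the mean IoU over all frames; SR_t is the fraction of frames
# whose IoU exceeds threshold t.
import numpy as np

def iou(a, b):
    # Boxes as (x, y, w, h).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def got10k_metrics(preds, gts):
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    return ious.mean(), (ious > 0.5).mean(), (ious > 0.75).mean()
    # -> AO, SR0.5, SR0.75
```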
A comparison was conducted against the baselines provided by the GOT-10K website, including Siamese-based approaches such as SiamRPN++ [6] and SiamFC [5]. To show the effectiveness of the proposed CPN based on the anchor-free strategy, a comparison was also made, using released models and code, to three other tracking methods [12,13,16] selected on the basis of their anchor-free tracking strategies. As Fig. 2 shows, SiamCPN outperforms the other trackers. Table 1 gives detailed results using different metrics. AO and SR for OCEAN [13] are much lower than those listed in Ref. [13], perhaps due to unpublished hyperparameters (e.g., window penalty) that need careful selection for each test set. By contrast, when testing our model on a specific test set, fine-tuning of parameters is not required: our method uses fewer hyperparameters and is more convenient to use than the other methods investigated in this study.

Inference speed is an important factor in assessing model performance. Table 2 shows tracking frame rates, in fps, for different approaches; the following four approaches were tested under the same conditions (using a Titan X GPU): SiamFC [5], SiamRPN++ [6], SiamCAR [12], and our method. SiamRPN++ and SiamCAR were selected as they adopt the same feature extraction strategy as SiamCPN; SiamFC uses a shallow backbone network [18] for feature extraction and is commonly considered a fast single-object tracker. The inference speeds of SiamRPN++ and SiamCAR are 19.05 and 17.7 fps, respectively, whereas the proposed SiamCPN reaches 33.79 fps. Using the same feature extraction strategy on the same hardware, our method is faster than SiamCAR and SiamRPN++. Furthermore, SiamCPN is faster than SiamFC. These findings indicate that the proposed CPN offers excellent inference speed.

Assessment using LaSOT
The LaSOT [28] dataset contains more than 3.52 million frames of hand-labeled pictures across 1400 videos, and is by far the largest single-target tracking dataset with dense labeling. On average, each LaSOT sequence has 2512 frames, all carefully checked and manually marked, yielding approximately 3.52 million high-quality bounding box annotations. LaSOT contains 70 categories, each with 20 sequences. To assess existing trackers and provide a broad benchmark for future comparisons using LaSOT, 35 representative trackers were evaluated under different protocols, and their performance was analyzed using different metrics. Figure 3 shows the success and precision plots for LaSOT. A comparison was conducted with the top 15 trackers, including SiamRPN++ [6], MDNet [40], DSiam [46], and others. Our results are comparable to those of SiamRPN++ and better than those of the other baseline methods. The ability of our model to outperform most selected methods on a large-scale dataset shows that our method is feasible and effective.

Assessment using VOT2016
The VOT2016 [33] dataset includes 60 video sequences with different challenging factors for evaluating tracking performance. It uses two basic evaluation indicators, accuracy and robustness, and combines them into the expected average overlap (EAO) as the overall performance metric. Accuracy corresponds to the average overlap during successful tracking, while robustness is measured by the total number of tracking failures. To test the effectiveness and stability of our proposed anchor-free strategy, we set up comparative experiments with different trackers, including FCAF [14], which adopts an anchor-free strategy based on semantic segmentation for object tracking. As shown in Table 3, our model outperforms the other methods on all metrics selected in this study.

Assessment using OTB-100
The OTB-100 [32] dataset was developed from the OTB-50 [31] dataset, which consists of 50 fully annotated video sequences containing a total of 51 objects of different sizes and more than 29,000 frames. Each target is affected by various interfering factors during the tracking process. To fully evaluate the robustness of tracking algorithms with respect to factors that may affect tracking, OTB-50 provides 11 common video attribute annotations: illumination change, scale change, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out of view, background interference, and low resolution. Each video frame contains at least two attribute annotations. In addition, the OTB-50 dataset integrates 29 popular tracking algorithms and unifies input and output formats to facilitate performance evaluation. In 2015, the OTB-50 dataset was expanded to the OTB-100 dataset with 100 labeled video sequences. We compare our method with the top 9 baselines, including MUSTer [47], MEEM [48], STRUCK [49], and other methods whose tracking results are provided by the OTB website. As Figs. 4 and 5 show, our SiamCPN outperforms all other methods on both metrics.

Fig. 4 Precision evaluation on OTB-100 [32]. Our approach is superior to the comparators.

Fig. 5 Success evaluation on OTB-100 [32]. Our approach is superior to the comparators.

Assessment using UAV123
The UAV123 [34] dataset contains a total of 123 video sequences and more than 110k frames. All sequences are fully annotated with upright bounding boxes. We compared our method to 14 baselines provided by the UAV123 website, including MUSTer [47], SRDCF [37], MEEM [48], and other approaches. Success and precision of OPE were used to evaluate overall performance in this study. As shown in Figs. 6 and 7, our SiamCPN outperforms all other trackers on both metrics. In addition, as shown in Table 4, SiamCPN provides the best results while using a much simpler network than state-of-the-art RPN-based trackers, and it does not require heuristic parameter tuning.

Fig. 7 Success evaluation on UAV123 [34]. Our method is more accurate than the baseline and state-of-the-art approaches.

Conclusions
In this study, we decomposed the object-tracking problem into three sub-problems: predicting the center position, the center-point offset, and the object size. Our proposed SiamCPN can be treated as an encoding-decoding framework: through feature extraction and correlation calculation in the CPN, the differences between the two input frames are encoded into the response maps, and the CPN head then decodes the response maps into heatmaps for visual tracking. The proposed method is simpler and faster than many other Siamese-based methods and achieves excellent performance on various large-scale datasets such as GOT-10K and LaSOT. Our research provides a new approach for combining Siamese networks with anchor-free detection. In future work, we will continue to explore the potential of Siamese networks in tracking. Specifically, we will focus on enriching the expressive ability of the template branch by extracting more powerful features and finding target-related information from high-level semantics. However, it is difficult to address the various challenges of real scenes by relying only on visual features; incorporating temporal information into the model will make it more robust.