Robust monocular object pose tracking for large pose shift using 2D tracking

Monocular object pose tracking is a key technology for autonomous rendezvous between two moving platforms. However, rapid relative motion between the platforms causes large interframe pose shifts, which lead to tracking failure. From the derivation of the region-based pose tracking method and the theory of rigid body kinematics, we show that the stability of the color segmentation model and the linearization used in pose optimization are the keys to region-based monocular object pose tracking. A reliable metric named VoI is designed to measure interframe pose shifts; based on it, we argue that recovering motion continuity is a promising way to tackle the translation-dominant large pose shift issue. A 2D tracking method is then adopted to bridge the interframe motion continuity gap. For texture-rich objects, motion continuity is recovered through localized region-based pose transferring, performed by solving a PnP (Perspective-n-Point) problem within the tracked 2D bounding boxes of two adjacent frames. For texture-less objects, a direct translation approach is introduced to estimate an intermediate pose for the current frame. Finally, region-based pose refinement is exploited to obtain the final tracked pose. Experimental results on synthetic and real image sequences indicate that the proposed method achieves superior performance to state-of-the-art methods in tracking objects with large pose shifts.


Introduction
Monocular object pose tracking aims to estimate both the 3D rotation and the 3D translation of a rigid object relative to the camera from successive frames [1]. Such methods have been extensively studied in recent decades and employed in applications such as robot grasping, augmented reality, human-computer interaction, and medical navigation [2]. Among tracking methods, region-based pose tracking has achieved state-of-the-art performance. Owing to its low cost, ease of implementation, and good resistance to electromagnetic interference, region-based pose tracking is also a key technology for airborne vision guidance in the autonomous landing of unmanned aerial vehicles (UAVs).
Generally, an on-board camera can only work at a limited frame rate in practical applications, since it is constrained by data transmission rates, limited computational resources, and the power capacity of on-board hardware. Large pose shifts between adjacent frames emerge when jerky relative motion exists. From the perspective of rigid body kinematics, large pose shifts can be categorized as rotation-dominant, translation-dominant, and composite cases, which exhibit different characteristics in images. Theoretically, region-based pose tracking methods first use a specific color segmentation model to establish a posterior probability-based energy function, and then an optimization algorithm is exploited to calculate the pose iteratively. To keep the segmentation model statistically stable, a large interframe intersection of foregrounds is required. For correct convergence, the optimization must be initialized with a value very close to the ground truth, which means the interframe pose variation (especially the rotation) must be of a negligible scale. However, large interframe translation variations corrupt the stability of the segmentation model, and large rotation deviations violate the negligible pose variation assumption in optimization, resulting in pose tracking failures.
In some scenarios, such as airborne vision guidance of UAVs' autonomous landing, large pose shifts are caused by erratic camera movements or rapid translations of the target, which are characterized by large interframe translation variations with limited rotation differences. In this paper, we propose using 2D tracking to tackle this problem. An elaborate analysis is first provided of how large pose shifts affect region-based pose tracking. Then, a novel metric named VoI, the ratio of shared visible vertices over intersection, is put forward to measure interframe pose shifts. Substantial experiments show that monocular tracking becomes more difficult as the VoI decreases. Inspired by this, an efficient, robust 2D tracking method is exploited to increase the interframe VoIs and help bridge the motion continuity gap. For texture-rich objects, the intermediate pose can be calculated by solving a PnP (Perspective-n-Point) problem with a sufficient number of correspondence points within the tracked 2D bounding boxes. When the matched points are of insufficient quantity, a direct translation is performed to transfer the pose to the current frame. Finally, the intermediate pose is refined using a region-based method. Extensive evaluation experiments, using both synthetic and real images, were performed in comparison with the recently published representative methods of [3-5]. The results indicate that the proposed method achieves performance superior to state-of-the-art methods, especially for objects with large pose shifts.
To the authors' best knowledge, little previous work has been devoted to tracking objects with large pose shifts. The main contributions of this paper are as follows.
(1) A mathematical analysis of the impacts that large pose shifts exert on pose tracking is conducted, identifying corruption of the color segmentation model and violation of the pose linearization required for robust tracking.
(2) A reliable metric for measuring interframe pose shifts is proposed and validated, which motivates strategies for tackling translation-dominant large pose shift issues.
(3) An approach using 2D tracking to bridge the pose shift gap is proposed, with corresponding methods developed for texture-rich and texture-less objects, respectively.
The remainder of this paper is organized as follows. Section 2 provides a review of related work. In Sect. 3, qualitative and quantitative analyses of the region-based pose tracking method under large pose shifts are performed. The proposed method is illustrated in detail in Sect. 4. Experiments are described in Sect. 5. Section 6 concludes the paper.

Related work
Monocular pose determination methods have been extensively studied in the past decades. Pose estimation algorithms attempt to identify an object's pose from a single image. Among these algorithms, the most representative point-based and line-based methods usually acquire poses by solving the PnP and PnL (Perspective-n-Line) problems, respectively. Different from estimation methods, pose tracking utilizes distinctive characteristics of an object and the interframe image information to recursively track its pose. Based on the characteristics employed, existing monocular pose tracking methods can be divided into three categories [6]: edge-based methods, direct methods, and region-based methods. Edge-based methods [7,8] usually sample the projected edge of the 3D model into sparse control points. The correspondence is searched along the normal direction of each sampled control point. When the sum of the distances between the control points and the corresponding points reaches a specified minimum, the estimated pose is acquired. Edge-based methods rely on strong edge features and are therefore prone to failure when an image has a cluttered background or contains interference such as blurring and noise, which presents indistinguishable edges that may produce a local minimum in pose optimization. In this section we present a brief review of the direct methods, the region-based methods, and recently developed deep learning-based methods, which are most relevant to this paper.

Direct methods
Direct methods optimize the pose parameters by directly and densely aligning consecutive frames over a 3D object model to minimize the photometric error of the corresponding foreground pixels [9]. Given a known 3D model of the target, researchers have extended the classical Lucas-Kanade algorithm [10,11] from 2D image transformation to 3D object tracking. Related work includes plane tracking [12,13], 3D object tracking [14,15], and visual odometry [16,17]. Direct methods are heavily dependent on the assumption of photometric constancy, and are therefore sensitive to issues such as dynamic illumination, noise, and occlusion. To improve performance, Chen et al. [13] and Crivellaro et al. [14] replaced the raw pixel values with gradient orientations and a novel descriptor, respectively. Zhong et al. [18] directly aligned the images to dynamic templates rendered from a textured 3D object model to achieve better performance. With the help of the graphics processing unit (GPU), Pauwels et al. [19] proposed a novel multi-view pose tracking method, exploiting dense motion and depth cues with sparse keypoint correspondences to simultaneously track hundreds of objects in real time. Despite the improvement in robustness, these methods still rely on motion continuity. Therefore, they cannot effectively cope with the problem of tracking objects with large pose shifts.

Region-based methods
With a known 3D model, combining the 2D segmentation and 3D pose estimation problems, region-based methods track an object by searching for the pose that best segments the object from the background [5]. PWP3D [20] is the best-known region-based approach and the first to achieve real-time performance using GPU acceleration. Its energy function is built on pixel-wise posterior foreground and background membership. Building on PWP3D, subsequent methods mainly seek improvements in two aspects: pose optimization and the segmentation model. Tjaden et al. [21] replaced the first-order gradient descent optimization used in PWP3D with a novel Gauss-Newton-like optimization strategy. The segmentation model was also changed from global to local in [22]. Tjaden et al. [23] introduced a novel localized model using temporally consistent local color histograms to preserve temporal consistency. In [3], the authors summarized their previous work [21,23] and introduced a novel iteratively reweighted Gauss-Newton optimization method. Region-based methods with localized models [3,5,9,21,23-25] only use the pixels within a limited band along the projected object contour, and are therefore prone to failure when tracking symmetrical objects. Zhong et al. [9] introduced an approach combining direct and region-based methods by utilizing the pixels of the foreground's interior. This method achieves more robust performance when dealing with contour pose ambiguities. Li et al. [2] developed adaptively weighted local bundles to alleviate the negative effects of features in low-confidence regions. Combining edge and region features, Sun et al. [5] proposed a novel contour part model to track less distinct objects. Liu et al. [26] suggested using simplified distance functions to achieve better efficiency. Inspired by [27], Stoiber et al.
[1] put forward correspondence lines along pre-computed contour points to develop a sparse approach that is more efficient than previous region-based methods while achieving better tracking performance. A more detailed version of the work [1] is presented in [4]. Region-based methods have achieved significant improvement in performance among monocular pose tracking approaches. However, since they make the same assumption of limited motion variation as direct methods do, they cannot achieve satisfactory results in large interframe pose shift scenarios.

Deep learning-based methods
Deep learning has achieved excellent results in many image applications. It has also been used in monocular pose tracking recently. According to whether a computer aided design (CAD) model of the object is available, deep learning-based pose tracking can be classified into instance-level and category-level methods. Instance-level methods usually adopt refinement or optimization techniques. Deep model-based 6D (DMB6D) [28] designs a novel loss by aligning object contours in the image and achieves pose tracking by refining the initial pose. Since the pose refinement is a single forward pass, the predicted pose of DMB6D is coarse. The deep iterative matching (DeepIM) model [29] replaces the single forward refinement with iterative matching between the rendered image and the current frame, achieving more accurate results. In contrast to the refinement strategies, the pose Rao-Blackwellized particle filter (PoseRBPF) [30] adopts an optimization scheme, fusing particle filtering with a learned auto-encoder network for updating the pose. Its code book is obtained by discretizing the continuous rotation space in advance, which affects the time complexity linearly. Zhong et al. [31] proposed integrating learning-based segmentation with optimization-based pose estimation, and achieved good performance even under occlusion. Category-level methods do not need the CAD model as prior information, and usually perform pose tracking by detection or keypoints. Mono3D-tracking [32] utilizes 3D box depth-ordering matching for robust instance association and designs a motion learning module for accurate long-term motion extrapolation. Ahmadyan et al. [33] tracked the keypoints of 3D object bounding boxes and recovered the pose by solving the EPnP [34] problem.
In summary, despite their great potential for pose tracking, deep learning-based methods require a large amount of labeled training data that may not be available in real applications, especially for sensitive targets. In terms of efficiency, deep learning-based methods are usually inferior to traditional methods, which can achieve frame rates as high as 100 Hz. Therefore, in some specific practical applications, traditional methods are preferred for pose tracking.

Problem analysis
To fully understand the difficulty in object pose tracking caused by large pose shifts, a qualitative and experimental analysis of the characteristics of large pose shifts is first performed in this section, and a metric to measure interframe pose shifts is proposed.

Region-based pose tracking with large interframe pose shifts
For region-based pose tracking, the pose of the first frame and the 3D object model, in the form of triangle meshes and vertices as shown in Fig. 1, are utilized as prior information. A vertex of the 3D model is denoted as $\mathbf{P}_i \in \mathbb{R}^3$. The pose of the object relative to the camera is
$$\mathbf{T}^{C}_{O} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3), \quad \mathbf{R} \in SO(3), \quad \mathbf{t} \in \mathbb{R}^3,$$
where $SE(3)$ and $SO(3)$ represent the special Euclidean group and the special orthogonal group, respectively. The camera is pre-calibrated with fixed intrinsic parameters
$$\mathbf{K} = \begin{bmatrix} F_x & 0 & C_x \\ 0 & F_y & C_y \\ 0 & 0 & 1 \end{bmatrix},$$
where $(F_x, F_y)$ and $(C_x, C_y)$ denote the focal length and the principal point in pixels, respectively. All images are assumed to be rectified by removing lens distortion. A specific 3D point can be projected into the image with a pose $\mathbf{T}^{C}_{O}$, and the corresponding 2D image point is obtained by
$$\mathbf{x} = \pi\big(\mathbf{K}(\mathbf{T}^{C}_{O}\tilde{\mathbf{P}})_{3\times 1}\big), \quad (3)$$
with $\pi(\mathbf{P}) = [x/z, y/z]^{\top}$ and $\tilde{\mathbf{P}}$ the homogeneous form of $\mathbf{P}$. As proposed in [35], the object contour in the image can be represented by a level-set function $\Phi(\mathbf{x})$, which is usually quantified with a signed distance function:
$$\Phi(\mathbf{x}) = \begin{cases} -d(\mathbf{x}), & \mathbf{x} \in \Omega_f, \\ d(\mathbf{x}), & \mathbf{x} \in \Omega_b, \end{cases}$$
where $d(\mathbf{x}) = \min_{\mathbf{x}_c \in \mathbf{C}} \lVert \mathbf{x} - \mathbf{x}_c \rVert_2$, and $f$ and $b$ represent the localized foreground and background, respectively. According to [36], the pose variation $\Delta\mathbf{T}$ can be modeled with a twist $\xi \in \mathbb{R}^6$:
$$\Delta\mathbf{T} = \exp(\hat{\xi}), \quad \hat{\xi} = \begin{bmatrix} \hat{\mathbf{w}} & \mathbf{v} \\ \mathbf{0} & 0 \end{bmatrix} \in se(3), \quad (7)$$
where $\hat{\mathbf{w}} \in so(3)$, and $so(3)$ denotes the set of all $3 \times 3$ skew-symmetric matrices. With the pose variation $\xi$, a pixel $\mathbf{x}$ evolves to a new position $\mathbf{x}(\xi)$, as illustrated in Fig. 2. Specific smooth step functions $h_f$ and $h_b$ are exploited to quantify $\Phi(\mathbf{x})$ probabilistically. For example, the smooth step functions used in SRT3D [4] are based on the hyperbolic tangent function, as displayed in Fig. 3. The smooth step functions quantitatively capture the fact that a foreground pixel from the previous frame with a larger distance to the contour is more likely to remain in the foreground of the current frame, and vice versa. Based on the Bayes formula and level-set region-based segmentation, the posterior probability for $\Phi(\mathbf{x})$ can be quantitatively formulated as [20]
$$P(\Phi \mid \Omega) = \prod_{\mathbf{x} \in \Omega} \big( h_f(\Phi(\mathbf{x}))\, \bar{P}_f(\mathbf{x}) + h_b(\Phi(\mathbf{x}))\, \bar{P}_b(\mathbf{x}) \big),$$
where
$$\bar{P}_i(\mathbf{x}) = \frac{P(\mathbf{y} \mid m_i)}{\eta_f\, P(\mathbf{y} \mid m_f) + \eta_b\, P(\mathbf{y} \mid m_b)}, \quad i \in \{f, b\},$$
and $P(\mathbf{y} \mid m_f)$, $P(\mathbf{y} \mid m_b)$ are probability distributions that describe how likely a specific pixel is part of the foreground or background region, respectively. They are formulated as $P(\mathbf{y} \mid m_i) = P(\mathbf{y}, m_i)/P(m_i)$, and are usually calculated through normalized color histograms in local regions. $P(m_f)$ and $P(m_b)$ are model priors calculated as the ratio of the area of the corresponding model region to the total area of both regions, i.e., $P(m_i) = \eta_i/\eta$, $i \in \{f, b\}$, with $\eta = \eta_f + \eta_b$. The energy function is defined as
$$E(\xi) = -\sum_{\mathbf{x} \in \Omega} \log\big( h_f(\Phi(\mathbf{x}(\xi)))\, \bar{P}_f(\mathbf{x}) + h_b(\Phi(\mathbf{x}(\xi)))\, \bar{P}_b(\mathbf{x}) \big).$$
With a specific optimization method, the interframe pose variation $\xi$ is acquired, and the pose is updated as $\mathbf{T}(t) = \Delta\mathbf{T}\,\mathbf{T}(t-1)$. From the above derivation, it can be inferred that an evolved correspondence contour point should stay within the local region for correct pose tracking; this is a basic assumption of region-based pose tracking. For stable color statistics, the local regions are limited by the size of the foreground. Therefore, when a large in-plane displacement caused by a large translation variation exists, the current contour evolves beyond the scope of the local regions, which results in tracking failure.
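As a concrete sketch, the pinhole projection chain described above can be written in a few lines of Python (numpy only; the intrinsic values and pose below are illustrative assumptions, not parameters from the paper):

```python
import numpy as np

def project(P, R, t, K):
    """Project a 3D model point P (object frame) to pixel coordinates.

    Implements x = pi(K (R P + t)) with pi([x, y, z]) = [x/z, y/z].
    """
    P_cam = R @ P + t              # transform into the camera frame
    p_hom = K @ P_cam              # apply the intrinsic matrix
    return p_hom[:2] / p_hom[2]    # perspective division pi(.)

# Assumed example intrinsics (Fx, Fy, Cx, Cy) -- illustrative values only
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # identity rotation
t = np.array([0.0, 0.0, 2.0])      # object placed 2 m in front of the camera

x = project(np.array([0.0, 0.0, 0.0]), R, t, K)
# the object origin projects to the principal point (320, 240)
```

With the object origin on the optical axis, the projection lands exactly on the principal point, which is a quick sanity check for the implementation.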
From the perspective of rigid body kinematics, the relative motion between the camera and the object is smooth and continuous, and can be represented as a function of time:
$$\mathbf{X}_C(t) = \mathbf{R}(t)\mathbf{X}_O + \mathbf{t}(t). \quad (11)$$
According to the Rodrigues formula [36], we have $\mathbf{R}(t) = e^{\hat{\mathbf{w}}(t)}$, where $\hat{\mathbf{w}}$ is the skew-symmetric matrix determined by the angular velocity $\mathbf{w}$:
$$\hat{\mathbf{w}} = \begin{bmatrix} 0 & -w_3 & w_2 \\ w_3 & 0 & -w_1 \\ -w_2 & w_1 & 0 \end{bmatrix}.$$
The derivative of Equ. (11) writes
$$\dot{\mathbf{X}}_C(t) = \dot{\mathbf{R}}(t)\mathbf{X}_O + \dot{\mathbf{t}}(t), \quad (13)$$
where
$$\dot{\mathbf{R}}(t) = \hat{\mathbf{w}}(t)\mathbf{R}(t). \quad (14)$$
The velocity $\dot{\mathbf{t}}(t)$ is composed of the linear part $\mathbf{v}(t)$ caused by translation and the part $\hat{\mathbf{w}}(t)\mathbf{t}(t)$ caused by rotation:
$$\dot{\mathbf{t}}(t) = \mathbf{v}(t) + \hat{\mathbf{w}}(t)\mathbf{t}(t). \quad (15)$$
From Equs. (13), (14), and (15), the following formula holds:
$$\dot{\mathbf{X}}_C(t) = \hat{\mathbf{w}}(t)\mathbf{X}_C(t) + \mathbf{v}(t). \quad (16)$$
According to Equ. (7), Equ. (16) can be rewritten in homogeneous form as
$$\dot{\tilde{\mathbf{X}}}_C(t) = \hat{\xi}(t)\tilde{\mathbf{X}}_C(t). \quad (17)$$
The general solution of this differential equation is
$$\tilde{\mathbf{X}}_C(t) = e^{\hat{\xi} t}\,\tilde{\mathbf{X}}_C(0). \quad (18)$$
With a small time variation $\mathrm{d}t$, we have
$$\tilde{\mathbf{X}}_C(t + \mathrm{d}t) = e^{\hat{\xi}\,\mathrm{d}t}\,\tilde{\mathbf{X}}_C(t). \quad (19)$$
For iterative pose optimization, the variation $e^{\hat{\xi}}$ should be linearized using a Taylor series:
$$e^{\hat{\xi}} \approx \mathbf{I}_{4\times 4} + \hat{\xi}. \quad (20)$$
The linearization of Equ. (20) is performed in each iteration of pose optimization, and it is acceptable only when the pose variation is of a negligible scale. Furthermore, the linearization of $e^{\hat{\xi}}$ is essentially an approximation of the interframe rotation variation, because $\mathbf{R} = e^{\hat{\mathbf{w}}} \approx \mathbf{I}_{3\times 3} + \hat{\mathbf{w}}$ according to the Rodrigues formula. It can be concluded that large interframe rotation variations violate the pose linearization and cause pose optimization to fail mathematically. Unlike rotation-dominant large pose shifts, translation-dominant large pose shifts cause visual instability of the segmentation model, so it is natural to consider remedying the problem with specific visual techniques. Based on this consideration, we propose using 2D tracking to tackle the translation-dominant large pose shift issue.
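The claim that the linearization holds only for negligible rotation variations can be checked numerically: the first-order approximation $e^{\hat{\mathbf{w}}} \approx \mathbf{I} + \hat{\mathbf{w}}$ is accurate for small angles and degrades quickly for large ones. A minimal numpy sketch (the two rotation magnitudes below are illustrative choices):

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix of a 3-vector (the hat operator)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Exact rotation via the Rodrigues formula:
    R = I + sin(theta)/theta * W + (1 - cos(theta))/theta^2 * W^2."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    W = skew(w)
    return (np.eye(3) + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))

def linearization_error(w):
    """Frobenius distance between the exact exponential and I + W."""
    return np.linalg.norm(exp_so3(w) - (np.eye(3) + skew(w)))

small = linearization_error(np.array([0.01, 0.0, 0.0]))  # ~0.6 deg: tiny error
large = linearization_error(np.array([0.5, 0.0, 0.0]))   # ~29 deg: substantial error
```

For the ~0.6° rotation the approximation error is on the order of 1e-4, while for the ~29° rotation it grows by three orders of magnitude, mirroring why large interframe rotations break the linearized optimization.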

A novel metric on measuring the interframe pose shifts
To better understand the impacts that large pose shifts exert on pose tracking, it is necessary to design a visual metric that evaluates the degree of interframe pose shifts, instead of the ambiguous rotation and translation differences $\Delta\mathbf{R}$ and $\Delta\mathbf{t}$. A well-known metric for the overlapping region is the intersection over union ratio (IoU).
It has been widely used in object detection and segmentation. However, IoU cannot accurately measure pose shifts for monocular pose tracking, especially when large rotation variations exist for symmetric objects. Since interframe pose shifts describe the intensity of the relative motion between the camera and the object, we argue that a metric fusing 2D and 3D information will be more robust for pose shift evaluation. According to Equ. (3), the projection of the 3D model is formed by the projections of its vertices. In [23], the authors attached statistical color models to the vertices of the 3D model to preserve temporal consistency between successive frames. Inspired by [23], pose tracking can be regarded as the process of interframe correspondence identification using the projected vertices. From the perspective of image mapping, a sufficient number of corresponding vertices in the overlapping region is required to achieve robust monocular pose tracking. Based on the above considerations, we design a novel metric, the ratio of shared visible vertices over intersection (VoI), to measure interframe pose shifts. Specifically, we denote the set of all vertices of the 3D model as $\mathbb{V}$. The visible vertices projected into the foreground intersection in the previous and current frames form the sets $\mathbb{V}_{\mathrm{Pre}}$ and $\mathbb{V}_{\mathrm{Curr}}$, respectively. The set of shared visible vertices $\mathbb{V}_{\mathrm{Shared}}$ contains the identical vertices that are projected into the foreground intersection region in both the previous and current frames: $\mathbb{V}_{\mathrm{Shared}} = \mathbb{V}_{\mathrm{Pre}} \cap \mathbb{V}_{\mathrm{Curr}}$, as illustrated in Fig. 4. The VoI is defined as
$$\mathrm{VoI} = \frac{\mathrm{card}(\mathbb{V}_{\mathrm{Shared}})}{\mathrm{card}(\mathbb{V}_{\mathrm{Pre}} \cup \mathbb{V}_{\mathrm{Curr}})},$$
where $\mathrm{card}(\mathbb{A})$ denotes the number of elements in the set $\mathbb{A}$. VoI reflects the proportion of correspondence points within the intersection area of two adjacent frames.
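A minimal sketch of computing VoI from vertex visibility sets (toy index sets for illustration; in practice, visibility would be determined by rendering the model under each frame's pose):

```python
def voi(visible_prev, visible_curr):
    """VoI: shared visible vertices over all vertices in the intersection region.

    visible_prev / visible_curr are sets of vertex indices of the 3D model that
    project into the foreground intersection of the previous / current frame.
    """
    shared = visible_prev & visible_curr   # V_Shared = V_Pre intersect V_Curr
    union = visible_prev | visible_curr
    return len(shared) / len(union) if union else 0.0

# toy example: 5 of the 8 distinct vertices appear in both frames
v_pre = {0, 1, 2, 3, 4, 5, 6}
v_curr = {2, 3, 4, 5, 6, 7}
# voi(v_pre, v_curr) -> 5 shared / 8 in union = 0.625
```

A VoI near 1 indicates nearly identical vertex visibility across the two frames (continuous motion); a VoI near 0 indicates that few correspondences survive the pose shift.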

Experimental validation on VoI
Experiments using synthetic image sequences are conducted to evaluate the performance of VoI in measuring pose shifts. For comparison with IoU, the axisymmetric model "Baking Soda" from the RBOT dataset [3] is adopted for generating a synthetic sequence. The "Squirrel" model, which has abundant vertices and a relatively complex structure, as illustrated in Fig. 1, is chosen for VoI evaluation. Synthetic images are generated using the procedural Blender pipeline "BlenderProc" [37] with ground truth poses. To avoid the impact of cluttered backgrounds on pose tracking, the same gray background is used for all synthetic images. The reference frame settings for the camera and the object are displayed in Fig. 5. Evaluation experiments in three typical scenarios, translation-dominant pose shifts, rotation-dominant pose shifts, and composite motion, are further conducted. In addition, VoI is evaluated on the public RBOT dataset. The parameter settings for Experiments I ∼ IV are presented in Table 1, where (α, β, γ) represents the rotation, in degrees, around the three coordinate axes. The rotation and translation errors are computed as
$$e_{R}(t) = \cos^{-1}\Big(\frac{\operatorname{trace}\big(\mathbf{R}(t)^{\top}\mathbf{R}_{gt}(t)\big) - 1}{2}\Big), \qquad e_{t}(t) = \big\lVert \mathbf{t}(t) - \mathbf{t}_{gt}(t) \big\rVert_{2},$$
where $\mathbf{R}(t)$ and $\mathbf{t}(t)$ represent the calculated pose result at time $t$, with $\mathbf{R}_{gt}(t)$ and $\mathbf{t}_{gt}(t)$ the corresponding ground truths. Pose tracking is considered successful only when the rotation error is less than 5° and the translation error is less than 5 cm. If tracking is lost, the tracking process is reinitialized with the ground truth pose.
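The success criterion can be sketched as follows, using the standard geodesic rotation error and the Euclidean translation error (translations are assumed to be in meters, so the 5 cm threshold is 0.05):

```python
import numpy as np

def rotation_error_deg(R, R_gt):
    """Geodesic rotation error arccos((trace(R^T R_gt) - 1) / 2), in degrees."""
    c = (np.trace(R.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards rounding

def translation_error(t, t_gt):
    """Euclidean distance between estimated and ground truth translation."""
    return np.linalg.norm(t - t_gt)

def success(R, t, R_gt, t_gt, rot_tol_deg=5.0, trans_tol=0.05):
    """Tracking counts as successful when both errors stay below the thresholds."""
    return (rotation_error_deg(R, R_gt) < rot_tol_deg
            and translation_error(t, t_gt) < trans_tol)
```

For example, a 10° in-plane rotation offset with a perfect translation fails the criterion, while an exact pose trivially passes it.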
Experiment I The model "Baking Soda" is set to spin around its axis of symmetry at an angular speed of π/3 per frame.
Experiment II The object is set to move along the X-axis of the object reference frame at a constant speed of 4 mm per frame to simulate motion in which the translation increases gradually. The rendering produces 501 synthetic images. By inserting the first frame between all adjacent images, a sequence of 1001 frames is generated.
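The frame-insertion operation of Experiment II (placing the first frame between every pair of adjacent renders, turning 501 images into a 1001-frame sequence) can be sketched as:

```python
def interleave_with_first(frames):
    """Insert the first frame between every pair of adjacent frames.

    For 501 input frames this yields 501 + 500 = 1001 frames. Every second
    transition then jumps from frame 0 back to frame k, so the interframe
    pose shift grows steadily as k increases along the sequence.
    """
    out = []
    for i, f in enumerate(frames):
        out.append(f)
        if i < len(frames) - 1:
            out.append(frames[0])
    return out

seq = interleave_with_first(list(range(501)))
# len(seq) == 1001; the sequence alternates ..., k, 0, k + 1, 0, ...
```

This construction is what produces a controlled sweep of VoI values from high (early frames) to low (late frames) within a single sequence.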
Experiment III The angular speeds of the object around the three axes are set to a constant 0.18 degrees per frame, simulating the object gradually rotating away from its initial state. The same inserting operation as in Experiment II is conducted, producing a 1001-frame sequence.
Experiment IV To simulate large pose shifts caused by composite motion, the object's pose parameters are set to vary randomly within the preset ranges, as presented in Table 1, and 10,001 synthetic images are generated.
Experiment V The RBOT dataset [3] is widely used in monocular pose tracking. It includes 18 different objects, as presented in Fig. 7. For each object, the dataset contains four sequences of different complexity levels, namely "regular", "dynamic light", "noisy", and "occlusion" (see Fig. 8). Each sequence comprises 1001 synthetic images. The objects move continuously along the same pre-defined trajectory in all sequences. Since large pose shifts are not covered in the RBOT dataset, we extract images at intervals of one or two frames from the original sequences to form two new datasets, referred to as modified RBOT dataset A and modified RBOT dataset B, respectively. The "regular" sequences of the 12 models with abundant vertices (>1100 vertices) from the modified RBOT datasets are tested in this section. Evaluation results on the whole modified RBOT datasets are presented in Sect. 5.
For Experiment I, as illustrated in Fig. 9, although the IoUs remain high throughout the sequence (no less than 96%), pose tracking fails on all frames due to large out-of-plane rotation variations. In contrast, the VoI is low throughout the sequence (no higher than 35%), which means large pose shifts exist between adjacent frames. Since a 2D image suffers dimension loss (from 3D to 2D), the 2D metric IoU cannot represent 3D motion effectively. For Experiments II ∼ IV, the tracking success rates under different VoIs are counted. The VoI range is divided into 10 equal parts from 0 to 1 at an interval of 0.1, which, together with the bin of VoI equaling 0, forms 11 bins. As exhibited in Fig. 10, due to the irregular distribution of images over the bins, the local success rate cannot comprehensively reflect how tracking is influenced by VoI. Therefore, we also count the accumulated success rate (ASR) of each bin, defined as
$$\mathrm{ASR}_{i} = \frac{\sum_{j=1}^{i} S_{j}}{\sum_{j=1}^{i} N_{j}},$$
where $S_{j}$ and $N_{j}$ represent the number of successfully tracked frames and the total number of frames within the $j$-th bin, respectively. In particular, the accumulated success rate of the last bin is the overall average success rate of the image sequence.
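The accumulated success rate can be sketched as follows (toy bin counts for illustration; `successes[i]` and `counts[i]` play the roles of the per-bin success and frame counts):

```python
def accumulated_success_rates(successes, counts):
    """ASR_i = (sum of successes in bins 1..i) / (sum of frames in bins 1..i).

    Bins are ordered from low VoI to high VoI, so the last entry equals the
    overall average success rate of the whole sequence.
    """
    asr, s_cum, n_cum = [], 0, 0
    for s, n in zip(successes, counts):
        s_cum += s
        n_cum += n
        asr.append(s_cum / n_cum if n_cum else 0.0)
    return asr

# toy example with 3 bins of 10 frames each
rates = accumulated_success_rates([1, 8, 9], [10, 10, 10])
# rates == [0.1, 0.45, 0.6]; the final 0.6 is the sequence-wide success rate
```

Accumulating over bins smooths out the irregular per-bin frame counts that make local success rates noisy.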
The success rates with different VoIs in Experiments II ∼ IV are reported in Fig. 11. As the VoI decreases, the tracking success rate declines gradually. The results of Experiment V are presented in Fig. 12 and Fig. 13. Due to the sampling operation, the number of images in each sequence declines to 333 in modified RBOT dataset B. The range of each bin is expanded to 0.2 to improve the statistical stability of the results. The VoI difference of identical images from the original RBOT dataset and the two modified datasets is counted. An example VoI counting result on the three regular "Squirrel" sequences is displayed in Fig. 14, which indicates that the frame extraction operation leads to a decline in VoI and causes difficulty in pose tracking. As demonstrated in Fig. 12 and Fig. 13, the tracking success rate curves of the sequences in Experiment V show a similar trend: the success rate declines as the VoI decreases. Since an image sequence is a sparse representation of continuous motion, the shared visible vertices can be considered a temporalized characterization of the interframe motion continuity. Quantitatively, continuous motion corresponds to a high VoI and a small interframe pose difference. Thus, motion continuity recovery amounts to finding an intermediate pose that is much closer to the ground truth of the current frame than the pose of the previous frame.
In all region-based pose tracking methods, the pose of the previous frame is used to initialize the pose optimization of the current frame. When large pose shifts exist, the optimization, which requires a good initialization for convergence, produces erroneous results. A better initialization corresponding to a pose closer to the ground truth is therefore a promising way to cope with the large pose shift issue. As mentioned above, pose tracking can be regarded as a process of successive image mapping. For the rotation-dominant large pose shift issue, especially when large out-of-plane rotation variations exist, few correspondence points remain between two adjacent frames, which makes it quite difficult for traditional methods to perform correct pose tracking. However, although large in-plane displacements corrupt the stability of a statistical segmentation model, a large number of correspondence points still exist between adjacent frames, making it possible to achieve pose recovery. Motivated by this, we propose using 2D tracking to achieve this goal.

Method
In the practical applications focused on in this paper, large pose shifts are translation-dominant. We propose tackling this problem with the aid of 2D tracking. In each frame, 2D tracking is performed to acquire the bounding box. If the 2D tracking succeeds, interframe feature extraction and matching is conducted within the image patches constrained by the 2D bounding boxes. A prior pose-based strategy is adopted to eliminate outliers. If a sufficient number of correspondences can be identified, the approximate pose, referred to as the intermediate pose, is calculated by solving a PnP problem. In the case of few correspondences, a direct translation is performed to calculate the intermediate pose. Then, the intermediate pose is refined using a region-based method. If the 2D tracking instead fails at the beginning, the pose is estimated by directly refining the initial pose, as all region-based methods routinely do. The flowchart of the proposed method is illustrated in Fig. 15.

Motion continuity recovery using 2D tracking
2D tracking has been widely studied and many outstanding methods have been developed. Such methods track the 2D bounding box of an object across successive frames. We employ the STAPLE algorithm [38], which achieves excellent performance with high computational efficiency. More importantly, it works well in tracking objects with the translation-dominant large pose shifts studied in this paper.
Based on the tracked 2D bounding boxes, the intermediate pose in the current frame can be estimated. According to the characteristics of the object, two different processing strategies are proposed for texture-rich and texture-less objects, respectively.

Motion continuity recovery via localized region-based correspondence transferring
For texture-rich objects, the SURF algorithm [39] is chosen to detect feature points, and feature matching is performed with the fast library for approximate nearest neighbors (FLANN) within the tracked 2D bounding boxes. In this way, correspondences between 3D model points and image points can be transferred into the current frame, as shown in Fig. 16 (2D-3D correspondence transferring through feature matching in the tracked 2D bounding boxes, which are marked in green; $\mathbf{p} = \pi(\mathbf{K}(\mathbf{T}^{C}_{O}\tilde{\mathbf{P}})_{3\times 1})$ denotes the projection of a specific 3D point into image space, as in Equ. (3)). Since the previous frame has been perfectly segmented with the prior pose, a query mask is rendered for outlier rejection: feature points outside the query mask are excluded to eliminate erroneous matches. An example feature matching result between two real images is displayed in Fig. 17. The EPnP [34] and Levenberg-Marquardt [40] iterative algorithms with RANSAC (Random Sample Consensus) [41] are exploited to calculate the intermediate pose $\mathbf{T}_{init}(t) = (\mathbf{R}_{init}(t), \mathbf{t}_{init}(t))$ of the current frame.
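The query-mask outlier rejection can be sketched as follows (numpy only; the mask shape, silhouette box, and keypoint coordinates are hypothetical illustration values):

```python
import numpy as np

def reject_outside_mask(keypoints, mask):
    """Keep only matched keypoints whose pixel falls inside the rendered query mask.

    keypoints: (N, 2) array of (u, v) pixel coordinates in the previous frame.
    mask: boolean (H, W) array rendered from the prior pose (True = foreground).
    """
    u = np.round(keypoints[:, 0]).astype(int)
    v = np.round(keypoints[:, 1]).astype(int)
    # discard points outside the image bounds first
    inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
    keep = np.zeros(len(keypoints), dtype=bool)
    keep[inside] = mask[v[inside], u[inside]]   # foreground test for the rest
    return keypoints[keep]

mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True                   # assumed rendered silhouette
kps = np.array([[350.0, 150.0], [10.0, 10.0]])  # one inlier, one outlier
# reject_outside_mask(kps, mask) keeps only the point (350, 150)
```

The surviving 2D points, paired with their 3D model points, would then be handed to the PnP + RANSAC stage.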

Motion continuity recovery based on rotation-preserving assumption
For texture-less objects, the feature points within the bounding boxes are not sufficient for correspondence transferring. A simplified strategy of direct translation is thus proposed, as follows. Region-based methods use the previous pose for initialization, which means the pose is assumed constant between two adjacent frames and updated after pose optimization. Since the interframe shifts are translation-dominant, we keep the consistency assumption for the rotation and attempt to improve the initialization by identifying a more rational translation. The origin $[0, 0, 0, 1]^{\top}$ of the object reference frame is assumed to project to the centers of the tracked 2D bounding boxes. Based on Equ. (3), the projections of the origin in $I(t-1)$ and $I(t)$ can be expressed as
$$[u(t-1), v(t-1)]^{\top} = \pi\big(\mathbf{K}(\mathbf{T}(t-1)[0,0,0,1]^{\top})_{3\times 1}\big), \qquad [u(t), v(t)]^{\top} = \pi\big(\mathbf{K}(\mathbf{T}(t)[0,0,0,1]^{\top})_{3\times 1}\big), \quad (24)$$
where $[u(t-1), v(t-1)]^{\top}$ and $[u(t), v(t)]^{\top}$ are the centers of the tracked 2D bounding boxes in $I(t-1)$ and $I(t)$, respectively. Since STAPLE is scale-adaptive, the depth change is proportional to the scale change of the 2D bounding boxes. We define the scale factor $s$ as the ratio of the heights of the tracked 2D bounding boxes:
$$s = \frac{h(t-1)}{h(t)},$$
where $h(t-1)$ and $h(t)$ denote the heights of the tracked 2D bounding boxes in $I(t-1)$ and $I(t)$, respectively. According to Equ. (24), we estimate
$$z_{init} = s\, z(t-1), \qquad \mathbf{t}_{init} = \Big[\frac{\big(u(t) - C_x\big) z_{init}}{F_x},\; \frac{\big(v(t) - C_y\big) z_{init}}{F_y},\; z_{init}\Big]^{\top}.$$
The intermediate pose $\mathbf{T}_{init}(t) = (\mathbf{R}(t-1), \mathbf{t}_{init})$, instead of the pose of the previous frame, is used to initialize the algorithm. Then, a region-based method is adopted to refine the intermediate pose and achieve robust pose tracking.
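A minimal sketch of the direct translation step, assuming the scale factor s = h(t-1)/h(t) and back-projection of the current box center through the pinhole model (the intrinsic matrix and box values below are illustrative assumptions):

```python
import numpy as np

def direct_translation(t_prev, center_curr, h_prev, h_curr, K):
    """Estimate the intermediate translation from tracked 2D boxes.

    The rotation is kept fixed at R(t-1); the depth is scaled by
    s = h(t-1)/h(t), and the new (x, y) are back-projected from the
    current box center at that depth.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    s = h_prev / h_curr          # scale factor from box heights
    z = s * t_prev[2]            # depth shrinks when the box grows
    u, v = center_curr
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# assumed example intrinsics -- illustrative values only
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
t_init = direct_translation(np.array([0.0, 0.0, 2.0]),   # previous translation
                            (400.0, 240.0),              # current box center
                            100.0, 200.0, K)             # box heights h(t-1), h(t)
# box doubled in height -> depth halves to 1.0; box moved right -> x = 0.1
```

Here the bounding box doubling in height halves the estimated depth, and the horizontal shift of the box center yields a lateral translation, which together form the intermediate translation vector.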

Region-based pose refinement
The above process can increase the VoI and recover the motion continuity within successive frames, which meets the requirements of existing region-based pose tracking methods. The iterative pose optimization process proposed in [4] is adopted to refine the intermediate pose. The authors of [4] optimized the pose with a sparse viewpoint model and a correspondence line model. Their energy function takes the form of a conditional probability over the data D from all correspondence lines, where ω denotes the sparse correspondence line domain, d the contour distance, l the color of a pixel on the correspondence line, and N_υ the number of randomly sampled points on the contour from the sparse viewpoint model. Then, an optimization combining the Newton method with Tikhonov regularization is used to calculate the pose variation vector:

θ̂ = (H + diag(λ_r, λ_r, λ_r, λ_t, λ_t, λ_t))^{-1} g, (28)

where H and g are the Hessian matrix and gradient vector, respectively, and λ_r and λ_t are the regularization parameters for rotation and translation, respectively. In the end, the pose is updated with the estimated variation vector.

Experimental evaluation
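The regularized update in (28) can be sketched as a single linear solve; the sketch below assumes the rotation parameters occupy the first three entries of the 6-vector and uses a plain Gaussian elimination so that it stays self-contained:

```python
def regularized_newton_step(H, g, lam_r, lam_t):
    """One Tikhonov-regularized Newton step: solve (H + D) x = g,
    where D = diag(lam_r, lam_r, lam_r, lam_t, lam_t, lam_t).
    H is a 6x6 Hessian approximation, g the 6-vector gradient."""
    n = 6
    A = [row[:] for row in H]  # work on copies
    b = g[:]
    for i in range(3):
        A[i][i] += lam_r       # regularize rotation block
    for i in range(3, 6):
        A[i][i] += lam_t       # regularize translation block
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

# Toy example: H = 2*I, lam_r = 1, lam_t = 3, so (H + D) is diag(3,3,3,5,5,5)
H = [[2.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
delta = regularized_newton_step(H, [3.0, 3.0, 3.0, 5.0, 5.0, 5.0], 1.0, 3.0)
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

The regularization keeps the step well-conditioned when the Hessian is nearly singular, which is exactly the situation large pose shifts create for the linearized optimization.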

Experimental setting
Extensive experiments on synthetic and real image sequences are conducted to validate the proposed method. Three recently published representative pose tracking methods, [3, 4], and [5], are used for comparison. The proposed method is implemented in C++. For all the compared methods, the default parameter settings suggested by the authors are adopted. All experiments are performed on a laptop equipped with an Intel Core i7 quad-core CPU @ 2.2 GHz and an NVIDIA GeForce GTX 1060 GPU. The overall runtime of the proposed method is about 5 ms per synthetic image and 10 ms per real image with a resolution of 1920 × 1080 pixels, meeting the real-time requirements of practical applications. The tracking success rate metric adopted is identical to that of Sect. 3.3. In particular, the translation threshold for evaluation on the large pose shift synthetic image sequences is set to 10 cm, as their average depth is set to twice that of the RBOT dataset. If tracking is lost, the tracking process is reinitialized with the ground truth pose.
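The per-frame success test can be sketched as below. The 10 cm translation threshold is the one stated above; the 5° rotation threshold is the usual RBOT-style setting and is an assumption here:

```python
import math

def tracking_success(R_est, t_est, R_gt, t_gt,
                     rot_thresh_deg=5.0, trans_thresh=0.10):
    """Pose counts as successfully tracked if both errors are small.
    Rotation error is the angle of R_gt^T R_est; translation error is L2."""
    # trace(R_gt^T R_est) equals the elementwise sum of products
    tr = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for numeric safety
    rot_err = math.degrees(math.acos(c))
    trans_err = math.sqrt(sum((t_est[i] - t_gt[i]) ** 2 for i in range(3)))
    return rot_err < rot_thresh_deg and trans_err < trans_thresh

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ok = tracking_success(I3, (0.0, 0.0, 0.0), I3, (0.0, 0.0, 0.0))   # True
# A 10-degree rotation about z exceeds the 5-degree threshold
c10, s10 = math.cos(math.radians(10)), math.sin(math.radians(10))
Rz = [[c10, -s10, 0.0], [s10, c10, 0.0], [0.0, 0.0, 1.0]]
bad = tracking_success(Rz, (0.0, 0.0, 0.0), I3, (0.0, 0.0, 0.0))  # False
```

The success rate of a sequence is then simply the fraction of frames for which this test passes.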

Evaluation on large pose shift synthetic image sequences
Due to the scarcity of large pose shift datasets, we first generate synthetic image sequences using BlenderProc, as described in Sect. 3. All the 3D models and the dynamic backgrounds of the RBOT dataset are used. The reference frame setting is the same as that of Sect. 3.3, depicted in Fig. 5. The same pre-defined trajectory of continuous 6D motion is used to generate 18 sequences. Each sequence contains 1001 images with a resolution of 640 × 512 pixels. The detailed pose parameter settings are listed in Table 2.
Table 2 Parameters set for generating large pose shift synthetic image sequences

Table 3 Tracking success rates (%) of the proposed method in comparison to the methods of [3, 4] and [5] on the large pose shift synthetic image sequences. The optimal results are marked in bold (except results of ablation experiments listed in the fourth and the fifth rows)

The first frame of each sequence is used to initialize the tracking algorithm, and all frames are labeled with ground truth poses for evaluation. Success rates are summarized in Table 3. Figure 18 presents the visualized results on six example consecutive frames. The color masks in the second to fifth rows are reprojections with the pose results of the corresponding methods. An obvious difference between the reprojection and the foreground exists in every image of the second and third rows, indicating that the methods in [3] and [5] are unable to track the object in the example frames. The fourth row presents tracking results of [4], which achieves better performance in most images, except #180 and #181. The proposed method achieves much more robust pose tracking through the example image sequence, as displayed in the fifth row. The last row shows feature point extraction and matching in the localized regions.
As illustrated in Table 3, the proposed method achieves superior performance in all sequences and performs best on average. In terms of average success rate, the proposed method performs about 25%, 45% and 56% better than the methods of [3], [4] and [5], respectively. To validate the performance of direct translation and 2D tracking, the success rates of pose tracking using direct translation, and using direct translation with ground truth bounding boxes, are counted and listed in the fourth and fifth rows of Table 3, respectively. The results show that pose tracking using direct translation achieves much better performance than the variant fused with feature extraction and matching. In the RBOT dataset, most of the 3D models are not texture-rich enough for robust feature extraction and matching within the limited bounding boxes. Therefore, direct translation performs much more effectively on these sequences. Notably, when ground truth bounding boxes are used to perform direct translation, pose tracking achieves 100% success rates in 12 sequences, which proves the effectiveness of the direct translation strategy. For Bench Vise, which is quite texture-rich, the fused method performs slightly better than direct translation.
To validate the motion continuity recovery strategies of the proposed method, we count the VoI improvement results, as shown in Fig. 19. Since the models "Ape", "Baking Soda", "Broccoli Soup", "Clown", "Cube", and "Koala Candy" have too few vertices, sequences of these models are excluded from the VoI analysis. It can be seen from Fig. 19 that all 12 original sequences have low VoIs, indicating large pose shifts within successive frames. Using 2D tracking, the proposed method calculates an intermediate pose that corresponds to a view much closer to the current frame. The VoI is therefore significantly improved and interframe motion continuity recovered, allowing the proposed method to achieve robust pose tracking for objects with large pose shifts.
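The exact VoI definition is given in Sect. 3, outside this excerpt. One plausible instantiation, assuming VoI is the fraction of the current view's visible vertices that were also visible in the previous view, is:

```python
def voi(visible_prev, visible_cur):
    """Plausible sketch of the VoI metric (an assumption, not the paper's
    exact formula): the share of the current view's visible vertices that
    are also visible in the previous view. Inputs are sets of vertex ids
    obtained from visibility tests against the rendered 3D model."""
    if not visible_cur:
        return 0.0
    return len(visible_prev & visible_cur) / len(visible_cur)

# Two of the three currently visible vertices were visible before
score = voi({1, 2, 3}, {2, 3, 4})  # -> 2/3
```

Under this reading, a small interframe pose shift leaves most visible vertices shared (VoI near 1), while a large shift drops the overlap and hence the VoI, matching the trend reported in Fig. 19.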

Figure 18
Tracking results of six consecutive frames (indices 177 to 182) from the sequence of the model "Squirrel" in the large pose shift synthetic image sequences. The first row shows the input images with the object position of the previous frame in red contours. The tracking results of [3], [4], [5] and the proposed method are shown in the second to fifth rows, respectively. For each frame, the region of interest is enlarged for better visualization

Evaluation on the modified RBOT datasets
On the modified RBOT datasets, the large pose shifts manually generated by the frame extraction operation comprise both large rotations and large translations. Therefore, the pose shifts in the modified RBOT dataset B are more severe than those in dataset A.
As presented in Table 4, the success rates of the proposed method and the compared methods on dataset B are much lower than the corresponding values on dataset A. This observation indicates that, as the pose shift increases, pose tracking becomes increasingly difficult. Since large rotations significantly change the shape of object contours, and large translations place the corresponding contours beyond the scope of the search, the contour part model-based method [5] performs worst in every sequence of the modified RBOT datasets. The large pose shifts interrupt interframe motion continuity, making the pose optimization of [3] and [4] converge to erroneous results. Thus, the region-based methods [3] and [4] achieve much lower success rates in every sequence than their reported results on the original RBOT dataset. In contrast, although the simulated motion involved in the modified RBOT datasets is far more complex than that discussed in this paper, the proposed method still achieves better performance than the compared methods in most sequences and performs best on average. Due to large rotations, the STAPLE tracker cannot achieve the desired performance, and the motion continuity between frames cannot be recovered effectively. Consequently, the proposed method realizes only modest improvements in success rates over the other three methods; this also explains the limitations of the proposed method in tracking objects with large rotations. Compared with the modified RBOT dataset B, the interruption of motion is less severe in dataset A, so the proposed method achieves greater improvements over the methods of [3], [4], and [5] on the modified RBOT dataset A.
In the sequences of "Driller" and "Lamp", the two largest models in the RBOT dataset, the authors of [3] adopted relatively large local regions, with a radius of 40 pixels, making their method more robust to pose shifts than the other methods. Therefore, the region-based method [3] achieves the optimal performance on the "Driller" and "Lamp" sequences.

Evaluation on real image sequences
The proposed method is evaluated in comparison to the methods described in [3, 4], and [5] on real image sequences. The real images are captured by controlling a pre-calibrated camera to move rapidly relative to the object. The captured sequence is composed of 254 images with a resolution of 1920 × 1080 pixels. The object used in this work is 3D printed and repainted based on the "Squirrel" model from the RBOT dataset. The background is cluttered, as shown in the first row of Fig. 20. For all the methods, an identical initial pose is acquired by solving a PnP problem with correspondences manually identified between the first image and the 3D model of "Squirrel". Figure 20 presents the sampled sequence and tracking results. The example images are sampled at an interval of 50 frames to show the overall performance on the entire sequence. The region-based method [3] fails in most frames, with the overlays projected outside the field of view because of invalid pose results, as shown in the second row. Due to violation of the motion continuity premise, the temporal consistency proposed in [3] cannot be maintained, and the optimization tends to converge to erroneous poses. The method of [5] requires limited contour variations within successive frames, and thus it also fails to achieve the desired performance (see the third row). Similar to [3], the method of [4] cannot cope with tracking objects with large pose shifts in real sequences, as presented in the fourth row. In contrast, the STAPLE tracker works effectively in real sequences with cluttered backgrounds. Therefore, the interframe motion continuity can be recovered effectively, and the region-based refinement can converge to more accurate results. As displayed in the fifth row, the proposed method tracks the object pose through the image sequence reliably and effectively.

Discussion
The proposed method has been validated through extensive experiments on both synthetic and real sequences. The challenges faced by pose tracking, such as texture-less objects, noise, and occlusion, are covered in the experiments. In particular, in the RBOT dataset, the backgrounds are significantly cluttered and most of the objects are texture-less, which indicates that the proposed method is robust and can be generalized to other applications using different models and backgrounds. However, it still has some limitations. First, the proposed method is heavily dependent on the 2D tracking algorithm, which mainly uses color features to describe and identify an object. When the object is hard to distinguish from the background, which may be caused by scarcity of texture, noise, motion blur, or cluttered backgrounds, it is difficult or even impossible to perform 2D tracking. Second, an effective and efficient reinitialization process is missing from the proposed method. For real sequences whose ground truth poses are not available, pose tracking will fail completely when 2D tracking is lost. Inspired by [23], a template matching-based pose detection method is a promising way to perform reinitialization, which will be covered in our future work. Third, the proposed method cannot address the issue of large rotations, constrained by the limited performance of 2D tracking in large rotation scenarios. Recently, deep learning-based methods have been reported to show great potential in robust keypoint detection. In future work, we will integrate a keypoint detection network with our method to improve its robustness in tracking objects with large rotation variations.

Table 4 Tracking success rates (%) of the proposed method in comparison to the methods of [3, 4] and [5] on the modified RBOT datasets. The optimal results are marked in bold

Figure 20 Tracking results on real image sequences. The first row shows example input images. The object is denoted with a red contour. The second to fifth rows show reprojected overlays of [3], [4], [5] and the proposed method, respectively. For each frame, the region of interest is enlarged for better clarity

Conclusions
In this study, we focus on monocular pose tracking for objects with large pose shifts, and propose a 2D tracking-based method to tackle this problem. A theoretical analysis of how large pose shifts affect pose tracking is first performed. The study shows that the stability of the color segmentation model and linearization in pose optimization are the key to robust pose tracking. Then, a metric named VoI, measuring the pose shift based on the shared visible vertices between successive frames, is put forward. A 2D tracking method, STAPLE, is adopted to increase the VoI and attain intermediate poses for better initialization.
For texture-rich objects, the intermediate pose can be directly obtained by solving a PnP problem with a sufficient number of matched points within the tracked 2D bounding boxes. For texture-less objects with an insufficient number of matched points, a direct translation is performed. Finally, a region-based pose refinement is used to obtain the final tracked pose. Experiments on synthetic and real image sequences show that the proposed method achieves superior performance over the compared methods in tracking objects with large pose shifts. Although the proposed method performs well in tracking objects with large pose shifts mainly caused by translation, it still has some limitations. In future work, a reinitialization strategy will be developed to improve robustness in real scenarios. An efficient and effective keypoint detection method will also be adopted to increase the VoI between successive frames. This will enable our method to deal with large pose shifts, whether caused by translation or rotation.

Figure 1 Example of a 3D model widely used in monocular pose tracking. The red points represent the vertices of the model. The model "Squirrel" is taken from the RBOT dataset [3]

Figure 2 Local region and segmentation model used in region-based pose tracking methods

Figure 4 Illustration of the shared visible vertices between two successive frames. The yellow patches in (a) and (b) represent the foreground intersection of two successive frames. Projections of shared visible vertices are shown in the form of red points in (a), (b) and (c). The green points in (a) and (b) are projections of the visible vertices in the foreground intersection from the two corresponding views, excluding the shared visible vertices. The position of the object in the previous frame is shown in orange in (c). The green points in (d) represent all visible vertices on the 3D model in a specific view

Figure 6 An example of successive frames from the sequences tested in Experiments I ∼ IV. The yellow mask or gray mask with red contour indicates the position of the object in the previous frame. The last three images show large pose shifts mainly caused by translation, rotation and composite motion, respectively

Figure 7 3D models in the RBOT dataset

Figure 10 Illustration of the calculation of the accumulated success rate. The blue number depicts the number of images in a specific bin. The count of successful tracking in a bin is marked in red. The local success rate of each bin is indicated at the top of each color column. Numbers in bold represent the accumulated success rates. Due to an insufficient number of frames, an instantaneous reversal of the local success rate trend is seen in the current bin (B_3), while the accumulated success rate is still declining

Figure 11 Tracking success rates under different VoIs of Experiments II, III, and IV are shown in (a), (b) and (c), respectively

Figure 13 Tracking success rates under different VoIs in Experiment V on the modified RBOT dataset B

Figure 15 Flowchart of the proposed method

Figure 17 Example of feature point extraction and matching results within tracked 2D bounding boxes. Feature points outside the object areas are shown in enlarged image patches surrounded by red bounding boxes; they can be excluded based on the pose identified in the previous frame

Figure 19 VoI improvement through continuity recovery on large pose shift synthetic image sequences

Table 1 Parameters set for synthetic image rendering in VoI validation Experiments I ∼ IV
