1 Introduction

Monocular model-based 3D tracking methods are an essential part of computer vision. They are applied in a wide range of practical areas, from augmented reality to visual effects in cinema. 3D tracking implies iterative, frame-by-frame estimation of an object’s position and orientation relative to the camera, given an initial object pose. Figure 1a shows a scene fragment typical for such a task. A number of characteristics complicate tracking: the object is partially occluded, there are highlights and reflections, and the background is cluttered.

Fig. 1.

An example of a tracking algorithm applied to a single frame. (a) A fragment of the processed frame. Despite the partial occlusion of the tracked object (violin), most of its edges are visible. (b) The result of preliminary pose estimation. The model projection (white wireframe) does not coincide with the object image in the frame because the positions of the keypoints (black crosses) used to determine the pose were calculated inaccurately. (c) The object’s model after optimizing the contour energy (4) (purple lines). (Color figure online)

In recent years, a great number of 3D tracking methods have been developed. These can be classified by the image characteristics and 3D model features used for pose detection. Many approaches [12, 18, 19, 26] are based on calculating keypoints [13, 22] on the image and corresponding points on the 3D model. Such methods achieve high performance, are robust against partial occlusions and cluttered backgrounds, and are capable of processing fast object movement. At the same time, their use is limited for poorly textured objects because keypoint calculation requires texture.

Nevertheless, objects tend to have one or another characteristic shape that lends itself to being detected in an image. Therefore, many methods use the information on the edges of the object’s 3D model projection, i.e. on its contour (for illustration, see Fig. 1c). As a rule, the contour of a 3D object corresponds to the areas of an image characterized by a sharp and unidirectional change in intensity, i.e. its edges[1]. Methods [2, 4,5,6, 8, 9, 21, 25, 29] calculate the object pose from the correspondences between points on the 3D model contour and points on the image edges. Some approaches are based on optimizing energy functions that determine the degree of correspondence between the object projection in its current position and its silhouette in an image [15, 20, 23, 28]. The authors of [15] propose a tracking method based on optimizing an integral object contour energy whose value grows the more precisely the object contour fits the edges in the image. Energy-based approaches risk converging to local optima of the energy function that correspond to wrong object poses. In addition, edge-based methods share a drawback, namely the ambiguity of symmetrical objects, which have identical contours in different positions.

We present an approach that combines a keypoint-based method and integral edge energy optimization. A preliminary object pose is estimated using the Kanade–Lucas–Tomasi (KLT) [14, 22, 24] tracker. Then, we refine the object pose by optimizing the contour energy. We modify the energy function described in [15] to take into account the intensity of the image edges as well as the directions of the model contour and the edges. We limit the search area for the optimal energy function value using the information obtained from the preliminary object pose estimation. This allows for better convergence and makes it possible to partially overcome the issue of symmetrical objects. For optimization, we use the Basin-Hopping stochastic algorithm [27] to avoid local optima. For its local optimization step, we use an efficient quasi-Newton method [10] that handles the search area constraints directly. Contour detection is a time-critical operation executed thousands of times during the energy optimization. We propose an algorithm that performs the most time-consuming computations as a preliminary step, which increases tracking performance.

In this paper, we concentrate on frame-to-frame object tracking that relies only on the initial object pose and does not require any training on additional data (such as predefined reference frames).

We demonstrate the efficiency of our approach as compared to state-of-the-art 3D tracking methods in a series of experiments conducted on the OPT benchmark dataset [30]. The dataset includes 552 test videos with ground truth object pose annotation. The videos reproduce different lighting conditions and the main movement patterns performed at various speeds. To compare tracking efficiency, we use the metric proposed by the authors of the OPT dataset. Our test results demonstrate that, across the whole dataset, our method yielded a value greater by \(9.1\%\)–\(48.9\%\) of the possible maximum than the other methods tested. However, it should be noted that our method is not suitable for real-time applications.

Further, we discuss the work related to the topic of this article in Sect. 2. Then, we provide a brief overview of the mathematical notation in Sect. 3. In Sect. 4, we give a detailed description of the proposed tracking method. In Sect. 5, we discuss the experimental results and provide a comparison to other modern tracking methods. And, finally, in Sect. 5.4, we cover the limitations of our method as well as our plans for its further improvement.

2 Related Work

A detailed description of 3D tracking methods can be found in [11, 16]. In the present section, we shall discuss solely the approaches based on the information on the contour of 3D objects.

RAPID [8] was one of the first methods where the object pose was estimated based on the correspondences between points on a 3D model contour and points on the edges of the image. To detect correspondences, the authors use a local 1D search for the point with the largest edge intensity along the perpendicular to the contour of the projected 3D model. In subsequent papers [2, 4, 5, 9, 21], a number of improvements on the method were proposed; however, they were all based on an independent search for correspondences between points on the model contour and points on the edges of the image. The main drawback of this approach lies in the fact that the edge points of different objects can hardly be distinguished from each other. This leads to 3D-2D correspondences containing a great number of outliers that fail to preserve the object’s contour, especially in the case of a cluttered background, occlusions, or fast movement.

Other approaches introduce energy functions whose value grows with the correspondence between the 3D model projection and the image; the tracking goal can then be achieved by optimizing such an energy. In [15], the authors propose two variants of integrals along the contour of the 3D model projected onto the image gradient. One variant takes into account only the absolute gradient value, while the other accounts only for the direction of the gradient at points with a sufficient absolute gradient value. For energy optimization, the authors propose coordinate descent. For effective convergence, this method requires a very close approximation to the sought-for optimum, which is calculated with the help of a method similar to RAPID.

The approach described in the present article uses a similar energy function. To improve convergence and overcome the issue of local optima, we use a global optimization technique based on the Basin-Hopping algorithm [27]. The methods in [2, 3, 9, 28] use a particle filter to avoid local optima. In [3], a keypoint-based approach is used for particle initialization, which leads to a more robust algorithm and less ambiguity when tracking symmetrical objects. For successful convergence under noise and fast object movement, a great number of particles becomes necessary, which has a negative impact on tracking performance. Unlike the particle filter, the Basin-Hopping algorithm takes into account the information on the local optima that have already been identified and makes it possible to use non-trivial termination criteria, thus avoiding excessive calculations.

In addition to the information on the edges in the image, many methods also use color information. In [21, 29], the color distribution in the object and its background around an edge point is used to eliminate false 3D-2D correspondences. The methods described in [20, 23] optimize an energy based on the color distribution in the whole object and background. Such approaches are robust against partial occlusions and cluttered scenes; however, they are sensitive to changes in lighting and to ambiguities arising from similar coloring of the object and its background.

3 Input Data and Pose Parametrization

This section provides a brief overview of the mathematical notation used in the present article.

The tracking algorithm accepts input data in the form of sequential grayscale image frames \(I_i :\varOmega \rightarrow \mathbb {R}\) (where \(\varOmega \subset \mathbb {R}^2\) is the image domain). The intensity of a point \( u \in \varOmega \) in frame i equals \(I_i( u )\). In cases where the frame number is unimportant, the image is labeled \(I\).

The intrinsic parameters of the camera used to capture the input frames are assumed to be constant. They are given as the standard intrinsic matrix

$$\begin{aligned} \varvec{K} = \left[ \begin{array}{ccc} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{array} \right] \text {,} \end{aligned}$$

where \(f_x\) and \(f_y\) are the focal lengths and \((c_x, c_y)\) is the principal point.
The 3D model \(\mathcal {M}\) describing the tracked object is defined as \((V_\mathcal {M}, F_\mathcal {M})\), where \(V_\mathcal {M}\subset \mathbb {R}^3\) is a finite set of model vertices and \(F_\mathcal {M}\) is the set of vertex triplets defining the model faces.

Object pose within frame i has six degrees of freedom and is described as

$$\begin{aligned} \varvec{T}_i = \left[ \begin{array}{ccc|c} & \varvec{R}_i & & t _i \\ \hline 0 & 0 & 0 & 1 \end{array} \right] \in SE (3)\,\text {,} \end{aligned}$$

where \(\varvec{R}_i \in SO (3)\) defines the orientation of the model and \( t _i \in \mathbb {R}^3\) defines its translation.

The projection \( u \in \mathbb {R}^2\) of a point on the model surface \( x \in \mathbb {R}^3\) is described by a standard camera model

$$\begin{aligned} \mathbf {u} = \varvec{K}\cdot [\varvec{R}_i | t _i] \cdot \mathbf {x} \,\text {,} \end{aligned}$$

where \(\mathbf {u}\) and \(\mathbf {x}\) are vectors \( u \) and \( x \) in homogeneous coordinates. The function performing the projection \( x \mapsto u \) in the pose \(\varvec{T}_i\) shall be designated as \(\pi _{\varvec{T}_i}\).
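As an illustration, the projection above can be sketched in a few lines of Python; the helper name and the numeric values of \(\varvec{K}\) are ours, chosen only for this example:

```python
import numpy as np

def project(K, R, t, x):
    """Project a 3D point x onto the image plane for pose (R, t),
    implementing u = K [R | t] x with a final perspective division."""
    x_cam = R @ x + t        # transform the point into camera coordinates
    u_h = K @ x_cam          # homogeneous image coordinates
    return u_h[:2] / u_h[2]  # perspective division

# Example: an identity pose and a point on the optical axis at depth 2
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
u = project(K, np.eye(3), np.zeros(3), np.array([0.0, 0.0, 2.0]))
# u lands on the principal point (320, 240)
```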

4 3D Tracking by Combining Keypoints and Contour Energy

We solve the tracking task in two steps. During the first step, the Kanade–Lucas–Tomasi (KLT) tracker [14, 22, 24] is applied to estimate object pose. During the second step, object pose is refined by optimizing the objective function, i.e. the contour energy optimization.

The present section provides a detailed description of the proposed tracking method. First, we give a brief overview of the initial object pose estimation algorithm. We then concentrate on the pose refinement method: we describe the contour energy in detail, discuss its optimization with an overview of the global optimization method and the local optimization method it utilizes, and propose a step-by-step procedure for refining the object pose. After that, we discuss the estimation of bound constraints for the energy optimum search area. Finally, we describe the object contour detection algorithm.

4.1 Initial Object Pose Estimation

We estimate the initial object pose with the help of the well-known KLT tracker. On the frames where the object pose is known, we identify 2D keypoints and determine the corresponding 3D points on the surface of the model. Keypoint movement across the image is tracked by optical flow calculation. The known 2D-3D correspondences are then used to estimate the object pose by solving the P\(n\)P problem [11], with RANSAC [7] applied to eliminate outliers.

When RANSAC fails to find a solution with an acceptable percentage of outliers after a sufficiently large number of iterations, or when the number of tracked points is very small, we estimate the object pose in frame i by extrapolation based on the poses in frames \(i - 1\) and \(i - 2\).
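The paper does not spell out the extrapolation scheme; one common choice, sketched below under that assumption, is constant-velocity extrapolation in \(SE(3)\): the relative motion between frames \(i-2\) and \(i-1\) is re-applied to the pose of frame \(i-1\).

```python
import numpy as np

def extrapolate_pose(T_prev2, T_prev1):
    """Constant-velocity pose extrapolation in SE(3): re-apply the
    inter-frame motion delta = T_{i-1} * T_{i-2}^{-1} to T_{i-1}.
    A hypothetical sketch; the paper's exact scheme may differ."""
    delta = T_prev1 @ np.linalg.inv(T_prev2)  # motion between the two frames
    return delta @ T_prev1

def translation(tx):
    """Helper building a pure-translation 4x4 pose along x."""
    T = np.eye(4)
    T[0, 3] = tx
    return T

# Example: the object moved 1 unit along x per frame, so the
# extrapolated pose continues the motion to x = 2
T_i = extrapolate_pose(translation(0.0), translation(1.0))
```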

4.2 Contour Energy

We view pose optimization as the matching of the model contour with the object edges in the image. Model contours are understood as two types of lines: firstly, the outer contour of the projected model; secondly, the projections of visible sharp model edges. Sharp model edges are the edges where adjacent faces meet at an angle no greater than a pre-selected threshold \(\varphi \). Figure 1c provides an example of such matching.

Developing the ideas described in [15], we suggest the following energy function as a quantitative expression of matching quality:

$$\begin{aligned} \mathcal {E}(\varvec{T}) = \frac{\oint _{C_{\varvec{T}}} \left| \left\langle \nabla I({p_{\varvec{T}}}(s)), {n_{\varvec{T}}}(s) \right\rangle \right| \, ds}{\oint _{C_{\varvec{T}}} ds} \,\text {,} \end{aligned}$$

where \({C_{\varvec{T}}}\) denotes the model contour lines in pose \(\varvec{T}\), \({p_{\varvec{T}}}\) is the function returning the image coordinates of a contour point, and \({n_{\varvec{T}}}\) is the function returning the unit normal vector to the contour.

The energy is an integral characteristic of the contour (numerator) normalized by the length of the contour (denominator). The division by the contour length prevents longer contours from being preferred over shorter ones.

Let us consider the numerator expression under the integral sign:

$$\begin{aligned} \left| \left\langle \nabla I({p_{\varvec{T}}}(s)), {n_{\varvec{T}}}(s) \right\rangle \right| \text {.} \end{aligned}$$

The image gradient \(\nabla I({p_{\varvec{T}}}(s))\) shows the direction and strength of the intensity change at a point. If there is an edge, the gradient is perpendicular to it. The unit vector \({n_{\varvec{T}}}(s)\) is perpendicular to the model contour. The absolute value of their scalar product grows with the visual significance of the edge in the image (i.e. with the gradient magnitude) and with the agreement between the edge direction and the direction of the model contour at the current point.

Therefore, the value \(\mathcal {E}(\varvec{T})\) grows with the correspondence between the model contour in pose \(\varvec{T}\) and the edges in the image, and with the visual significance of those edges (for example, see Fig. 1c). Given that the object edges are sufficiently visible, the energy will, in most cases, be maximal in the sought-for pose. Therefore, the optimal object pose in frame i can be found as

$$\begin{aligned} \varvec{T}_i = \mathop {\mathrm {arg\,max}}\limits _{\varvec{T}} \mathcal {E}(\varvec{T})\text {.} \end{aligned}$$

Due to its integral nature, the energy function (4) is sufficiently robust against occlusions. Its disadvantage lies in the fact that a wrong object pose can potentially have greater energy than the true pose. However, practical experience shows that such cases are quite rare. In addition, ambiguities may arise in detecting the pose of objects of a symmetrical (e.g., cylindrical) shape.

To implement the evaluation of the contour energy, it is necessary to discretize expression (4):

$$\begin{aligned} \mathcal {E}(\varvec{T}) \approx \frac{1}{|{\tilde{C}_{\varvec{T}}}|} \sum _{s \in {\tilde{C}_{\varvec{T}}}} \left| \left\langle \nabla I({p_{\varvec{T}}}(s)), {n_{\varvec{T}}}(s) \right\rangle \right| \,\text {,} \end{aligned}$$

where \({\tilde{C}_{\varvec{T}}}\) is the finite set of points uniformly distributed along the contour lines.

Contour line detection is described in detail in Sect. 4.5.
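The discretized energy above amounts to averaging the absolute dot product of the image gradient and the contour normal over the sampled contour points. A minimal numpy sketch, with illustrative names of our own choosing:

```python
import numpy as np

def contour_energy(grad_x, grad_y, points, normals):
    """Discretized contour energy: mean |<grad I, n>| over contour samples.

    grad_x, grad_y : image gradient components (H x W arrays)
    points  : (N, 2) integer (x, y) pixel coordinates of contour samples
    normals : (N, 2) unit normals of the model contour at those samples
    """
    gx = grad_x[points[:, 1], points[:, 0]]   # sample the gradient on the contour
    gy = grad_y[points[:, 1], points[:, 0]]
    dots = gx * normals[:, 0] + gy * normals[:, 1]
    return np.abs(dots).mean()                # normalization by contour length

# Toy example: a vertical step edge at column 4; a vertical contour with
# horizontal normals lands exactly on it
img = np.zeros((8, 8))
img[:, 4:] = 1.0
gy_img, gx_img = np.gradient(img)             # np.gradient returns (d/drow, d/dcol)
pts = np.array([[4, y] for y in range(1, 7)])
nrm = np.tile([1.0, 0.0], (6, 1))
E = contour_energy(gx_img, gy_img, pts, nrm)  # central differences give 0.5
```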

4.3 Energy Optimization Method

In most cases, the contour energy (4) has a pronounced global optimum near the sought-for object pose and, at the same time, a plethora of local optima at some distance from it (see Fig. 2).

Fig. 2.

Examples of the contour energy (4) near an optimum. In the top row, the energy was calculated from the original image; in the bottom row, it was calculated from an image blurred by convolution with a Gaussian kernel. Label \(\varvec{R}_z\) denotes the dependence on the rotation around axis z (the other degrees of freedom being fixed); \( t _x\) and \( t _y\) denote the dependence on the translation along axes x and y, respectively. The areas demonstrated here correspond to \(60^\circ \) for the rotation and approximately 0.2 of the object size.

In many cases, the first approximation obtained during object pose estimation does not lie in the concave region near the sought-for optimum. On the other hand, we may assume that this first approximation is good enough; therefore, we propose limiting the search area and then applying an optimization method capable of overcoming the local optima. A detailed description of search area bounds estimation is given in Sect. 4.4.

For optimization, we selected the version of the Basin-Hopping stochastic algorithm described in [27]. Basin-Hopping is an iterative algorithm. At each step, a random hop within the search area is made; after that, local optimization is performed, and the obtained local optimum is either accepted or rejected based on the Metropolis criterion [17]. The algorithm stops once the maximum number of iterations has been reached, or if several previous steps have not improved the optimum and the minimal number of iterations has been performed.

For local optimization, the SLSQP algorithm [10] was selected. It combines support for search area constraints with the efficiency of quasi-Newton methods.

The energy gradient is computed numerically.
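The combination of Basin-Hopping with a bounded SLSQP local step can be sketched with SciPy on a toy 1-D stand-in for the negated energy (SciPy minimizes, so the energy is negated); the function, bounds, and step size below are illustrative only, not the paper's settings:

```python
import numpy as np
from scipy.optimize import basinhopping

# Toy stand-in for the negated contour energy: a pronounced global
# optimum at x = 0 surrounded by local optima, as in Fig. 2
def neg_energy(x):
    return -np.cos(3.0 * x[0]) + 0.1 * x[0] ** 2

minimizer_kwargs = {
    "method": "SLSQP",        # quasi-Newton local step honoring the bounds
    "bounds": [(-4.0, 4.0)],  # search-area constraint
}
result = basinhopping(neg_energy, x0=[3.0], niter=100, stepsize=2.0,
                      minimizer_kwargs=minimizer_kwargs, seed=0)
# A plain SLSQP run from x0 = 3 would stop in the local optimum near
# x ~ 2.1; the random hops let the search reach the global optimum at 0
```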

To improve convergence, we blur the frame by convolution \(G_{\sigma }*I\) with a Gaussian kernel \(G_{\sigma }\) before optimization. Blurring effectively suppresses noise and, along with it, high-frequency details. This has a positive impact on the smoothness of the energy function but may cause a slight displacement of the sought-for optimum (see Fig. 2). We eliminate this displacement by adding a final optimization step using the original image.

4.4 Search Area Bounds

During initial pose estimation (Sect. 4.1), the object pose can be obtained either with the help of the KLT tracker or through extrapolation based on the previous poses. The former case provides extra data; therefore, we shall consider the two cases separately.

In case the object pose could be obtained only through extrapolation, we propose selecting the maximal deviations from the estimated pose based on an assumption about the degree of object movement between consecutive frames. In our experiments, we limited the rotation around each Euler angle to \(\pm 30^\circ \), the translation along the camera axis to \(\pm 0.2\) of the object size, and the translation along the other axes to \(\pm 0.1\) of the object size.

When it succeeds, the KLT tracker estimates the object pose by solving the P\(n\)P problem on a set of 2D and 3D point correspondences \(\{( u _1, x _1), \dots , ( u _{m}, x _{m})\}\), minimizing the average reprojection error \(\widetilde{\varvec{T}} = \mathop {\mathrm {arg\,min}}\nolimits _{\varvec{T}} e(\varvec{T})\), where

$$\begin{aligned} e(\varvec{T}) = \frac{1}{m} \sum _{j=1}^{m} \left\| \pi _{\varvec{T}}( x _j) - u _j \right\| \text {.} \end{aligned}$$

The average reprojection error \(e(\varvec{T})\) can be understood as a measure of the consistency of object pose \(\varvec{T}\) with the positions of the keypoints used to reach the solution: the smaller the error, the greater the consistency. Due to errors arising during keypoint tracking, the object pose most consistent with them may differ from the true pose, but it will lie in its near neighborhood.

We propose selecting the search area bounds in such a way that, when approaching them, \(e\) does not increase by more than a pre-selected value \(\varepsilon \), i.e. the consistency with the keypoints does not deteriorate below a given threshold:

$$\begin{aligned} e(\varvec{T}) - e(\widetilde{\varvec{T}}) \le \varepsilon \text {.} \end{aligned}$$

The optimization methods proposed in Sect. 4.3 make it possible to set such non-linear constraints on the search area directly.
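The wiring of such a constraint into SLSQP can be sketched as follows; here a toy quadratic stands in for the average reprojection error and a toy negated energy stands in for the objective, and all names and numbers are ours:

```python
import numpy as np
from scipy.optimize import minimize

def e(p):            # stand-in reprojection error, minimal at p = 1 (the PnP pose)
    return (p[0] - 1.0) ** 2

def neg_energy(p):   # stand-in negated energy, whose optimum p = 3 lies outside the bound
    return (p[0] - 3.0) ** 2

eps = 1.0                    # pre-selected allowed increase of e
e_best = e(np.array([1.0]))  # e at the PnP solution T~
cons = {"type": "ineq", "fun": lambda p: eps - (e(p) - e_best)}

res = minimize(neg_energy, x0=[1.0], method="SLSQP", constraints=[cons])
# The optimizer stops where e(p) - e_best reaches eps, i.e. at p = 2
```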

The resulting search area size is quite natural in practice. For example, it is often the case that a very noticeable change of the object pose in certain directions leads to a relatively small change in the average reprojection error. This mostly concerns movement along the camera axis. It can also happen when keypoints are not evenly distributed over the object and cover it only partially. Along with errors in keypoint positions, this normally results in noticeably inaccurate object pose estimation; see, for example, Fig. 1b. Setting the bounds based on the average reprojection error makes it possible to obtain a broad enough search area and then use energy optimization to find a rather precise pose, as shown in Fig. 1c.

4.5 Model Contour and Sharp Edges Detection

Visible contour and sharp edge detection algorithms can be grouped into two categories. The first type is based on model rendering. Such algorithms are precise and allow for correct processing of self-occlusions; however, rendering requires time-consuming computation. The second type is based on the analysis of the model itself: its edges and the spatial relationship between adjacent faces. These algorithms are computationally cheaper, but they fail to account for self-occlusion.

We propose a combined approach. Prior to tracking, we perform the most time-consuming calculations and gather data on the visibility of model faces from various points of view. After that, while estimating the contours during tracking, in a single run, we identify sharp and contour edges and process self-occlusion with the help of this data.

When calculating energy (4), the object pose \(\varvec{T}\) for which the contours need to be estimated is known. All edges of model \(\mathcal {M}\) shared by two faces are reviewed, and the ones lying on the contour or formed by faces meeting at an acute angle are selected.

Fig. 3.

Edge \( p q \) type recognition. (a) A contour edge is formed by front face \(( v _1, q , p )\) (marked in green) and back face \(( v _2, p , q )\) (marked by pink hatching). (b) A sharp edge is formed by two front faces meeting at an angle no greater than \(\varphi \). (c) Our contour detection algorithm rejects the edge if any adjacent front face is invisible from all three of the nearest preprocessed viewpoint directions. (Color figure online)

Let us consider edge \( p q \) and its adjacent faces \(( v _1, q , p )\) and \(( v _2, p , q )\). The edge is considered sharp if both of its faces are turned with their outer surface towards the camera and the angle between them is no greater than \(\varphi \) (see Fig. 3b). The edge is considered to be lying on the contour if one of its faces is turned towards the camera and the other is not (see Fig. 3a). Obviously, some of the edges identified in this manner will be invisible in case of self-occlusions. To eliminate a major part of the invisible edges, it is sufficient to know which of the front faces are invisible from the current point of view.
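The edge classification test can be sketched from the two adjacent face normals and the viewing direction; the function name, the threshold value, and the example normals below are illustrative:

```python
import numpy as np

def classify_edge(n1, n2, view_dir, phi_deg=60.0):
    """Classify a model edge from its adjacent outward face normals.

    Returns 'contour' if exactly one face is front-facing, 'sharp' if
    both are front-facing and the faces meet at an angle <= phi, and
    None otherwise. A sketch of the test illustrated in Fig. 3.
    """
    front1 = np.dot(n1, view_dir) < 0.0      # outer normal towards the camera
    front2 = np.dot(n2, view_dir) < 0.0
    if front1 != front2:
        return "contour"                     # one front face, one back face
    if front1 and front2:
        normal_angle = np.degrees(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0)))
        dihedral = 180.0 - normal_angle      # angle at which the faces meet
        if dihedral <= phi_deg:
            return "sharp"
    return None

view = np.array([0.0, 0.0, 1.0])             # camera looks along +z
# A 90-degree crease seen from the front: both faces are front-facing
n_a = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])
n_b = np.array([0.0,  np.sqrt(0.5), -np.sqrt(0.5)])
kind = classify_edge(n_a, n_b, view, phi_deg=100.0)  # classified as 'sharp'
```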

Data on model face visibility is collected prior to tracking. For this purpose, the model is rendered from several viewpoint directions with an orthographic camera. To reach a uniform distribution of these directions, we use the vertices of an icosphere (a recursively subdivided icosahedron) surrounding the model. Given an object pose, we determine the three preprocessed viewpoint directions that are close to the current direction and form a face of the icosphere, as shown in Fig. 3c. The faces invisible from these three directions are likely to be invisible in the current pose. Knowing such faces, we can exclude the adjacent contour and sharp edges. A more formal description of self-occlusion processing can be found in the supplementary material.

Having detected the contour and sharp edges and having processed self-occlusion, it is easy to project the edges onto the frame and select points for the numerical calculation with formula (7).

An obvious drawback of the described approach is that its self-occlusion processing is not perfectly precise, which can lead to a certain percentage of wrongly detected contours. In the true model pose, such contours are unlikely to correspond to the edges in the image, but they may coincide with some edges in a wrong pose. However, in most cases, a small number of such false contours does not lead to wrong pose estimation.

The advantage of the described algorithm is that the calculations performed during tracking are relatively simple and less computationally complex compared to algorithms that require rendering.

5 Evaluation

To prove the efficiency of the method described in the present article, we have tested it on the OPT dataset [30] and compared our results with those obtained by state-of-the-art tracking approaches. The test dataset includes RGB video recordings of tracked objects, their true poses, and 3D models. The videos are \(1920\times 1080\) and have been recorded with the help of a programmable robotic arm under various lighting conditions. Object movement patterns are numerous and diverse, and the movement velocity also varies. OPT contains six models of various geometric complexity. For each of these, 92 test videos of varying duration (23–600 frames) have been made, the total number of frames being 79,968. The diverse test data covers most motion scenarios.

Our method has been implemented in C++ and is part of a software product intended for 3D tracking for film post-production. All experiments were conducted on a computer with an Intel i7-6700 3.4 GHz processor and 32 GB of RAM. Details of our method settings are given in supplementary material.

In the present section, we first describe the evaluation approach used to compare our method with other tracking methods. Then, we show the efficiency of object pose optimization based on the contour energy. Next, we compare the results of our approach to those obtained by other modern tracking methods. To conclude, we discuss the advantages and drawbacks of our method as well as potential ways for its improvement.

5.1 Evaluation Metric

Given the known true object pose \(\hat{\varvec{T}}_i\) in frame i, we calculate the estimated pose \(\varvec{T}_i\) error as

$$\begin{aligned} \delta _i = \frac{1}{|V_\mathcal {M}|} \sum _{ x \in V_\mathcal {M}} \left\| \varvec{T}_i \mathbf {x} - \hat{\varvec{T}}_i \mathbf {x} \right\| \,\text {,} \end{aligned}$$

where \(V_\mathcal {M}\) is the set of 3D model vertices. We consider the object pose within the frame successfully detected if \(\delta _i\) is less than \(kd\), where d is the diameter of the 3D model and k is a given error coefficient.

To compare the efficiency of different methods, we create a curve where each point is defined as the percentage of frames in which the object pose was successfully determined for a varying k. A more efficient object pose tracking method corresponds to a greater AUC (area under curve) value. In our experiments, k varies from 0 to 0.2; therefore, the AUC varies from 0 to 20.
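The metric and the AUC computation can be sketched in a few lines; the helper names and the rectangle-rule approximation of the area are ours:

```python
import numpy as np

def pose_error(T_est, T_true, vertices):
    """delta_i: mean distance between model vertices transformed by the
    estimated and the true 4x4 pose matrices."""
    vh = np.hstack([vertices, np.ones((len(vertices), 1))])
    diff = (vh @ T_est.T)[:, :3] - (vh @ T_true.T)[:, :3]
    return np.linalg.norm(diff, axis=1).mean()

def auc(errors, diameter, k_max=0.2, steps=1000):
    """Approximate area under the success-rate curve for k in [0, k_max];
    with success in percent, the value ranges from 0 to 100 * k_max = 20."""
    errors = np.asarray(errors)
    ks = np.linspace(0.0, k_max, steps)
    success = np.array([(errors < k * diameter).mean() * 100.0 for k in ks])
    return success.mean() * k_max

# Toy check: if every frame's error is just below 0.1 * d, success is 0%
# for k < 0.1 and 100% afterwards, so the AUC is close to 10 out of 20
errs = np.full(50, 0.1 - 1e-9)
a = auc(errs, diameter=1.0)
```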

5.2 Effectiveness of the Contour Energy Optimization

Let us show with an example how the contour energy optimization improves the results of the pure Kanade–Lucas–Tomasi tracker; see Fig. 4. When the object poses obtained with the help of the KLT tracker are not refined through the contour energy optimization, the object gradually descends onto the background. In the final frames, the discrepancy with the actual object pose becomes significant. This behavior is due to frame-by-frame error accumulation. The refinement step using the contour energy optimization eliminates these errors and keeps the object close to its actual pose in all frames.

Experiments conducted on the OPT dataset confirm a significant increase in tracking efficiency due to the contour energy optimization: across all tests, the AUC value increased by \(18\%\), while in the group of tests with the maximum movement velocity the increase was \(27\%\).

Fig. 4.

An example of the tracking efficiency increase due to the contour energy optimization. The upper and lower rows show video frame fragments (frame numbers in the top right corners) with the 3D model (black wireframe) projected in the pose estimated with the pure KLT tracker and with our method, respectively.

5.3 Comparison with State-of-the-Art Methods

We have compared our approach with the following state-of-the-art methods: GOS [29], PWP3D [20], and ORB-SLAM2 [18]. GOS is an edge-based method that improves on the approach from [21]; it takes into account the color distribution around the edges and contour coherence. PWP3D tracks an object by segmenting the frame into object and background using color distribution statistics. ORB-SLAM2 is a state-of-the-art simultaneous localization and mapping approach based on keypoint detection; the authors of [30] proposed modifications that allowed applying this method to 3D model tracking. Tracking results for PWP3D and ORB-SLAM2 applied to the OPT dataset are cited from [30]. Tracking results for GOS were obtained by testing the open source implementation. During testing, all methods were initialized with the true object pose in the first frame of each video.

Figure 5 demonstrates the results for some of the test groups in OPT. Tables 1 and 2 contain the complete detailed testing results. Overall, the tests show a noticeable disadvantage of the other methods as compared to ours: GOS by 35.6%, PWP3D by 48.9%, and ORB-SLAM2 by 9.1% (all values calculated with respect to the maximum possible value).

Table 3 contains the average frame processing times of the tested methods. ORB-SLAM2 and PWP3D are positioned as real-time methods and show the best run time performance. At the same time, our method and GOS are not suitable for real-time applications, and our method is 3.4 times slower than GOS on average. In fact, the run time of our method significantly depends on the size of the object model, as shown in Table 4.

More detailed results of experiments can be found in the supplementary material.

Fig. 5.

Comparison of method efficiency on various test groups in OPT

Table 1. Comparison of AUC in tests with different objects
Table 2. Comparison of AUC in tests with different tracking conditions
Table 3. Comparison of average frame processing time (ms)
Table 4. Dependency of our method average frame processing time on object size

5.4 Discussion and Future Work

The testing results provided in Tables 1 and 2 and in Fig. 5 show that our method demonstrates good results under various tracking conditions.

Tests under moving light and flashing light show high performance despite the fact that the KLT tracker lacks robustness against dramatic lighting changes. It is also worth noting that our method processes object movement along the camera axis (‘z translation’ tests) significantly better than the other methods. Edge-based methods are very sensitive to motion blur. Nevertheless, the results of the ‘Fast Motion’ tests demonstrate that our method handles motion blur and fast movement more efficiently than the other methods.

Symmetrical objects in different poses may have identical contours, which leads to ambiguity during contour-based object pose estimation. By limiting the pose search area, we mostly overcome this issue. The test results for a symmetrical object, Soda, confirm this finding: our method successfully tracks this object, while GOS, another edge-based method, shows low efficiency.

A disadvantage of our method is the negative impact that a lack of model accuracy has on tracking quality. This is evident in Table 1: the tracking results for the Bike object are noticeably lower than those for the other objects. Our method is also not devoid of jitter, a drawback typical of most edge-based approaches. To improve the efficiency of the method in the above-mentioned cases, we plan to use edges of the inner texture of the object in the future.

A major limitation of our method is its low run time performance (see Table 3). Moreover, the frame processing time significantly depends on the object model size, as shown in Table 4. This is because the most time-consuming part of our method is the contour energy optimization, in which model contour identification is the most frequently repeated operation. These calculations could be accelerated using a GPU.

6 Conclusions

The present article introduced a method for model-based 3D tracking based on a combination of a model contour energy and keypoints. The Kanade–Lucas–Tomasi tracker is used for preliminary pose estimation. Then the pose is refined by contour energy optimization using the Basin-Hopping stochastic algorithm. To improve the optimization, we set search area constraints based on the keypoints’ average reprojection error. Such constraints fix well-tracked parts of the object while allowing movement of the parts that the KLT tracker failed to position correctly. The results of experiments on a challenging benchmark dataset show that combining edge-based and keypoint-based approaches can diminish the typical disadvantages of both. Contour energy optimization effectively copes with motion blur and poorly textured surfaces, while keypoints help to correctly process symmetrical objects. We demonstrated the efficiency of our approach by testing it against state-of-the-art methods on a public benchmark dataset.