1 Introduction

Tracking a rigid object in 3D space and predicting its six degrees of freedom (6DoF) pose is an essential task in computer vision. Its application ranges from augmented reality, where the location of objects is needed to superimpose digital information, to robotics, where the object pose is required for robust manipulation in unstructured environments. Given consecutive image frames, the goal of 3D object tracking is to estimate both the rotation and translation of a known object relative to the camera. In contrast to object detection, tracking continuously provides information, which, for example, allows robots to react to unpredicted changes in the environment using visual servoing. While the problem has been thoroughly studied, many challenges such as partial occlusions, appearance changes, motion blur, background clutter, and real-time requirements still exist. In this section, we first provide an overview of common techniques. This is followed by a survey of related work on region-based methods. Finally, we introduce our approach and summarize the contributions to the current state of the art.

1.1 3D Object Tracking

In the past, many approaches to 3D object tracking have been proposed. Based on surveys (Lepetit and Fua 2005; Yilmaz et al. 2006), as well as on recent developments, techniques can be differentiated by their use of key-points, explicit edges, direct optimization, deep learning, depth information, and image regions. Key-point features such as SIFT (Lowe 2004), ORB (Rublee et al. 2011), or BRISK (Leutenegger et al. 2011) have been widely used for 3D object tracking (Wagner et al. 2010; Vacchetti et al. 2004), with more recent developments like LIFT (Yi et al. 2016) and SuperGlue (Sarlin et al. 2020) introducing deep learning at various stages. Explicit edges provide an additional source of information that is used by many approaches (Huang et al. 2020; Bugaev et al. 2018; Seo et al. 2014; Comport et al. 2006; Drummond and Cipolla 2002; Harris and Stennett 1990). Also, direct methods (Engel et al. 2018; Seo and Wuest 2016; Crivellaro and Lepetit 2014), which optimize a photometric error and can be traced back to Lucas and Kanade (1981), have been proposed. While all three classes of techniques have valid applications, unfortunately, they also have significant drawbacks. First, approaches based on key-points and direct optimization require rich texture, limiting the range of suitable objects. At the same time, edge-based methods, which perform better for low-textured objects, often fail in cluttered scenes. Finally, motion blur changes the appearance of both texture and edges, leading to additional problems.

To overcome these issues and learn directly from data, deep-learning-based approaches that use convolutional neural networks (CNNs) to consider full image information have recently been proposed. While they achieve good results, only a few approaches (Wen et al. 2020) run in real-time, with most methods (Deng et al. 2021; Wang et al. 2019; Li et al. 2018; Xiang et al. 2018; Garon and Lalonde 2017) reporting less than 30 frames per second. Moreover, even the most efficient algorithms require significant resources from high-end GPUs. In addition, typical disadvantages include time-consuming training and the requirement for a textured 3D model. Another relatively new development is the availability of affordable depth cameras that measure the surface distance for each pixel. While purely depth-based object tracking is possible, most methods (Ren et al. 2017; Kehl et al. 2017; Tan et al. 2017; Krull et al. 2015; Krainin et al. 2011) combine information from both depth and RGB cameras. In general, this leads to superior results. Unfortunately, in many applications, using an additional depth sensor is not an option. Also, note that such algorithms require high-quality depth images, which, depending on hardware, surface distances, surface characteristics, and lighting conditions, can be hard to obtain.

Because of the discussed shortcomings, region-based techniques (Stoiber et al. 2020; Zhong et al. 2020b; Tjaden et al. 2018; Prisacariu and Reid 2012) have become increasingly popular. The big advantage of such methods is that they are able to reliably track a wide variety of objects in cluttered scenes, using only a monocular RGB camera and a texture-less 3D model of the object. The main assumption is thereby that objects are distinguishable from the background. As a consequence, no object texture is needed. While past approaches were computationally expensive, our sparse formulation overcomes this disadvantage. Finally, based on our experience, region-based methods are robust to motion blur, making it possible to track fast-moving objects. Because of these excellent properties, the following work focuses on region-based techniques.

Fig. 1

Tracking of a marker pen in the real world. The image on the left shows a rendered overlay of the object model for the initial pose. The estimated pose after the optimization is visualized in the image on the right. The three illustrations in the middle show yellow correspondence lines for different scales s. High probabilities for the contour location are illustrated in red. Pixel-wise posterior probabilities that describe the probability that a pixel belongs to the background are encoded in grayscale images. Note that during tracking, pixel-wise posteriors are only calculated along correspondence lines (Color figure online)

1.2 Related Work

Region-based methods use image statistics to differentiate between a foreground region that corresponds to the object and a background region. Typically, color statistics are used to model the membership of each pixel. Based on the two regions, the goal is to find the object pose and corresponding silhouette that best explains the segmentation of the image. The great potential of this technique was already demonstrated by early approaches that allowed robust tracking in many challenging scenarios (Schmaltz et al. 2012; Brox et al. 2010; Rosenhahn et al. 2007). Segmentation and pose tracking were thereby treated as independent problems, with an initial step to extract the contour and a subsequent optimization to find the pose. Dambreville et al. (2008) later combined the two processes in a single energy function, leading to improved tracking robustness. Building on this approach and including the pixel-wise posterior membership of Bibby and Reid (2008), Prisacariu and Reid (2012) developed PWP3D, a real-time-capable algorithm that uses a level-set pose embedding. It is the foundation of almost all state-of-the-art region-based methods.

Based on PWP3D, multiple algorithms were suggested that incorporate additional information, extend the segmentation model, or improve efficiency. For the combination of both depth- and region-based information, Kehl et al. (2017) extended the energy function of PWP3D with a term that is based on the Iterative Closest Point (ICP) algorithm. In a different approach, Ren et al. (2017) tightly coupled region and depth information in a probabilistic formulation that uses 3D signed distance functions. Recently, object texture was considered using direct optimization of pixel intensity values (Liu et al. 2020; Zhong and Zhang 2019) or descriptor fields (Liu et al. 2021). Also, a combination with an edge-based technique that uses a contour-part model was introduced by Sun et al. (2021). Later, Li et al. (2021) developed adaptively weighted local bundles that combine region and edge information. To improve occlusion handling, Zhong et al. (2020a) suggested the use of learning-based object segmentation. Finally, the incorporation of measurements from a mobile phone’s inertial sensor was suggested by Prisacariu et al. (2015).

To improve segmentation, Zhao et al. (2014) extended the appearance model of PWP3D with a boundary term that considers spatial distribution regularities of pixels. Later, Hexner and Hagege (2016) proposed the use of local appearance models that were inspired by the localized contours of Lankton and Tannenbaum (2008). The idea was further improved by Tjaden et al. (2018) with the development of temporally consistent local color histograms. Finally, Zhong et al. (2020b) proposed a method that introduces polar-based region partitioning and edge-based occlusion detection.

For better efficiency, Zhao et al. (2014) suggested a particle-filter-like stochastic optimization that initializes a subsequent damped Newton method. Later, a hierarchical rendering approach that uses the Levenberg-Marquardt algorithm was developed by Prisacariu et al. (2015). Also, Tjaden et al. (2018) proposed the use of a Gauss-Newton method to improve convergence. In addition to optimization, another idea towards better efficiency is the use of simplified signed distance functions (Liu et al. 2020). A different approach by Kehl et al. (2017) suggested the use of precomputed contour points to represent the object’s 3D geometry and calculate the energy function sparsely along rays. Finally, in our previous work (Stoiber et al. 2020), we improved on this idea and developed a sparse approach that is based on correspondence lines, making our algorithm significantly more efficient than the previous state of the art while achieving better tracking results.

1.3 Contribution

Starting from the ideas presented in the previous section, we focus on the development of SRT3D, a highly efficient, sparse approach to region-based tracking. To keep complexity at a minimum, we only use region information and, like PWP3D, adopt a global segmentation model. For our formulation, we build on our previous method and consider image information sparsely along correspondence lines. Also, Newton optimization with Tikhonov regularization is used to estimate the object pose. An illustration of the tracking process with converging correspondence lines at different scales is given in Fig. 1. While the formulation is similar to our previous method (Stoiber et al. 2020), our main motivation is to advance the approach and the current state of the art using improved uncertainty modeling and better optimization techniques. In addition, we provide a more formal derivation and analysis of the highly efficient correspondence line model. In detail, the main contributions of this work are as follows:

  • A formal definition of correspondence lines and a thorough mathematical derivation of the probabilistic model that describes the contour location.

  • Novel smoothed step functions that allow the modeling of both local and global uncertainty.

  • A detailed theoretical analysis that shows how different parameter settings affect the characteristics of posterior probability distributions.

  • Global and local optimization strategies and a new approximation for the local first-order derivative.

In the remainder, we first provide a detailed derivation of the correspondence line model. This is followed by the development of a 3D tracking approach that combines the correspondence line model with a sparse representation of the 3D object geometry. Subsequently, implementation details for the resulting algorithm are discussed. Finally, we conduct a thorough evaluation on the RBOT and OPT datasets, showing that our approach outperforms the current state of the art by a considerable margin in terms of both efficiency and quality.

2 Correspondence Line Model

In this section, we first provide a formal mathematical definition of correspondence lines. This is followed by a probabilistic model that considers the segmentation of a correspondence line into foreground and background. To improve computational efficiency, we extend this model and provide a discrete scale-space formulation. Finally, we introduce novel smoothed step functions and discuss how their configuration affects the contour location’s posterior probability.

2.1 Correspondence Lines

In contrast to most state-of-the-art algorithms, we do not consider image information densely over the entire image. Instead, inspired by RAPID (Harris and Stennett 1990), pixel values are processed sparsely along correspondence lines (Stoiber et al. 2020). The name correspondence line is motivated by the term correspondence point used in ICP (Besl and McKay 1992). Similar to ICP, correspondences are first defined and the optimization with respect to them is then conducted in a second step. While for ICP, individual 3D points are used as data, multiple pixel values along a line are considered in this case. A visualization of a single correspondence line is shown in Fig. 2.

Fig. 2

Correspondence line defined by a center \(\varvec{c}\) and a normal vector \(\varvec{n}\). The illustration shows pixels along the correspondence line as well as the foreground region \(\omega _\text {f}\) in yellow and the background region \(\omega _\text {b}\) in blue. The contour distance d points from the correspondence line center to an estimated contour, indicated by a dashed line (Color figure online)

Starting from our earlier work (Stoiber et al. 2020) and inspired by the commonly used definition of images as \(\varvec{I} :\varvec{\varOmega } \rightarrow \{0, \dots , 255\}^3\), we formally denote a correspondence line as a map \(\varvec{l}:\omega \rightarrow \{0, \dots , 255\}^3\). In this notation \(\varvec{\varOmega } \subset {\mathbb {R}}^2\) describes the image domain while \(\omega \subset {\mathbb {R}}\) is considered the correspondence line domain. Image values \(\varvec{y}\), which are typically accessed using the image coordinate \(\varvec{x} = \begin{bmatrix} x&y\end{bmatrix}^\top \) and the image function \(\varvec{y} = \varvec{I}(\varvec{x})\), are described using the line coordinate r and the correspondence line function \(\varvec{y} = \varvec{l}(r)\). Correspondence lines are located in the image and remain fixed once they have been established. The location and orientation of each correspondence line is defined by a center \(\varvec{c} = \begin{bmatrix} c_{x}&c_{y}\end{bmatrix}^\top \in {\mathbb {R}}^2\) in image coordinates and a normal vector \(\varvec{n} = \begin{bmatrix} n_{x}&n_{y}\end{bmatrix}^\top \in {\mathbb {R}}^2\), with \(\Vert \varvec{n}\Vert _2 = 1\). Using this definition, the relation between an image \(\varvec{I}\) and a correspondence line \(\varvec{l}\) is expressed as follows

$$\begin{aligned} \varvec{l}(r) = \varvec{I}(\varvec{c} + r \varvec{n}), \end{aligned}$$
(1)

where image coordinates in \(\varvec{I}\) are rounded to the center of the next closest pixel.
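
To make the definition concrete, the following Python/NumPy sketch samples a correspondence line according to Eq. (1). The image array, line parameters, and the convention that pixel centers lie at integer coordinates are illustrative assumptions of the sketch and are not prescribed by the method.

    import numpy as np

    def correspondence_line(image, c, n, r_values):
        """Sample l(r) = I(c + r n) for the given line coordinates r (Eq. (1)).

        image: (H, W, 3) array of RGB values, c: line center, n: unit normal.
        Image coordinates are rounded to the center of the closest pixel; here,
        pixel centers are assumed to lie at integer coordinates.
        """
        values = []
        for r in r_values:
            x = c + r * n                        # continuous image coordinate
            col, row = np.rint(x).astype(int)    # round to the closest pixel center
            values.append(image[row, col])
        return np.array(values)

    # Hypothetical example: a short line through the center of an empty image.
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    line = correspondence_line(image, np.array([320.0, 240.0]),
                               np.array([0.6, 0.8]), np.arange(-8.0, 9.0))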

2.2 Probabilistic Model

Inspired by the generative model of Bibby and Reid (2008), we derive a probabilistic model for the segmentation of a correspondence line into a foreground region \(\omega _\text {f}\) and a background region \(\omega _\text {b}\). Note that this is the 1D equivalent of the segmentation of a 2D image into the regions \(\varOmega _\text {f}\) and \(\varOmega _\text {b}\). We assume that there is only one transition between foreground and background. The location of this transition relative to the line center \(\varvec{c}\) is described by the contour distance \(d \in {\mathbb {R}}\). A visualization of the contour distance is shown in Fig. 2.

To derive the probabilistic model, we first describe the formation process for a single pixel on the correspondence line. The joint probability distribution can thereby be written as follows

$$\begin{aligned} p(r, \varvec{y}, d, m) = p(r\mid d, m) p(\varvec{y} \mid m) p(m) p(d), \end{aligned}$$
(2)

where \(m \in \{m_\text {f}, m_\text {b}\}\) is the model parameter that can denote either foreground or background. If we condition this distribution on the image value \(\varvec{y}\), we obtain

$$\begin{aligned} p(r, d, m \mid \varvec{y}) = p(r\mid d, m) p(m \mid \varvec{y}) p(d). \end{aligned}$$
(3)

Following Bibby and Reid (2008), we use Bayes’ theorem and the marginalization over m to calculate the pixel-wise posterior probability

$$\begin{aligned} p(m_i \mid \varvec{y}) = \frac{p(\varvec{y}\mid m_i) p(m_i)}{\sum _{j\in \{\text {f},\text {b}\}}p(\varvec{y} \mid m_j)p(m_j)} , \quad i\in \{\text {f}, \text {b}\}, \end{aligned}$$
(4)

where \(p(\varvec{y} \mid m_\text {f})\) and \(p(\varvec{y} \mid m_\text {b})\) are probability distributions that describe how likely it is that a specific color value is part of the foreground region or the background region, respectively. The two distributions can be estimated by calculating two color histograms, one over the foreground region and one over the background region. A detailed explanation of their computation is given in Sect. 4.2. Using the knowledge that foreground and background are equally likely along the correspondence line, i.e. \( p(m_\text {f}) =p(m_\text {b})\), we obtain

$$\begin{aligned} p(m_i \mid \varvec{y}) = \frac{p(\varvec{y}\mid m_i)}{p(\varvec{y} \mid m_\text {f}) + p(\varvec{y} \mid m_\text {b})} , \quad i\in \{\text {f}, \text {b}\}. \end{aligned}$$
(5)

Finally, based on Eq. (3), we are able to marginalize over m and condition on r to express the posterior probability for the contour distance d as

$$\begin{aligned} p(d\mid r, \varvec{y}) = \frac{1}{p(r)} \sum _{i \in \{\text {f}, \text {b}\}}p(r \mid d, m_i) p(m_i \mid \varvec{y}) p(d). \end{aligned}$$
(6)

To calculate the posterior probability over the entire correspondence line domain \(\omega \), we assume pixel-wise independence and, based on Eq. (6), write

$$\begin{aligned} p(d\mid \omega , \varvec{l}) \propto \prod _{r\in \omega }\sum _{i \in \{\text {f}, \text {b}\}}p(r \mid d, m_i) p(m_i \mid \varvec{l}(r)). \end{aligned}$$
(7)

Note that p(r) and p(d) are considered to be uniform and constant and are thus dropped. Also, while pixel-wise independence does not hold in general, it is a well-established approximation that allows us to avoid ill-defined assumptions for spatial regularities and is close enough to reality to yield good results. The conditional line coordinate probability \(p(r\mid d,m)\) will be discussed in Sect. 2.4. Similar to the probabilistic model of Bibby and Reid (2008), which describes the probability of a shape kernel given information from an image, Eq. (7) provides the probability of the contour distance d given data from a correspondence line.
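
As an illustration of Eq. (5), the following Python/NumPy sketch evaluates pixel-wise posteriors from two normalized RGB histograms. The histogram layout with 32 bins per channel follows Sect. 4.2, while the function names and the fallback for colors observed in neither histogram are assumptions of this sketch.

    import numpy as np

    def pixel_wise_posterior(y, hist_f, hist_b, n_bins=32):
        """Evaluate the pixel-wise posteriors p(m_f | y) and p(m_b | y) of Eq. (5).

        y: RGB value in {0, ..., 255}^3.
        hist_f, hist_b: normalized (n_bins, n_bins, n_bins) color histograms that
        approximate p(y | m_f) and p(y | m_b).
        """
        idx = tuple(int(v) * n_bins // 256 for v in y)   # histogram bin of the color
        p_yf, p_yb = hist_f[idx], hist_b[idx]
        if p_yf + p_yb == 0.0:       # color observed in neither histogram
            return 0.5, 0.5
        return p_yf / (p_yf + p_yb), p_yb / (p_yf + p_yb)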

2.3 Discrete Scale-Space Formulation

Estimating the distribution of posterior probabilities is computationally expensive since, for each distance d, the product in Eq. (7) has to be computed over the entire domain \(\omega \). This results in quadratic complexity for the calculation of the entire distribution. In contrast, pixel-wise posterior probabilities \(p(m\mid \varvec{y})\) are used in the posterior probability calculation of multiple distances d, leading to linear complexity. Shifting computation from the calculation of the distribution to the calculation of pixel-wise posterior probabilities thus allows us to improve computational efficiency. Also, it is advantageous to normalize correspondence lines in a way that ensures that a line coordinate pointing to a segment center for one correspondence line points to a segment center for all correspondence lines. This uniformity can be used in the precalculation of smoothed step function values to further improve efficiency.

Fig. 3

Example of the relation between the unscaled space r along the correspondence line and the scale-space \(r_\text {s}\). Neighboring pixels that are combined into segments are visualized by the same color in blue or yellow. Blue and yellow dots indicate the center of each segment and the corresponding discretized value in the scale-space. An example of the contour distance is illustrated in red. The offset \(\varDelta r\) is chosen in a way that ensures that discretized values in the scale-space are the same for all correspondence lines. In this example, \(\varDelta r\) points to the closest edge between pixels (Color figure online)

In the following, we thus adopt the discrete scale-space formulation from our previous method (Stoiber et al. 2020) to combine multiple pixels into segments. In addition, the formulation projects from the continuous space along the correspondence line into a discrete space that is independent of a correspondence line’s location and orientation. An illustration of this transformation is shown in Fig. 3. Both line coordinates and contour distances are projected as follows

$$\begin{aligned} r_\text {s} = (r - \varDelta r) \frac{\bar{n}}{s}, \end{aligned}$$
(8)
$$\begin{aligned} d_\text {s} = (d - \varDelta r) \frac{\bar{n}}{s}, \end{aligned}$$
(9)

with \(s \in {\mathbb {N}}^+\) the scale that describes the number of pixels combined into a segment, \(\bar{n} = \max (|n_{x}|,|n_{y}|)\) the major absolute normal component that projects a correspondence line to the closest horizontal or vertical image coordinate, and \(\varDelta r \in {\mathbb {R}}\) the offset from the correspondence line center \(\varvec{c}\) to a defined pixel location.

Based on Eq. (7), the posterior probability in the discrete scale-space is calculated as

$$\begin{aligned} p\left( d_\text {s}\mid \omega _\text {s}, \varvec{l}_\text {s}\right) \propto \prod _{r_\text {s} \in \omega _\text {s}}\sum _{i \in \{\text {f}, \text {b}\}}p\left( r_\text {s} \mid d_\text {s}, m_i\right) p\left( m_i \mid \varvec{l}_\text {s}\left( r_\text {s}\right) \right) , \end{aligned}$$
(10)

where \(\omega _\text {s}\) is the scaled correspondence line domain and \(\varvec{s} = \varvec{l}_\text {s}(r_\text {s})\) a set-valued function that maps from the scaled line coordinate \(r_\text {s}\) to the segment \(\varvec{s}\), which is a set of the closest s pixel values \(\varvec{y}\). Similar to pixel-wise posteriors in Eq. (5) and assuming pixel-wise independence, segment-wise posteriors are defined as

$$\begin{aligned} p\left( m_i \mid \varvec{s}\right) = \frac{\prod \limits _{\varvec{y} \in \varvec{s}} p\left( \varvec{y}\mid m_i\right) }{\prod \limits _{\varvec{y} \in \varvec{s}} p\left( \varvec{y} \mid m_\text {f}\right) + \prod \limits _{\varvec{y} \in \varvec{s}} p\left( \varvec{y} \mid m_\text {b}\right) } , \quad i\in \left\{ \text {f}, \text {b}\right\} . \end{aligned}$$
(11)

The derived formulation allows us to efficiently cover the correspondence line domain \(\omega \), using the scale parameter s to set the segment size and to adjust between accuracy and efficiency. In the following, we will again drop the index \(\text {s}\) for all variables to simplify the notation. Note, however, that all definitions and derivations are valid both for the original space and for the discrete scale-space formulation.
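
A minimal Python/NumPy sketch of the scale-space projection of Eq. (8) and the segment-wise posteriors of Eq. (11) could look as follows. Variable names, the guard against vanishing products, and the assumption that the line length is a multiple of the scale s are choices of this sketch, not part of the formulation.

    import numpy as np

    def to_scale_space(r, delta_r, n, s):
        """Project line coordinates r into the discrete scale-space (Eq. (8))."""
        n_bar = max(abs(n[0]), abs(n[1]))      # major absolute normal component
        return (r - delta_r) * n_bar / s

    def segment_wise_posteriors(p_yf, p_yb, s):
        """Segment-wise posteriors p(m | s) according to Eq. (11).

        p_yf, p_yb: per-pixel color likelihoods p(y | m_f) and p(y | m_b) along
        the correspondence line; their length is assumed to be a multiple of s.
        """
        prod_f = np.prod(np.reshape(p_yf, (-1, s)), axis=1)   # product per segment
        prod_b = np.prod(np.reshape(p_yb, (-1, s)), axis=1)
        denom = np.maximum(prod_f + prod_b, 1e-30)
        return prod_f / denom, prod_b / denom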

2.4 Smoothed Step Functions

To model the conditional probabilities of the line coordinate \(p(r\mid d,m_\text {f})\) and \(p(r\mid d,m_\text {b})\), different smoothed step functions \(h_\text {f}\) and \(h_\text {b}\) have been used. While most state-of-the-art algorithms (Zhong et al. 2020b; Tjaden et al. 2018) use a function based on the arctangent, we previously proved that a hyperbolic tangent results in a Gaussian distribution for the posterior probability \(p(d\mid \omega , \varvec{l})\) (Stoiber et al. 2020). In both functions, the smoothed slope is used to model a local uncertainty with respect to the exact location of the foreground and background transition. Considering the plots of the two models in Fig. 4, one notices that the functions quickly converge towards either zero or one for increasing absolute values of \(x = r - d\). Except for a small area around zero, both models thus assume that, given the model m and the contour distance d, one knows perfectly on which side of the contour the line coordinate r lies. In the following, we will argue that for real-world applications, this assumption is wrong.

Fig. 4

Smoothed step functions \(h_\text {f}\) and \(h_\text {b}\) that model the conditional line coordinate probabilities \(p(r\mid d,m_\text {f})\) and \(p(r\mid d,m_\text {b})\). The functions \(h_\text {f}(x) = \frac{1}{2} - \frac{1}{\pi }\tan ^{-1}\big (\frac{x}{s_\text {h}}\big )\) and \(h_\text {b}(x) = \frac{1}{2} + \frac{1}{\pi }\tan ^{-1}\big (\frac{x}{s_\text {h}}\big )\) used by Zhong et al. (2020b) and Tjaden et al. (2018) are illustrated by dash-dotted gray lines. The definitions \(h_\text {f}(x) = \frac{1}{2} - \frac{1}{2}\tanh \big (\frac{x}{2s_\text {h}}\big )\) and \(h_\text {b}(x) = \frac{1}{2} + \frac{1}{2}\tanh \big (\frac{x}{2s_\text {h}}\big )\) from our previous work (Stoiber et al. 2020) are shown as dashed yellow lines. In both plots, the slope parameter \(s_\text {h}=1\) is used. For the proposed functions in Eqs. (12) and (13), solid red lines correspond to \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} = 1\), while dotted blue lines show the functions for \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} \rightarrow 0\). In addition to visualizing the definitions from our previous work, the dashed yellow lines illustrate the proposed functions for \(\alpha _\text {h} = \frac{1}{2}\) and \(s_\text {h}=1\) (Color figure online)

While the pixel-wise posterior probability in Eq. (5) provides very good predictions for the model m, it is still an imperfect simplification of the real world. Typical effects that are not considered by the statistical model are image noise or fast appearance changes that can lead to pixel colors that are not yet present in the color histograms. Another effect originates from pixels that are wrongly classified due to imperfect segmentation and that are then assigned to the wrong color histograms. Finally, there also remains the question of whether a statistical model that purely relies on pixel colors is sufficient to capture all the statistical effects in the real world and is able to perfectly predict the model m.

To take those limitations into account and consider a constant, global uncertainty in the foreground and background model, we extend the formulation from our previous work (Stoiber et al. 2020) and propose the following functions

$$\begin{aligned} h_\text {f}(x) = \frac{1}{2} - \alpha _\text {h}\tanh \bigg (\frac{x}{2s_\text {h}}\bigg ), \end{aligned}$$
(12)
$$\begin{aligned} h_\text {b}(x) = \frac{1}{2} + \alpha _\text {h}\tanh \bigg (\frac{x}{2s_\text {h}}\bigg ). \end{aligned}$$
(13)

Note that the amplitude parameter \(\alpha _\text {h} \in [0,0.5]\) was added to the original definitions that only considered the slope parameter \(s_\text {h} \in {\mathbb {R}}^+\). For \(\alpha _\text {h} = \frac{1}{2}\) the equations are equal to our previous formulation. Examples of the proposed functions are shown in Fig. 4.

In addition to viewing \(\alpha _\text {h}\) as a simple amplitude parameter, we are able to demonstrate that there is also another interpretation. For this, we assume that the model m is extended with a third class \(m_\text {n}\) that considers external effects that are independent of the foreground and background model \(m_\text {f}\) and \(m_\text {b}\). For this scenario, we can show that \(p(m_\text {f}) = p(m_\text {b}) = \alpha _\text {h}\) and that \(p(m_\text {n}) = 1 - 2\alpha _\text {h}\). Following this interpretation, the amplitude parameter thus allows us to set the probability that a pixel’s color is generated by the foreground or background model in contrast to some other effect that is considered as noise. This again shows that the amplitude parameter \(\alpha _\text {h}\) is able to model a constant, global uncertainty. Note that in this scenario, the original smoothed step functions that converge to zero or one are used for \(p(r\mid d,m_\text {f})\) and \(p(r\mid d,m_\text {b})\) and a constant function \(p(r\mid d,m_\text {n}) = \frac{1}{2}\) is adopted for the noise model. A detailed derivation of this extended model and a proof of its equivalence to the use of the functions in Eqs. (12) and (13) is given in Appendix A.

2.5 Posterior Probability Distribution

Given the smoothed step functions \(h_\text {f}\) and \(h_\text {b}\) that model the conditional line coordinate probabilities \(p(r\mid d,m_\text {f})\) and \(p(r\mid d,m_\text {b})\), the final expression of the posterior probability distribution from Eq. (7) can be written as

$$\begin{aligned} p(d\mid \omega ,\varvec{l}) \propto \prod _{r\in \omega }h_\text {f}(r-d)p_\text {f}(r) + h_\text {b}(r-d)p_\text {b}(r), \end{aligned}$$
(14)

with the abbreviations \(p_\text {f}(r) = p(m_\text {f}\mid \varvec{l}(r))\) and \(p_\text {b}(r) = p(m_\text {b}\mid \varvec{l}(r))\). In the following, we provide a detailed analysis to understand how the slope parameter \(s_\text {h}\) and the amplitude parameter \(\alpha _\text {h}\) affect this distribution. We thereby assume a contour at the correspondence line center and step functions for the pixel-wise posteriors \(p_\text {f}\) and \(p_\text {b}\). Note that the assumption of step functions corresponds well with real-world experiments that show that, in most cases, there is a distinct split between foreground and background.
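
The following Python/NumPy sketch evaluates Eq. (14) numerically for the idealized setting described above, using the smoothed step functions of Eqs. (12) and (13). The discretization and parameter values are illustrative and follow the half-integer contour distances used later in Sect. 4.3.

    import numpy as np

    def h_f(x, alpha_h, s_h):
        """Smoothed step function of Eq. (12)."""
        return 0.5 - alpha_h * np.tanh(x / (2.0 * s_h))

    def h_b(x, alpha_h, s_h):
        """Smoothed step function of Eq. (13)."""
        return 0.5 + alpha_h * np.tanh(x / (2.0 * s_h))

    def posterior_distribution(r, p_f, p_b, d_values, alpha_h=1.0 / 3.0, s_h=1.0):
        """Numerically evaluate and normalize the posterior of Eq. (14)."""
        posterior = np.array([
            np.prod(h_f(r - d, alpha_h, s_h) * p_f + h_b(r - d, alpha_h, s_h) * p_b)
            for d in d_values])
        return posterior / posterior.sum()

    # Idealized example: step-function pixel-wise posteriors with the contour
    # between r = -1 and r = 0, evaluated at discretized contour distances.
    r = np.arange(-5.0, 6.0)                   # integer line coordinates
    p_f = (r < 0.0).astype(float)              # pixel-wise foreground posteriors
    p_b = 1.0 - p_f
    d_values = np.arange(-5.5, 6.0)            # half-integer contour distances
    print(posterior_distribution(r, p_f, p_b, d_values))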

For the analysis, we start with the calculation of the first-order derivative of the log-posterior with respect to the contour distance d. The derivation is conducted similar to our previous work (Stoiber et al. 2020) and assumes continuous functions with infinitesimally small pixels. Based on a detailed derivation given in Appendix B, the closed-form solution is written as

$$\begin{aligned} \frac{\partial \ln \big (p(d\mid \omega ,\varvec{l})\big )}{\partial d} = -2\tanh ^{-1}\bigg (2 \alpha _\text {h} \tanh \bigg (\frac{d}{2s_\text {h}}\bigg )\bigg ). \end{aligned}$$
(15)

A visualization of this function for different slope and amplitude parameters \(\alpha _\text {h}\) and \(s_\text {h}\) is given in Fig. 5. The plot shows that the amplitude parameter \(\alpha _\text {h}\) controls not only the amplitude of \(h_\text {f}\) and \(h_\text {b}\) but also the amplitude of the first-order derivative. For \(\alpha _\text {h} = \frac{1}{2}\), the first-order derivative converges to a linear function. At the same time, the parameter \(s_\text {h}\) affects both the slope of \(h_\text {f}\) and \(h_\text {b}\) and the slope of the first-order derivative. For \(s_\text {h} \rightarrow 0\) it leads to a perfect step function.

Fig. 5

First-order derivatives of the log-posterior with respect to the contour distance d for different slope and amplitude parameters \(s_\text {h}\) and \(\alpha _\text {h}\). The solid red line shows the derivative for \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} = 1\), which yields a function with a smooth transition from an upper bound to a lower bound. The dashed yellow line shows the function for \(\alpha _\text {h} = \frac{1}{2}\) and \(s_\text {h} = 1\). This produces a linear first-order derivative. Finally, using \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} \rightarrow 0\) results in a perfect step function illustrated by the dotted blue line (Color figure online)

For the two edge cases with \(\alpha _\text {h} = \frac{1}{2}\) and \(s_\text {h} \rightarrow 0\), Eq. (15) can be simplified, and we are able to calculate a closed-form solution for the posterior probability distribution. In the case of \(\alpha _\text {h} = \frac{1}{2}\), for which we obtain the smoothed step functions of our previous approach (Stoiber et al. 2020), the posterior probability distribution results in a perfect Gaussian

$$\begin{aligned} p(d\mid \omega ,\varvec{l}) = \frac{1}{\sqrt{2\pi s_\text {h}}}\exp \bigg (-\frac{d^2}{2s_\text {h}}\bigg ), \end{aligned}$$
(16)

where the slope parameter \(s_\text {h}\) is equal to the variance. In the case of \(s_\text {h} \rightarrow 0\), which leads to sharp step functions for \(h_\text {f}\) and \(h_\text {b}\), the posterior probability distribution becomes a perfect Laplace distribution

$$\begin{aligned} p(d\mid \omega ,\varvec{l}) = \frac{1}{2b}\exp \bigg (-\frac{|d|}{b}\bigg ),\quad b=\frac{1}{2\tanh ^{-1}(2\alpha _\text {h})}, \end{aligned}$$
(17)

where \(b \in {\mathbb {R}}^+\) is the scale parameter of the Laplace distribution that depends on \(\alpha _\text {h}\). A detailed derivation of the two functions is provided in Appendix C. Examples for both distributions, as well as a mixed posterior distribution with \(s_\text {h}=1\) and \(\alpha _\text {h}=\frac{1}{3}\), are visualized in Fig. 6. The plot shows that while the Laplace distribution has a pronounced peak, the Gaussian distribution has a smoothed maximum for which nearby values have similarly high probabilities. This coincides with our intuition that the slope parameter \(s_\text {h}\) controls local uncertainty, allowing multiple values d to be almost equally likely. At the same time, the amplitude parameter \(\alpha _\text {h}\) controls the size of the peak compared to its surroundings, thereby controlling global uncertainty. Combining the two parameters in a mixed distribution results in a function that is able to consider both local and global uncertainty simultaneously. Given the detailed knowledge about correspondence lines and the posterior probability distribution, we are now able to develop SRT3D, a highly efficient, sparse approach to region-based 3D object tracking.

Fig. 6

Posterior probability distributions for different slope and amplitude parameters \(s_\text {h}\) and \(\alpha _\text {h}\). The solid red line shows the function for \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} = 1\), which leads to a very flat distribution. Note that the function was computed numerically. Using \(\alpha _\text {h} = \frac{1}{2}\) and \(s_\text {h} = 1\) results in a Gaussian distribution shown by the dashed yellow line. The parameters \(\alpha _\text {h} = \frac{1}{3}\) and \(s_\text {h} \rightarrow 0\) yield a Laplace distribution for the posterior probability that is illustrated by a dotted blue line (Color figure online)

3 Region-Based 3D Tracking

In this section, we first define basic mathematical concepts. This is followed by the description of a sparse viewpoint model, which avoids the rendering of the 3D model during tracking. Combining this geometry representation with the correspondence line model developed in the previous section, we are able to formulate a joint posterior probability with respect to the pose. The probability is maximized using Newton optimization with Tikhonov regularization. Finally, we define the required gradient vector and Hessian matrix for the Newton method. We thereby differentiate between global and local optimization to ensure both fast convergence and high accuracy.

3.1 Preliminaries

In the following work, we define 3D model points as \(\varvec{X} = \begin{bmatrix} X&Y&Z\end{bmatrix}^\top \in {\mathbb {R}}^3\) and use the tilde notation to write the homogeneous form \(\varvec{\widetilde{X}} = \begin{bmatrix} X&Y&Z&1\end{bmatrix}^\top \). For the projection of a 3D model point \(\varvec{X}\) into the image space, we assume an undistorted image and use the pinhole camera model

$$\begin{aligned} \varvec{x} = \varvec{\pi }(\varvec{X}) = \begin{bmatrix} \frac{X}{Z}f_x + p_x\\ \frac{Y}{Z}f_y + p_y \end{bmatrix}, \end{aligned}$$
(18)

with \(f_x\) and \(f_y\) the focal lengths and \(p_x\) and \(p_y\) the principal point coordinates given in units of pixels. The inverse operation, which is the reconstruction of a 3D model point from an image coordinate \(\varvec{x}\) and corresponding depth value \(d_Z\) along the optical axis, can be written as

$$\begin{aligned} \varvec{X} = \varvec{\pi }^{-1}(\varvec{x}, d_Z) = d_Z \begin{bmatrix} \frac{x-p_x}{f_x}\\ \frac{y-p_y}{f_y}\\ 1 \end{bmatrix}. \end{aligned}$$
(19)
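
As a minimal sketch, the projection of Eq. (18) and the reconstruction of Eq. (19) can be written as follows in Python/NumPy; the intrinsic parameter values are hypothetical.

    import numpy as np

    # Hypothetical intrinsics of an undistorted pinhole camera.
    f_x, f_y, p_x, p_y = 600.0, 600.0, 320.0, 240.0

    def project(X):
        """Pinhole projection of Eq. (18): 3D point -> image coordinate."""
        return np.array([X[0] / X[2] * f_x + p_x,
                         X[1] / X[2] * f_y + p_y])

    def back_project(x, d_Z):
        """Reconstruction of Eq. (19): image coordinate and depth -> 3D point."""
        return d_Z * np.array([(x[0] - p_x) / f_x, (x[1] - p_y) / f_y, 1.0])

    X = np.array([0.1, -0.05, 0.8])
    assert np.allclose(back_project(project(X), X[2]), X)   # round-trip check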

To describe the relative pose between the model reference frame \(\text {M}\) and the camera reference frame \(\text {C}\), we use the homogeneous matrix \({}_\text {C}\varvec{T}_\text {M} \in \mathbb {SE}(3)\). For the transformation of a 3D model point, we can then write

$$\begin{aligned} _\text {C}\varvec{\widetilde{X}} = {}_\text {C}\varvec{T}_\text {M}{}_\text {M}\varvec{\widetilde{X}} = \begin{bmatrix} _\text {C}\varvec{R}_\text {M} &{} _\text {C}\varvec{t}_\text {M} \\ \varvec{0} &{} 1 \end{bmatrix} {}_\text {M}\varvec{\widetilde{X}}, \end{aligned}$$
(20)

where \(_\text {C}\varvec{\widetilde{X}}\) and \(_\text {M}\varvec{\widetilde{X}}\) are 3D model points written in the camera reference frame \(\text {C}\) and the model reference frame \(\text {M}\), respectively, and where \(_\text {C}\varvec{R}_\text {M} \in \mathbb {SO}(3)\) and \(_\text {C}\varvec{t}_\text {M} \in {\mathbb {R}}^3\) are the rotation matrix and the translation vector that define the transformation from \(\text {M}\) to \(\text {C}\). An illustration of the two reference frames and a homogeneous transformation matrix is given in Fig. 7.

Fig. 7

Illustration of a 2D rendering computed from a 3D mesh model. The model reference frame \(\text {M}\) is shown at the center of the object, while a camera reference frame \(\text {C}\) is shown at the right upper corner of the image. The transformation from the model to the camera reference frame that is described by \({}_\text {C}\varvec{T}_\text {M}\) is indicated by a dashed arrow. The contour of the rendered model is highlighted by a yellow line. Red points and blue arrows illustrate 2D contour points and approximated normal vectors (Color figure online)

For small variations, the angle-axis representation, which is a minimal representation, is used. With the exponential map, the rotation matrix writes as

$$\begin{aligned} \varvec{R} = \exp ([\varvec{r}]_\times ) = \varvec{I} + [\varvec{r}]_\times + \frac{1}{2!}[\varvec{r}]_\times ^2 + \frac{1}{3!}[\varvec{r}]_\times ^3 + ..., \end{aligned}$$
(21)

where \([\varvec{r}]_\times \) is the skew-symmetric matrix of \(\varvec{r} \in {\mathbb {R}}^3\). By neglecting higher-order terms of the series expansion, Eq. (21) can be linearized. We are then able to write the linear variation of a 3D model point in the camera reference frame \(\text {C}\) as

$$\begin{aligned} _\text {C}\varvec{\widetilde{X}}(\varvec{\theta }) = \begin{bmatrix} _\text {C}\varvec{R}_\text {M} &{} _\text {C}\varvec{t}_\text {M} \\ \varvec{0} &{} 1 \end{bmatrix} \begin{bmatrix} \varvec{I} + \left[ \varvec{\theta }_\text {r}\right] _\times &{} \varvec{\theta }_\text {t} \\ \varvec{0} &{} 1 \end{bmatrix} {}_\text {M}\varvec{\widetilde{X}}, \end{aligned}$$
(22)

with the rotational variation \(\varvec{\theta }_\text {r} \in {\mathbb {R}}^3\), the translational variation \(\varvec{\theta }_\text {t} \in {\mathbb {R}}^3\), and the full variation vector \(\varvec{\theta }^\top = \begin{bmatrix} \varvec{\theta }_\text {r}^\top&\varvec{\theta }_\text {t}^\top \end{bmatrix}\). Note that, since the object is typically moved significantly more than the camera, it is more natural to vary 3D points in the model reference frame \(\text {M}\) instead of the camera reference frame \(\text {C}\). Also, the variation in the model reference frame has the advantage that a simple extension of the algorithm to multiple cameras is possible.
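
A short Python/NumPy sketch of the linearized point variation in Eq. (22) is given below; the function and variable names are chosen for illustration only.

    import numpy as np

    def skew(r):
        """Skew-symmetric matrix [r]_x of a 3D vector r."""
        return np.array([[0.0, -r[2], r[1]],
                         [r[2], 0.0, -r[0]],
                         [-r[1], r[0], 0.0]])

    def vary_point(X_m, R_cm, t_cm, theta_r, theta_t):
        """Linear variation of a model point in the camera frame (Eq. (22)).

        X_m: 3D point in the model frame M, (R_cm, t_cm): current pose estimate
        from M to the camera frame C, theta_r / theta_t: rotational and
        translational variation applied in the model frame M.
        """
        X_m_varied = (np.eye(3) + skew(theta_r)) @ X_m + theta_t
        return R_cm @ X_m_varied + t_cm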

3.2 Sparse Viewpoint Model

In contrast to most state-of-the-art region-based methods, we do not use the 3D geometry in the form of a mesh model directly. Instead, similar to Tan et al. (2017), we employ a representation that we call a sparse viewpoint model. To create this model, the 3D geometry is rendered from a number of \(n_\text {v}\) viewpoints all around the object. Virtual cameras are thereby placed on the vertices of a geodesic grid that surrounds the object. For each rendering, \(n_\text {c}\) points \(\varvec{x}_i\in {\mathbb {R}}^2\) are randomly sampled from the contour of the model. Subsequently, the vectors \(\varvec{n}_i\in {\mathbb {R}}^2\) that are normal to the contour are approximated for each point. Note that \(\Vert \varvec{n}_i\Vert _2 = 1\). An illustration of a rendering with sampled 2D contour points and normal vectors is shown in Fig. 7. Based on the 2D entities, 3D vectors with respect to the model reference frame are then reconstructed as follows

$$\begin{aligned} {}_\text {M}\varvec{\widetilde{X}}_i = {}_\text {M}\varvec{T}_\text {C} \ \varvec{\tilde{\pi }}^{-1}\left( \varvec{x}_i, d_{Zi}\right) , \end{aligned}$$
(23)
$$\begin{aligned} {}_\text {M}\varvec{N}_i = {}_\text {M}\varvec{R}_\text {C} \begin{bmatrix} \varvec{n}_i\\ 0 \end{bmatrix}, \end{aligned}$$
(24)

where the tilde notation in \(\varvec{\tilde{\pi }}^{-1}\) indicates that the 3D model point is returned in homogeneous form and \(d_{Zi}\) is the depth value from the rendering. Note that in this case, \(\text {C}\) denotes the reference frame of the virtual camera from which the rendering was created. In addition to those vectors, we also compute the orientation vector \({}_\text {M}\varvec{v} = {}_\text {M}\varvec{R}_\text {C}\varvec{e}_\text {Z}\) that points from the camera to the model center, where \(\varvec{e}_\text {Z} = \begin{bmatrix}0&0&1\end{bmatrix}^\top \). The computed 3D model points, normal vectors, and the orientation vector are then stored for each view.
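
Assuming that contour points, 2D normals, and depth values have already been sampled from a rendering, the reconstruction of Eqs. (23) and (24) for a single view might be sketched as follows in Python/NumPy; the rendering and contour sampling themselves are omitted, and the intrinsics of the virtual camera are hypothetical.

    import numpy as np

    def view_from_rendering(x_contour, n_contour, d_Z, R_mc, t_mc,
                            f_x=600.0, f_y=600.0, p_x=320.0, p_y=240.0):
        """Reconstruct 3D model points and normals for one view (Eqs. (23), (24)).

        x_contour: sampled 2D contour points, n_contour: approximated 2D unit
        normals, d_Z: rendered depth values, (R_mc, t_mc): pose of the virtual
        camera C expressed in the model frame M.
        """
        points, normals = [], []
        for x, n, d in zip(x_contour, n_contour, d_Z):
            X_c = d * np.array([(x[0] - p_x) / f_x, (x[1] - p_y) / f_y, 1.0])
            points.append(R_mc @ X_c + t_mc)                      # Eq. (23)
            normals.append(R_mc @ np.array([n[0], n[1], 0.0]))    # Eq. (24)
        v = R_mc @ np.array([0.0, 0.0, 1.0])   # orientation vector towards the model
        return np.array(points), np.array(normals), v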

The sparse viewpoint model allows for a highly efficient representation of the model contour. Given a specific pose with \({}_\text {M}\varvec{R}_\text {C}\) and \({}_\text {C}\varvec{t}_\text {M}\), the process of rendering the model and computing the contour reduces to a simple search for the closest precomputed view \(i_\text {v}\)

$$\begin{aligned} i_\text {v} = \mathop {\mathrm {arg\,max}}_{i \in \{1, \dots ,n_\text {v}\}}\left( {}_\text {M}\varvec{v}_{i}^\top {}_\text {M}\varvec{R}_\text {C} {}_\text {C}\varvec{t}_\text {M}\right) , \end{aligned}$$
(25)

and the subsequent projection of the corresponding 3D model points and normal vectors into the image. Note that this high efficiency is especially important during the optimization of the joint posterior probability, where the pose changes in each iteration.
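
Under the assumption that all orientation vectors are stored in a single array, this lookup reduces to one matrix-vector product, as in the following sketch.

    import numpy as np

    def closest_view(orientation_vectors, R_mc, t_cm):
        """Select the closest precomputed view according to Eq. (25).

        orientation_vectors: (n_v, 3) array of stored vectors v in the model
        frame, R_mc: rotation from the camera to the model frame, t_cm:
        translation from the model to the camera frame.
        """
        return int(np.argmax(orientation_vectors @ (R_mc @ t_cm)))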

3.3 Joint Posterior Probability

In the following, we combine the developed sparse viewpoint model with the correspondence line model from Sect. 2 to define a joint posterior probability with respect to the pose variation. However, before probabilities can be calculated, the location and orientation of correspondence lines need to be defined. For this, 3D model points and normal vectors from the closest view of the sparse viewpoint model are projected into the image using the following equations

$$\begin{aligned} \varvec{c}_i = \varvec{\pi }\left( {}_\text {C}\varvec{X}_i\right) , \end{aligned}$$
(26)
$$\begin{aligned} \varvec{n}_i = \left( {}_\text {C}\varvec{R}_\text {M}\,{}_\text {M}\varvec{N}_i\right) _{2\times 1}, \end{aligned}$$
(27)

where \({}_\text {C}\varvec{X}_i\) is the 3D model point transformed into the camera reference frame according to Eq. (20), the normal vector \(\varvec{n}_i\) is normalized to \(\Vert \varvec{n}_{i}\Vert _{2}=1\), and \((\,)_{2\times 1}\) denotes the first two elements of a vector.

Once all correspondence lines have been defined, we are able to vary the current pose and calculate contour distances \(d_i\) with respect to the pose variation vector \(\varvec{\theta }\). Contour distances are thereby calculated as the distances along normal vectors \(\varvec{n}_i\) from correspondence line centers \(\varvec{c}_i\) to projected 3D model points \(\varvec{X}_i\)

$$\begin{aligned} d_i(\varvec{\theta }) = \varvec{n}_i^\top \big (\varvec{\pi }({}_\text {C}\varvec{X}_i(\varvec{\theta }))-\varvec{c}_i\big ). \end{aligned}$$
(28)

Note that the same 3D model points \(\varvec{X}_i\) are used as for the definition of correspondence lines. Also, while we do not write this explicitly, 3D model points \({}_\text {C}\varvec{X}_i\) and contour distances \(d_i\) also depend on the current pose estimate \({}_\text {C}\varvec{T}_\text {M}\), which might be different from the pose that was used to define correspondence lines. An example of multiple correspondence lines with varied contour distances is shown in Fig. 8.
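
Combining Eq. (28) with the point variation of Eq. (22), the contour distance for one correspondence line can be sketched as follows; the intrinsic parameters and variable names are again illustrative.

    import numpy as np

    def contour_distance(c, n, X_c, f_x=600.0, f_y=600.0, p_x=320.0, p_y=240.0):
        """Contour distance d_i of Eq. (28): signed distance along the normal n
        from the correspondence line center c to the projection of the varied
        3D model point X_c (given in the camera frame, e.g. from Eq. (22))."""
        x = np.array([X_c[0] / X_c[2] * f_x + p_x,
                      X_c[1] / X_c[2] * f_y + p_y])
        return float(n @ (x - c))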

Fig. 8

Correspondence lines defined by a center \(\varvec{c}_i\) and a normal vector \(\varvec{n}_i\). Varied contour distances \(d_i\) are measured along the correspondence lines from the centers \(\varvec{c}_i\) to the projected 3D model points \({}_\text {C}\varvec{X}_i\), which all depend on the same pose variation \(\varvec{\theta }\). The object contour of the original pose estimate, which was used to define the correspondence lines, is indicated by a dotted line. The current estimate of the contour that depends on the pose variation vector \(\varvec{\theta }\) is shown by a dashed line. The ground truth segmentation that we try to estimate is given by the foreground region \(\varvec{\varOmega }_\text {f}\) in yellow and the background region \(\varvec{\varOmega }_\text {b}\) in blue. Note that while contours are illustrated as continuous lines, in our method, they are represented by points and normal vectors from the closest view of the sparse viewpoint model (Color figure online)

Finally, assuming a number of \(n_\text {c}\) independent correspondence lines and using the discrete scale-space formulation from Sect. 2.3 to improve efficiency, the joint posterior probability can be calculated as

$$\begin{aligned} p(\varvec{\theta }\mid \varvec{\mathcal {D}}) \propto \prod _{i=1}^{n_\text {c}}p(d_{\text {s}i}(\varvec{\theta })\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}), \end{aligned}$$
(29)

where \(\varvec{\mathcal {D}}\) describes the data from all correspondence lines. Note that the transformation of contour distances \(d_i\) from the original space to the discrete scale-space is given by Eq. (9). The developed joint posterior probability describes how well the current pose estimate explains the segmentation of the image into a foreground region, which corresponds to the tracked object, and a background region.

3.4 Optimization

To maximize the joint posterior probability, we estimate the variation vector \(\varvec{\hat{\theta }}\) and iteratively update the pose. For a single iteration, the variation vector is calculated using the Newton method with Tikhonov regularization

$$\begin{aligned} \varvec{\hat{\theta }} = \bigg (-\varvec{H} + \begin{bmatrix} \lambda _\text {r} \varvec{I}_3 &{} \varvec{0}\\ \varvec{0} &{} \lambda _\text {t} \varvec{I}_3 \end{bmatrix} \bigg )^{-1}\varvec{g}, \end{aligned}$$
(30)

where \(\varvec{g}\) is the gradient vector, \(\varvec{H}\) is the Hessian matrix, \(\varvec{I}_3\) the \(3\times 3\) identity matrix, and \(\lambda _\text {r}\) and \(\lambda _\text {t}\) are the regularization parameters for rotation and translation, respectively. The gradient vector and the Hessian matrix are defined as the first- and second-order derivatives of the joint log-posterior

$$\begin{aligned} \varvec{g}^\top = \frac{\partial }{\partial \varvec{\theta }} \ln \big (p(\varvec{\theta }\mid \varvec{\mathcal {D}})\big ) \Big \vert _{\varvec{\theta }=\varvec{0}}, \end{aligned}$$
(31)
$$\begin{aligned} \varvec{H} = \frac{\partial ^2}{\partial \varvec{\theta }^2} \ln \big (p(\varvec{\theta }\mid \varvec{\mathcal {D}})\big )\Big \vert _{\varvec{\theta }=\varvec{0}}. \end{aligned}$$
(32)

Using the logarithm has the advantage that scaling terms vanish and products turn into summations. Note that the Hessian represents the curvature of the distribution at a specific location, which for Gaussian probability functions is constant and directly corresponds to the negative inverse variance. Given this probabilistic interpretation, it can be argued that regularization parameters correspond to a prior probability. This prior controls how much we believe in the previous pose compared to the current estimate described by the gradient and Hessian. Consequently, for directions in which the Hessian indicates high uncertainty, the regularization helps to keep the optimization stable and to avoid pose changes that are not supported by sufficient data.

Finally, given a robust estimate for the variation vector, the predicted pose can be updated as follows

$$\begin{aligned} _\text {C}\varvec{T}_\text {M} = {}_\text {C}\varvec{T}_\text {M} \begin{bmatrix} \exp ([\varvec{\hat{\theta }}_\text {r}]_\times ) &{} \varvec{\hat{\theta }}_\text {t} \\ \varvec{0} &{} 1 \end{bmatrix}. \end{aligned}$$
(33)

Because of the exponential map, no orthonormalization is necessary. By iteratively repeating this process, we are able to optimize towards the pose that best explains the segmentation of the image.
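
The regularized Newton step of Eq. (30) and the pose update of Eq. (33) might be sketched as follows in Python/NumPy, with the exponential map of Eq. (21) evaluated in closed form via Rodrigues' formula. The regularization values follow Sect. 4.3, while all function names and interfaces are assumptions of the sketch.

    import numpy as np

    def skew(r):
        return np.array([[0.0, -r[2], r[1]],
                         [r[2], 0.0, -r[0]],
                         [-r[1], r[0], 0.0]])

    def newton_step(g, H, lambda_r=5000.0, lambda_t=500000.0):
        """Regularized Newton step of Eq. (30) for theta = [theta_r, theta_t]."""
        regularization = np.diag([lambda_r] * 3 + [lambda_t] * 3)
        return np.linalg.solve(-H + regularization, g)

    def exp_rotation(r):
        """Exponential map of Eq. (21), evaluated with Rodrigues' formula."""
        angle = np.linalg.norm(r)
        K = skew(r)
        if angle < 1e-12:
            return np.eye(3) + K
        return (np.eye(3) + np.sin(angle) / angle * K
                + (1.0 - np.cos(angle)) / angle ** 2 * (K @ K))

    def update_pose(T_cm, theta_hat):
        """Pose update of Eq. (33); no orthonormalization is required."""
        T_var = np.eye(4)
        T_var[:3, :3] = exp_rotation(theta_hat[:3])
        T_var[:3, 3] = theta_hat[3:]
        return T_cm @ T_var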

3.5 Gradient and Hessian Approximation

In the following, the gradient vector and the Hessian matrix are approximated in a way that ensures both fast convergence and high accuracy. Using the chain rule, we write

$$\begin{aligned} \varvec{g}^\top = \sum _{i=1}^{n_\text {c}} \frac{\partial \ln \left( p\left( d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) \right) }{\partial d_{\text {s}i}} \frac{\partial d_{\text {s}i}}{\partial {}_\text {C}\varvec{X}_{i}} \frac{\partial {}_\text {C}\varvec{X}_{i}}{\partial \varvec{\theta }} \bigg \vert _{\varvec{\theta }=\varvec{0}}, \end{aligned}$$
(34)
$$\begin{aligned} \varvec{H} \approx \sum _{i=1}^{n_\text {c}} \frac{\partial ^2\ln \left( p\left( d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) \right) }{\partial {d_{\text {s}i}}^2} \left( \frac{\partial d_{\text {s}i}}{\partial {}_\text {C}\varvec{X}_{i}}\frac{\partial {}_\text {C}\varvec{X}_{i}}{\partial \varvec{\theta }} \right) ^\top \left( \frac{\partial d_{\text {s}i}}{\partial {}_\text {C}\varvec{X}_{i}} \frac{\partial {}_\text {C}\varvec{X}_{i}}{\partial \varvec{\theta }}\right) \bigg \vert _{\varvec{\theta }=\varvec{0}}. \end{aligned}$$
(35)

Note that for the Hessian matrix, second-order partial derivatives with respect to \(d_{\text {s}i}\) and \({}_\text {C}\varvec{X}_{i}\) are neglected. Resulting errors are left to the iterative nature of the optimization. Using Eq. (22), the first-order derivative of the 3D model point \({}_\text {C}\varvec{X}_{i}\) is calculated as

$$\begin{aligned} \frac{\partial _\text {C}\varvec{X}_{i}}{\partial \varvec{\theta }} = {}_\text {C}\varvec{R}_\text {M} \begin{bmatrix} -\left[ {}_\text {M}\varvec{X}_i\right] _\times&\,&\varvec{I}_3 \end{bmatrix}. \end{aligned}$$
(36)

With respect to the scaled contour distance \(d_{\text {s}i}\), both Eq. (28) and (9) are used to write

$$\begin{aligned} \frac{\partial d_{\text {s}i}}{\partial {}_\text {C}\varvec{X}_{i}} = \frac{\bar{n}_i}{s} \frac{1}{{}_\text {C}Z_{i}^2} \begin{bmatrix} n_{xi} f_x {}_\text {C}Z_{i}&n_{yi} f_y {}_\text {C}Z_{i}&-n_{xi} f_x {}_\text {C}X_{i} - n_{yi} f_y {}_\text {C}Y_{i} \end{bmatrix}. \end{aligned}$$
(37)

For the calculation of the required first- and second-order derivatives of the log-posterior, we differentiate between global and local optimization. To some extent, this is similar to our previous approach (Stoiber et al. 2020). However, in contrast to that work, we propose different approximations for the local optimization. Also, we either apply global or local optimization and use the same definition of derivatives for all correspondence lines instead of mixing them.

In the case of global optimization, the posterior probability distribution of individual correspondence lines is approximated by a normal distribution \(\mathcal {N}(d_{\text {s}i}\mid \mu _i, \sigma _i^2)\). The required mean and standard deviation \(\mu _i\) and \(\sigma _i\) are thereby estimated from a set of discretized contour distances \(d_{\text {s}i}\) and their corresponding probability values. An example of the approximation of a discrete posterior probability distribution is shown in Fig. 9. Based on the normal distribution, the first- and second-order derivatives are calculated as

$$\begin{aligned} \frac{\partial \ln \left( p\left( d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) \right) }{\partial {d_{\text {s}i}}} \approx -\frac{1}{\sigma _i^2}\left( d_{\text {s}i} - \mu _i\right) , \end{aligned}$$
(38)
$$\begin{aligned} \frac{\partial ^2\ln \left( p\left( d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) \right) }{\partial {d_{\text {s}i}}^2} \approx -\frac{1}{\sigma _i^2}. \end{aligned}$$
(39)

The approximated derivatives direct the optimization towards the mean \(\mu _i\), using the variance \(\sigma _i^2\) to consider uncertainty. Note that while in the real world, the mean does not exactly coincide with the maximum, it is typically quite close. At the same time, using the approximation has the advantage of fast convergence and that the optimization avoids local minima resulting from invalid pixel-wise posteriors and image noise.
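
The following Python/NumPy sketch assembles the contribution of a single correspondence line to the gradient and Hessian for the global optimization, combining Eqs. (34) to (39). The estimation of the mean and variance from the discretized distribution follows the description above, while all interfaces and the intrinsic parameters are hypothetical.

    import numpy as np

    def skew(r):
        return np.array([[0.0, -r[2], r[1]],
                         [r[2], 0.0, -r[0]],
                         [-r[1], r[0], 0.0]])

    def mean_and_variance(d_values, probabilities):
        """Estimate mu_i and sigma_i^2 from a discretized posterior distribution."""
        p = probabilities / probabilities.sum()
        mu = float(np.sum(p * d_values))
        return mu, float(np.sum(p * (d_values - mu) ** 2))

    def line_jacobian(X_m, X_c, R_cm, n, n_bar, s, f_x=600.0, f_y=600.0):
        """Jacobian of the scaled contour distance with respect to the pose
        variation, combining Eqs. (36) and (37)."""
        dX_dtheta = R_cm @ np.hstack([-skew(X_m), np.eye(3)])          # Eq. (36)
        X, Y, Z = X_c
        dd_dX = (n_bar / s) / Z ** 2 * np.array(
            [n[0] * f_x * Z, n[1] * f_y * Z,
             -n[0] * f_x * X - n[1] * f_y * Y])                        # Eq. (37)
        return dd_dX @ dX_dtheta                                       # 6-vector

    def global_contribution(d_s, mu, sigma2, J):
        """Contribution of one correspondence line to the gradient and Hessian
        (Eqs. (34), (35)) using the normal approximation of Eqs. (38), (39)."""
        g_i = -(d_s - mu) / sigma2 * J          # Eq. (38) times the Jacobian
        H_i = -np.outer(J, J) / sigma2          # Eq. (39) times J^T J
        return g_i, H_i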

Fig. 9

Discrete posterior probability distribution with noisy measurements. For global optimization, the distribution is approximated by a normal distribution \(\mathcal {N}(d_{\text {s}i} \mid \mu _i, \sigma _i^2)\). The normal distribution and its mean \(\mu _i\) are illustrated in blue. In the case of local optimization, only two discrete probability values that are closest to the current estimate of the contour distance \(d_{\text {s}i}(\varvec{\theta })\) are considered. The two discrete probability values \(p(d_{\text {s}i}^-\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i})\) and \(p(d_{\text {s}i}^+\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i})\), which are used to approximate the first-order derivative, are colored in red (Color figure online)

Once the optimization is closer to the maximum, the global mean is not a good enough estimate, and more detailed refinement is required. In such cases, the algorithm switches to local optimization. We thereby use the probability values of the two discrete contour distances \(d_{\text {s}i}^-\) and \(d_{\text {s}i}^+\) that are closest to the current estimate \(d_{\text {s}i}(\varvec{\theta })\) and approximate the first-order derivatives using a weighting term \(\frac{\alpha _\text {s}}{\sigma _i^2}\) and finite differences

$$\begin{aligned} \frac{\partial \ln \left( p\left( d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) \right) }{\partial {d_{\text {s}i}}} \approx \frac{\alpha _\text {s}}{\sigma _i^2}\ln \left( \frac{p\left( d_{\text {s}i}^+\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) }{p\left( d_{\text {s}i}^-\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i}\right) }\right) . \end{aligned}$$
(40)

For second-order derivatives, the global approximation from Eq. (39) is used. Note that weighting the first-order derivative with the variance \(\sigma _i^2\) improves robustness because correspondence lines with high uncertainty are considered less important. Simultaneously, the step size \(\alpha _\text {s}\) helps to balance the weight and specifies how far the optimization proceeds, directly scaling the variation vector \(\varvec{\hat{\theta }}\). The same first- and second-order derivatives can also be derived using inverse-variance weighting and a constant curvature of \(\frac{1}{\alpha _\text {s}}\) for the second-order derivative. A detailed derivation of this interpretation is given in Appendix D.
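
A corresponding sketch of the local first-order derivative of Eq. (40) is given below; the guard against zero probabilities is an assumption of the sketch.

    import numpy as np

    def local_first_order_derivative(p_minus, p_plus, sigma2, alpha_s=1.3):
        """Local approximation of the first-order derivative (Eq. (40)), based on
        the probability values of the two discrete contour distances closest to
        the current estimate; the second-order derivative keeps the global
        approximation -1 / sigma2 of Eq. (39)."""
        eps = 1e-30                               # guard against zero probabilities
        return alpha_s / sigma2 * np.log((p_plus + eps) / (p_minus + eps))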

Finally, apart from the choice of derivatives, the parameterization of smoothed step functions and the corresponding shape of posterior probability distributions significantly influences the optimization. To study this effect, we consider the first-order derivatives of the log-posteriors that are shown in Fig. 5. While for Gaussian distributions, linear first-order derivatives lead to the estimation of the weighted mean over all correspondence lines, for Laplace distributions, binary derivatives guide the optimization towards the weighted median. Note that this again corresponds well to the interpretation of local and global uncertainty modeled by the slope parameter \(s_\text {h}\) and the amplitude parameter \(\alpha _\text {h}\). If only local uncertainty exists, it is advantageous to consider the magnitude of errors in the contour distance and optimize for the mean. At the same time, in the case of global noise, it is reasonable to only consider the direction of errors, and conduct the optimization with respect to the median.

4 Implementation

The following section provides implementation details for the developed algorithm. We thereby start with the generation of the sparse viewpoint model and the calculation of color histograms. This is followed by a description of the tracking process. Finally, we explain how known occlusions can be considered. All mentioned parameter values are carefully chosen to maximize tracking quality while not requiring unreasonable amounts of computation. Note that the source code of SRT3D is publicly available on GitHub to ensure reproducibility and to allow full reusability.

4.1 Sparse Viewpoint Model

For the sparse viewpoint model, \(n_\text {v} = 2562\) different views are considered. They are generated by subdividing the triangles of an icosahedron 4 times, resulting in an angle of approximately \(4^\circ \) between neighboring views. Virtual cameras that are used for the rendering are placed at a distance of \(0.8\,\hbox {m}\) to the object center. For all views, the orientation vector \({}_\text {M}\varvec{v}\) and a constant number of \(n_\text {c}=200\) model points \({}_\text {M}\varvec{X}_i\) and normal vectors \({}_\text {M}\varvec{N}_i\) are computed. In addition, for each point and view, we also compute so-called continuous distances for the foreground and background. Continuous distances thereby describe the distance from the 2D model point \(\varvec{x}_i\) along the line defined by the normal vector \(\varvec{n}_i\) for which the foreground and background are not interrupted by each other. After their computation in the rendered image, they are converted and stored in meters. The values are later used by the tracker to disable individual correspondence lines for which continuous distances fall below a certain threshold and for which the assumption that only a single transition between foreground and background is present along the correspondence line is therefore not sufficiently fulfilled.

4.2 Color Histograms

For the estimation of the color probability distributions \(p(\varvec{y}\mid m_\text {f})\) and \(p(\varvec{y}\mid m_\text {b})\), color histograms are used. Each dimension of the RGB color space is discretized by 32 equidistant bins, leading to a total of 32768 values. The computation of the color histograms starts either from the current pose estimate or from an initial pose, provided, for example, by a 3D object detection pipeline. Based on this pose, 3D model points and normal vectors are projected into the image using Eqs. (26) and (27). After an offset of one pixel, the first 18 pixels are considered in both the positive and negative direction of the normal vector. Pixel colors along this line are assigned to either the foreground or background histogram, depending on which side of the projected model point they lie. Note that fewer than 18 pixels are considered if a transition between foreground and background occurs within a shorter distance. Also, in cases where the contour location is more uncertain, it is reasonable to use an offset larger than one pixel.
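
A minimal sketch of this histogram construction is given below, assuming an 8-bit RGB image, projected 2D model points, and 2D normal vectors that point from the foreground towards the background. For brevity, the early termination at foreground–background transitions is omitted; all names are illustrative.

```python
import numpy as np

def build_histograms(image, points_2d, normals_2d, offset=1, n_pixels=18, n_bins=32):
    """Accumulate foreground/background RGB histograms along projected normals."""
    hist_f = np.zeros((n_bins, n_bins, n_bins))
    hist_b = np.zeros((n_bins, n_bins, n_bins))
    height, width, _ = image.shape
    bin_width = 256 // n_bins  # 8 intensity values per bin
    for x, n in zip(points_2d, normals_2d):
        n = n / np.linalg.norm(n)
        # -n points into the object (foreground), +n away from it (background).
        for sign, hist in ((-1.0, hist_f), (1.0, hist_b)):
            for step in range(offset, offset + n_pixels):
                u, v = np.round(x + sign * step * n).astype(int)
                if 0 <= u < width and 0 <= v < height:
                    r, g, b = image[v, u] // bin_width
                    hist[r, g, b] += 1.0
    # Normalize to the probability distributions p(y | m_f) and p(y | m_b).
    return hist_f / max(hist_f.sum(), 1.0), hist_b / max(hist_b.sum(), 1.0)
```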

Due to motion or dynamic illumination, color statistics of both the foreground and background are continuously changing during tracking. To take those changes into account while at the same time considering previous observations, we use online adaptation. Based on Bibby and Reid (2008), we thereby update the histograms as follows

$$\begin{aligned} p_t\left( \varvec{y}\mid m_i\right) = \alpha _i p\left( \varvec{y}\mid m_i\right) + \left( 1 - \alpha _i\right) p_{t - 1}\left( \varvec{y}\mid m_i\right) , \end{aligned}$$
(41)

with \(i \in \{\text {f}, \text {b}\}\), where \(\alpha _\text {f} = 0.2\) and \(\alpha _\text {b} = 0.2\) are the learning rates for the foreground and background, respectively. Note that \(p(\varvec{y}\mid m_i)\) is the observed histogram, while \(p_t(\varvec{y}\mid m_i)\) and \(p_{t-1}(\varvec{y}\mid m_i)\) are the adapted histograms of the current and previous time step, respectively. For initialization, we directly use the observed histograms instead of blending them with previous values.
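
Expressed in code, the adaptation of Eq. (41) reduces to a simple blend; the function name below is ours.

```python
def adapt_histogram(observed, previous, learning_rate=0.2):
    """Online adaptation of Eq. (41): p_t = alpha * p_observed + (1 - alpha) * p_{t-1}."""
    return learning_rate * observed + (1.0 - learning_rate) * previous

# At initialization, the observed histogram is used directly; afterwards, e.g.:
# hist_f = adapt_histogram(observed_f, hist_f)
```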

4.3 Tracking Process

To start tracking, an initial pose is required, which is typically provided either by a 3D object detection pipeline or by dataset annotations. Based on this pose and a corresponding camera image, the color histograms for the foreground and background are initialized. After initialization, a tracking step is executed for each new image that is streamed from the camera. An overview of the computations performed in a single tracking step is given in Algorithm 1.


Starting from a new image and the previous pose estimate \({}_\text {C}\varvec{T}_\text {M}\), we first retrieve the closest view of the sparse viewpoint model. Model points \({}_\text {M}\varvec{X}_i\) and normal vectors \({}_\text {M}\varvec{N}_i\) are then projected into the image plane to define correspondence lines. After that, continuous distances from the sparse viewpoint model are used to reject correspondence lines whose continuous distances are below 6 segments. For the remaining correspondence lines, the posterior probability distribution \(p(d_{\text {s}i}\mid \omega _{\text {s}i},\varvec{l}_{\text {s}i})\) is evaluated at 12 discrete values \(d_{\text {s}i}\in \{-5.5,-4.5,\dots ,5.5\}\). In the calculation, we use 8 precomputed values for the smoothed step functions \(h_\text {f}\) and \(h_\text {b}\), corresponding to \(x\in \{-3.5, -2.5,\dots ,3.5\}\). Also, a minimal offset \(\varDelta r_i\) is chosen such that the line coordinates \(r_i\) point to pixel centers while the scaled line coordinates \(r_{\text {s}i}\) ensure matching values for \(x = r_{\text {s}i} - d_{\text {s}i}\). In our case, this means that \(r_{\text {s}i} \in {\mathbb {Z}}\). Having computed the distributions, two iterations of the regularized Newton optimization are executed. For the first iteration, the global optimization is used to quickly converge towards a rough pose estimate. In the second iteration, the local optimization is employed to refine this pose, using a step size of \(\alpha _\text {s} = 1.3\). As regularization parameters, we use \(\lambda _\text {r}=5000\) and \(\lambda _\text {t}=500000\).
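
To illustrate the role of the regularization parameters, the sketch below shows one plausible form of a Tikhonov-regularized Newton update over the six-dimensional pose variation. The sign convention, the ordering of rotation before translation, and the function name are assumptions made for illustration, not the reference implementation.

```python
import numpy as np

def regularized_newton_step(g, H, lambda_r=5000.0, lambda_t=500000.0, step_size=1.0):
    """One regularized Newton update for maximizing a log-posterior.

    g, H: gradient and Hessian with respect to the pose variation
    (first three entries rotational, last three translational).
    """
    regularization = np.diag([lambda_r] * 3 + [lambda_t] * 3)
    theta_hat = np.linalg.solve(-H + regularization, g)
    return step_size * theta_hat

# Toy example with a random negative-definite Hessian:
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H = -(A @ A.T + np.eye(6))
g = rng.normal(size=6)
print(regularized_newton_step(g, H))
```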

Fig. 10 Overview of all objects in the RBOT dataset (Tjaden et al. 2018). Objects from the LINEMOD dataset (Hinterstoisser et al. 2013) and the Rigid Pose dataset (Pauwels et al. 2013) are marked with \(^\star \) and \(^\diamond \), respectively

Fig. 11 Images from the RBOT dataset (Tjaden et al. 2018), with one example image for the regular, dynamic light, noise, and occlusion sequences. The sequences show the ape, candy, glue, and vise objects, respectively. In addition, the occlusion sequence features a squirrel object that occludes the vise

Table 1 Tracking success rates for state-of-the-art approaches on the RBOT dataset (Tjaden et al. 2018). Methods that are not purely region-based are indicated by a \(^\star \). The best results are highlighted in bold. The second-best values are underlined

To find the final pose, the process is repeated seven times. We thereby choose larger scales of \(s=5\) for the first iteration and \(s=2\) for the second and third iterations. In all other iterations, a scale of \(s=1\) is adopted. This choice has the effect that a large area with low resolution is considered in the beginning, while short lines with high resolution are used in later iterations. An example of correspondence lines at different scales is shown in Fig. 1. Note that scale values typically depend on the area that needs to be covered by the tracker and the size of frame-to-frame pose differences. Finally, having estimated the pose for the current image, the prediction is used to update the color histograms. After that, the tracker waits for a new image to arrive.
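
The resulting coarse-to-fine schedule can be summarized by the following illustrative loop, in which run_iteration is a placeholder for the correspondence-line computation and the two regularized Newton iterations of a single cycle.

```python
scales = [5, 2, 2, 1, 1, 1, 1]  # coarse-to-fine schedule over the seven cycles

def run_iteration(pose, scale):
    # Placeholder: define correspondence lines at the given scale, evaluate
    # their posterior distributions, and apply the global and local optimization.
    return pose

pose = None  # stands in for the previous pose estimate
for scale in scales:
    pose = run_iteration(pose, scale)
```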

4.4 Occlusion Modeling

While the algorithm is quite robust to unknown occlusions, tracking results can be further improved by explicitly considering known occlusions. For this, an ID is assigned to each known object. All objects are then rendered into a depth image and an image that contains object ID values. Using a custom shader, we combine information from the two images and compute an occlusion mask that encodes for each pixel, as binary values, which objects are visible. To consider uncertainty in the object pose, the shader evaluates a region with a radius of 4 pixels and assigns the object ID with the smallest depth value to the center. If only the background is present, all object IDs are considered visible. To improve efficiency, a smaller image with a quarter of the camera resolution is used. Finally, to reject occluded correspondence lines, the algorithm simply checks the occlusion mask values at correspondence line centers.
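
The following CPU sketch mirrors this mask computation for a rendered depth image and object-ID image. The actual implementation uses a shader; the function name and the conventions assumed here (background ID 0, object IDs starting at 1) are illustrative.

```python
import numpy as np

def occlusion_mask(depth, ids, n_objects, radius=4, background_id=0):
    """Mark, for every pixel, which object IDs are considered visible."""
    height, width = depth.shape
    visible = np.zeros((height, width, n_objects + 1), dtype=bool)
    for v in range(height):
        for u in range(width):
            v0, v1 = max(0, v - radius), min(height, v + radius + 1)
            u0, u1 = max(0, u - radius), min(width, u + radius + 1)
            region_ids = ids[v0:v1, u0:u1]
            region_depth = depth[v0:v1, u0:u1]
            foreground = region_ids != background_id
            if not foreground.any():
                visible[v, u, :] = True   # only background: all objects count as visible
            else:
                closest = region_ids[foreground][np.argmin(region_depth[foreground])]
                visible[v, u, closest] = True  # object with the smallest depth wins
    return visible
```

Correspondence lines are then rejected by looking up the mask at their center coordinates, scaled to the quarter-resolution image.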

5 Evaluation

In this section, we present an extensive evaluation of our approach, SRT3D. Both the Region-Based Object Tracking (RBOT) dataset (Tjaden et al. 2018) and the Object Pose Tracking (OPT) dataset (Wu et al. 2017) are used to compare our method to the current state of the art in region-based tracking. We thereby evaluate the quality of the predicted pose as well as the speed of the algorithm. Also, a detailed parameter analysis is conducted that assesses the importance of different settings. Finally, we discuss essential design considerations and remaining limitations. In addition to the content in this section, we provide real-world videos on our project site that demonstrate the tracker’s performance.

5.1 RBOT Dataset

In the following, we first introduce the RBOT dataset, discuss the conducted experiments, and finally compare our results to the current state of the art. The RBOT dataset consists of a collection of 18 objects that are shown in Fig. 10. For each object, four sequences exist: a regular version, one with dynamic light, a sequence with both dynamic light and Gaussian noise, and one with dynamic light and an additional squirrel object that leads to occlusion. An example image for each sequence is shown in Fig. 11. Each sequence consists of 1001 semi-synthetic monocular images in which objects are rendered into real-world footage recorded by a hand-held camera that moves around a cluttered desk.

In the evaluation, experiments are performed as defined by Tjaden et al. (2018). The required translational and rotational errors are calculated as

$$\begin{aligned} e_{\text {t}}(t_k) = \big \Vert {}_\text {C}\varvec{t}_\text {M}(t_k) - {}_\text {C}\varvec{t}_{\text {M}_\text {gt}}(t_k) \big \Vert _2, \end{aligned}$$
(42)
$$\begin{aligned} e_{\text {r}}(t_k) = \cos ^{-1}\bigg (\frac{{\text {trace}}\big ({}_\text {C}\varvec{R}_\text {M}(t_k)^\top {}_\text {C}\varvec{R}_{\text {M}_\text {gt}}(t_k)\big ) - 1}{2}\bigg ), \end{aligned}$$
(43)

where \({}_\text {C}\varvec{R}_{\text {M}_\text {gt}}(t_k)\) and \({}_\text {C}\varvec{t}_{\text {M}_\text {gt}}(t_k)\) are the ground-truth rotation matrix and translation vector for frame \(k \in \{0, \dots , 1000\}\). A pose estimate is considered successful if both \(e_{\text {t}}(t_k) < 5\,\hbox {cm}\) and \(e_{\text {r}}(t_k) < 5^\circ \). After the initialization of the tracker with the ground-truth pose at \(t_0\), the tracker runs until either the recorded sequence ends or tracking becomes unsuccessful. In the latter case, the algorithm is re-initialized with the ground-truth pose at \(t_k\). For the occlusion sequence, the method is evaluated with and without occlusion modeling. In the case of occlusion modeling, both objects are tracked simultaneously. Unsuccessful tracking of the occluding squirrel object is not considered in the reported tracking success. Finally, for the remaining tracker settings, we use \(\alpha _\text {h}=0.36\) and \(s_\text {h} \rightarrow 0 \). A detailed analysis of this choice is given in Sect. 5.3.
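
The success criterion of Eqs. (42) and (43) can be written compactly as follows, assuming rotations are given as 3x3 matrices and translations in meters.

```python
import numpy as np

def tracking_successful(R, t, R_gt, t_gt, max_t=0.05, max_r_deg=5.0):
    """RBOT criterion: translational error < 5 cm and rotational error < 5 degrees."""
    e_t = np.linalg.norm(t - t_gt)
    cos_angle = np.clip((np.trace(R.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    e_r = np.degrees(np.arccos(cos_angle))
    return e_t < max_t and e_r < max_r_deg
```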

Results of the evaluation are shown in Table 1. Our approach is compared to the current state of the art in region-based tracking, as well as the edge-based method of Huang et al. (2020), the algorithms of Li et al. (2021) and Sun et al. (2021) that combine edge and region information, and the method of Liu et al. (2021) that uses descriptor fields in addition to region-based techniques. The comparison shows that SRT3D performs significantly better than previous methods, achieving superior results for most objects and performing best on average. The margin is even larger when only purely region-based methods are considered, with our algorithm performing best for almost all objects and sequences. Considering the average success rate, our approach performs about five percentage points better than the combined method of Sun et al. (2021), six percentage points better than Stoiber et al. (2020), nine percentage points better than Li et al. (2021), and more than 14 percentage points better than the next best, dense, region-based approach by Zhong and Zhang (2019). The superior tracking success compared to our previous approach is especially interesting since the main differences are only an extended smoothed step function and some changes with respect to the optimization. Also, in contrast to all other, dense approaches, no advanced segmentation model is used, which, in theory, is a significant disadvantage.

Table 2 Average runtimes per frame and usage of a GPU for state-of-the-art approaches. Methods that are not purely region-based are indicated by a \(^\star \). For the occlusion modeling scenario, which considers the tracking of two objects, values are shown in parentheses

In addition to tracking success, we also compare average runtimes. A summary for the different algorithms is given in Table 2. The evaluation of SRT3D and our previous method was conducted on the same computer with an Intel Xeon E5-1630 v4 CPU and an Nvidia Quadro P600 GPU. Because of the similarities of the two approaches, we obtain a comparable average runtime of \(1.1\,\hbox {ms}\) for the case without occlusion modeling and an improved average execution time of \(5.1\,\hbox {ms}\) for the modeled occlusion scenario. Note that in the case of occlusion modeling, occlusion masks have to be rendered, and the reported time is for the simultaneous tracking of two objects. In comparison, except for the algorithm of Liu et al. (2021), whose execution time is six times higher, all other methods report average runtimes that are more than one order of magnitude larger. The difference is even more remarkable since SRT3D and our previous approach only utilize a single CPU core and do not require a GPU. In contrast, most competing methods use multithreading and heavily depend on a GPU. In conclusion, while different resources and computers were used, the obtained results highlight the superior efficiency of our sparse region-based method.

5.2 OPT Dataset

While the semi-synthetic RBOT dataset features a large number of objects, a difficult, highly cluttered background, and perfect ground truth, objects are simulated with limited realism, and only very little motion blur is applied. These shortcomings are addressed by the OPT dataset (Wu et al. 2017), which contains real-world recordings of 3D-printed objects on a white background with different speeds and levels of motion blur. In total, the dataset includes six objects and consists of 552 real-world sequences with various lighting conditions and defined trajectories recorded by a robot arm. An example image for each object is shown in Fig. 12. The sequences are classified into the following categories: translation, forward and backward, in-plane rotation, out-of-plane rotation, flashing light, moving light, and free motion.

Fig. 12 Images from the OPT dataset (Wu et al. 2017), featuring the soda, chest, ironman, house, bike, and jet objects

In the experiments, the metric of Wu et al. (2017) is used. For this, we compute the average vertex error

$$\begin{aligned} e_\text {v}(t_k) = \frac{1}{n}\sum _{i=1}^n\big \Vert \big ({}_\text {M}\varvec{\widetilde{X}}_i - {}_\text {M}\varvec{T}_{\text {M}_\text {gt}}(t_k)\,{}_\text {M}\varvec{\widetilde{X}}_i\big )_{3\times 1}\big \Vert _2, \end{aligned}$$
(44)

with \({}_\text {M}\varvec{\widetilde{X}}_i\) a vertex in the 3D mesh geometry of the object and \(n\) the number of vertices. Tracking is considered successful if \(e_\text {v}(t_k) < k_\text {e}d\), where d is the object diameter computed from the maximum vertex distance and \(k_\text {e}\) is an error threshold. The tracking quality for all frames is then measured using an area under curve (AUC) score that integrates the percentage of successfully tracked poses over the interval \(k_\text {e} \in [0,0.2]\). This results in AUC scores between zero and twenty. For the tracker, the amplitude parameter \(\alpha _\text {h}=0.42\) and the slope parameter \(s_\text {h} = 0.5\) are used. Also, for the rotationally symmetric soda object, a larger rotational regularization parameter of \(\lambda _\text {r}=500000\) is adopted. The main reason is that the geometry of the soda object does not constrain the rotation about its vertical axis. In such cases, fluctuations in the gradient and Hessian can lead to drift in the object’s orientation. Using more regularization allows us to mitigate this problem.
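
A minimal sketch of this metric is given below, assuming a vertex array in the model frame and a 4x4 homogeneous transformation between the estimated and ground-truth model frames; the integral over \(k_\text {e}\) is approximated by uniformly sampled thresholds, and the function names are ours.

```python
import numpy as np

def vertex_error(vertices, T_rel):
    """Average vertex error of Eq. (44); vertices: (n, 3) array, T_rel: 4x4 transform."""
    homogeneous = np.hstack([vertices, np.ones((len(vertices), 1))])
    transformed = (T_rel @ homogeneous.T).T[:, :3]
    return np.mean(np.linalg.norm(vertices - transformed, axis=1))

def auc_score(errors, diameter, n_thresholds=200):
    """AUC score in [0, 20]: success percentage integrated over k_e in [0, 0.2]."""
    thresholds = np.linspace(0.0, 0.2, n_thresholds)
    success = np.array([(np.asarray(errors) < k_e * diameter).mean() for k_e in thresholds])
    return 100.0 * 0.2 * success.mean()  # Riemann approximation of the integral
```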

Results for the experiments on the OPT dataset are shown in Table 3. We thereby compare SRT3D to state-of-the-art region-based tracking approaches, as well as an approach from Bugaev et al. (2018) and prominent methods such as PWP3D (Prisacariu and Reid 2012), ElasticFusion (Whelan et al. 2015), UDP (Brachmann et al. 2016), and ORB-SLAM2 (Mur-Artal and Tardós 2017). Note that not all algorithms are dedicated 3D tracking solutions. UDP is a monocular pose detection method, while ElasticFusion and ORB-SLAM2 are visual SLAM approaches for camera pose localization that are applied to the silhouette of the object. For more information on the evaluation of those algorithms, please refer to Wu et al. (2017).

Table 3 AUC scores between zero and twenty for the evaluation on the OPT dataset (Wu et al. 2017), comparing our approach to multiple other algorithms. The best results are highlighted in bold. The second-best values are underlined

The comparison shows that our approach performs significantly better than the current state of the art in region-based tracking developed by Li et al. (2021), Zhong et al. (2020b) and Tjaden et al. (2018), achieving higher AUC scores for each of the six objects. Also, compared to none region-based approaches, we are able to report the highest score for four out of six objects and perform best on average. This is even more remarkable since ORB-SLAM2, which reports better results for the house object, uses gradient-based corner features. In contrast to SRT3D, the algorithm is thus not constrained to the contour but considers information over the entire silhouette. Also, the edge-based algorithm of Bugaev et al. (2018), which performs best for the jet object, uses basin-hopping for global optimization, and, with an average reported runtime of \(683\,\hbox {ms}\), is not real-time capable. In conclusion, the obtained results demonstrate that the excellent performance of SRT3D on simulated data translates well to applications in the real world.

5.3 Parameter Analysis

Having evaluated the performance of our approach, we want to deepen our understanding of different parameter values. For this, the average success rate for the RBOT dataset and the average AUC score for the OPT dataset are plotted over different parameter values. The plots are shown in Fig. 13. Note that the success rate and AUC score are computed over all objects and sequences. Except for the parameter that is analyzed, the same settings as in Sects. 5.1 and 5.2 are used.

Fig. 13 Average tracking success for the RBOT dataset and average AUC score for the OPT dataset over different values of the amplitude parameter \(\alpha _\text {h}\), slope parameter \(s_\text {h}\), step size \(\alpha _\text {s}\), and the rotational and translational regularization parameters \(\lambda _\text {r}\) and \(\lambda _\text {t}\). For the evaluation of the regularization parameters, we set \(\lambda _\text {t} = 100 \lambda _\text {r}\)

The evaluation of the amplitude parameter \(\alpha _\text {h}\) shows that while it significantly influences the tracking success, the effect on the AUC score is much smaller. Knowing that the amplitude parameter models a constant level of noise, this makes sense since the RBOT dataset features highly cluttered images while the OPT dataset only contains a constant white background. For the slope parameter \(s_\text {h}\), the highest tracking success is observed for \(s_\text {h} \rightarrow 0\), and the best AUC score is obtained at \(s_\text {h} = 0.5\). Again, this is well explained by the theoretical interpretation according to which the slope parameter models local uncertainty. Given perfect information about the object geometry for the semi-synthetic RBOT dataset, we do not expect any local uncertainty. At the same time, for the OPT dataset, with imperfectly 3D-printed objects and recorded real-world images, it is important to choose a larger value that allows for a certain level of local uncertainty.

Studying the plot of the step size \(\alpha _\text {s}\), we observe a relatively large plateau around one, with maximum values at \(\alpha _\text {s} = 1.3\) for both the tracking success and the AUC score. This suggests that the parameter depends only weakly on the image data. Particularly interesting are also the results for \(\alpha _\text {s} = 0\). For this setting, no local optimization is considered, showing the capability of the global optimization alone. The good results highlight the excellent performance of the adopted global approximation.

Finally, for the evaluation of regularization, the rotational and translational parameters are modified simultaneously. To consider the different units of radians and meters, we define \(\lambda _\text {t} = 100 \lambda _\text {r}\). As in the previous evaluation of the soda object, we increase the rotational parameter and use \(\lambda _\text {r} = \lambda _\text {t}\). The resulting plot of the tracking success and the AUC score demonstrates the high importance of regularization. If values are chosen too small, the optimization is unstable for directions in which no or very little information is available. At the same time, if parameters are too large, the optimization is slowed down, and the final pose might not be reached. It is thus important to find values that lie in between. In our experience, a good approximation is to use regularization parameters that are on the same order of magnitude as the maximum rotational and translational diagonal elements of the Hessian matrix.
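
This heuristic can be expressed as a short helper; the 6x6 Hessian layout (rotation first, translation second) and the function name are assumptions made for illustration.

```python
import numpy as np

def suggest_regularization(H):
    """Return (lambda_r, lambda_t) on the order of the largest diagonal Hessian entries."""
    diag = np.abs(np.diag(H))
    return diag[:3].max(), diag[3:].max()
```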

In conclusion, the parameter analysis demonstrates that the theoretical interpretations from Sects. 2 and 3 correspond well to the experimental results. In addition to fostering our understanding, this explainability helps to guide the parameter search for new applications. Moreover, the results in Fig. 13 demonstrate that all parameters are well-behaved, with large plateaus around the maximum and no sudden jumps. This has the advantage that parameters are easy to tune, with a broad range of values achieving satisfying results.

5.4 Discussion

The conducted experiments demonstrate the excellent performance of SRT3D. In the following, we want to discuss design considerations that are essential in achieving those results and shed some light on the remaining limitations of the algorithm. With respect to computational efficiency, the biggest performance gain is attributed to the correspondence line model and the sparse nature of the method. In addition, the sparse viewpoint model provides a highly efficient representation, which requires only a simple search to obtain the object contour for the current pose. Also, in contrast to dense methods, it is not necessary to compute a 2D signed distance function, but one can simply use the contour distance. Finally, the discrete scale-space formulation reduces the amount of computation further by combining multiple pixels into segments and supporting the use of precomputed smoothed step functions.

For the quality of the pose estimate, multiple aspects have to be considered. The first important factor is the use of smoothed step functions that provide a realistic modeling of local and global uncertainty. Consequently, this leads to reliable posterior probability distributions. Also, due to the one-dimensionality of correspondence lines and the discrete scale-space implementation, we are able to sample values over posterior probability distributions in reasonable time. This allows us to calculate the mean and the variance. Both estimates constitute the basis for fast-converging global optimization that is independent of local minima. In addition, knowledge about the uncertainty of individual correspondence lines is also used for local optimization, where numerical first-order derivatives are weighted according to the inverse variance. Finally, Tikhonov regularization is another important factor, which helps to constrain the estimate with respect to the previous pose, stabilizing the optimization for directions in which no or very little information is available.

While the described algorithm achieves remarkable results and works very well in a wide variety of applications, some challenges remain. The main limitations are thereby very similar to those of other region-based methods. The biggest constraint is that objects have to be rigid and that an accurate 3D model has to be known. Also, the background has to be distinguishable from the object. If large areas in the background contain colors that are also present in the object, the final result might be perturbed. Another challenge comes from ambiguities where the object silhouette is very similar in the vicinity of a particular pose. Naturally, in such cases, there is not enough information, and it is impossible for the algorithm to converge towards the correct pose. Also, like most tracking approaches, the algorithm can only be used for local optimization, with a limited maximum pose difference between consecutive frames. Finally, if large parts of the object are occluded, the visible part of the contour might not fully constrain the pose of the object, leading to erroneous estimates. To illustrate all the described failure cases, we provide a video on our project site.

6 Conclusion

In this work, we proposed SRT3D, a highly efficient, sparse approach to region-based 3D object tracking that uses correspondence lines to find the pose that best explains the segmentation of the image. In addition to a thorough mathematical derivation of correspondence lines, a major contribution of this work is the development of smoothed step functions that allow the modeling of both local and global uncertainty. The effects of this modeling were analyzed in detail with respect to both theoretical posterior probability distributions and the quality of the final tracking result. For the maximization of the pose-dependent joint posterior probability, we proposed the use of an initial, global optimization towards the mean and a consecutive, local optimization that considers discrete distribution values. We also developed a novel approximation for the local first-order derivative that weights the finite difference value with the inverse variance. Finally, in multiple experiments on the RBOT and OPT datasets, we demonstrated that our algorithm outperforms the current state of the art in region-based tracking by a considerable margin, both in terms of quality and efficiency.

Thanks to this superior performance, we are confident that our approach is useful to a wide range of applications in robotics and augmented reality. Because of its general formulation, it is easy to conceive ideas that extend the method. One possible direction would be to include other developments in region-based tracking, such as advanced segmentation models or occlusion detection. Also, it might be useful to consider additional information, like depth or texture. Finally, we want to highlight that the developed correspondence line model is not limited to the context of 3D tracking but might also be useful to other applications. One possible example is image segmentation. Other methods might thereby show similar progress in terms of quality and efficiency, improving their applicability to the real world.