
1 Introduction

Localizing dynamic objects and estimating the camera ego-motion in 3D space are crucial tasks for autonomous driving. Currently, these objectives are explored separately by end-to-end 3D object detection methods [1, 2] and traditional visual SLAM approaches [3,4,5]. However, it is hard to directly employ these approaches in autonomous driving scenarios. For 3D object detection, there are two main problems: (1) end-to-end 3D regression approaches need large amounts of training data and a heavy labeling workload to precisely annotate all the object boxes in 3D space, and (2) instance 3D detection produces frame-independent results, which are not consistent enough for continuous perception in autonomous driving. To overcome this, we propose a light-weight semantic 3D box inference method depending only on 2D object detection and discrete viewpoint classification (Sect. 4). Compared with direct 3D regression, the 2D detection and classification tasks are easy to train, and the training data can be labeled with only 2D images. However, the proposed 3D box inference is also frame-independent and conditioned on the instance 2D detection accuracy. On the other hand, well-known SLAM approaches can track the camera motion accurately thanks to precise feature geometry constraints. Inspired by this, we similarly utilize sparse feature correspondences to constrain the relative object motion and enforce temporal consistency. However, the object instance pose cannot be obtained from pure feature measurements without a semantic prior. To this end, exploiting the complementary nature of semantic and feature information, we integrate our instance semantic inference model and the temporal feature correlation into a tightly-coupled optimization framework which continuously tracks the 3D objects and recovers the dynamic sparse point cloud with instance accuracy and temporal consistency, as overviewed in Fig. 1. Benefiting from this object-region-aware property, our system is able to estimate the camera pose robustly without being affected by dynamic objects. Thanks to the temporal geometry constraints, we can track objects continuously even in extremely truncated cases (see Fig. 1), where the object pose is hard to infer from a single instance. Additionally, we employ a kinematics motion model for detected cars to ensure consistent orientation and motion estimation; it also serves as important smoothing for distant cars which have few feature observations. Depending only on a mature 2D detection and classification network [6], our system is capable of performing robust ego-motion estimation and 3D object tracking in diverse scenarios. The main contributions are summarized as follows:

Fig. 1.

Overview of our semantic 3D object and ego-motion tracking system. Top: 3D trajectories of the ego-camera and all objects over a long travel history. Bottom, from left to right: stereo feature matching for each object (Sect. 5); an extreme car-truncation case where our system can still track the moving car accurately; dynamic 3D sparse features recovered by our object BA; consistent movement and orientation estimation.

  • A light-weight 3D box inference method using only 2D object detection and the proposed viewpoint classification, which provides the object reprojection contour and occlusion mask for object feature extraction. It also serves as the semantic measurement model for the follow-up optimization.

  • A novel dynamic object bundle adjustment approach which tightly couples the semantic and feature measurements to continuously track the object states with instance accuracy and temporal consistency.

  • Demonstration over diverse scenarios to show the practicability of the proposed system.

2 Related Work

We review the related works in the context of semantic SLAM and learning-based 3D object detection from images.

Semantic SLAM.

With the development of 2D object detection, several works on joint SLAM and semantic understanding have emerged, which we discuss in three categories. The first is semantic-aided localization: [7, 8] focus on correcting the global scale of monocular visual odometry (VO) by incorporating the object metric size of only one dimension into the estimation framework; indoor experiments with small objects and outdoor experiments are conducted in these two works respectively. [9] proposes an object data association method in a probabilistic formulation and shows its drift-correcting ability when re-observing previous objects. However, it omits the orientation of objects by treating the 2D bounding boxes as points. In [10], the authors address the localization task from only object observations in a prior semantic map by computing a matrix permanent. The second is SLAM-aided object detection [11, 12] and reconstruction [13, 14]: [11] develops a 2D object recognition system that is robust to viewpoint changes with the assistance of camera localization, while [12] performs confidence-growing 3D object detection using visual-inertial measurements. [13, 14] reconstruct the dense surfaces of 3D objects by fusing the point clouds from monocular and RGBD SLAM respectively. Finally, the third category is joint estimation of both camera and object poses: with pre-built bags of binary words, [15] localizes the objects in the dataset and corrects the map scale in turn. In [16, 17], the authors propose a semantic structure from motion (SfM) approach to jointly estimate the camera and objects while considering the interaction of scene components. However, none of these methods handles dynamic objects or makes full use of the 2D bounding box data (center, width, and height) and the three-dimensional object size. There are also existing works [18,19,20,21] that build a dense map and segment it with semantic labels. These works are beyond the scope of this paper, so we will not discuss them in detail.

Fig. 2.

Our whole semantic tracking system architecture.

3D Object Detection. Inferring object pose from images by deep learning approaches provides an alternative way to localize 3D objects. [22, 23] use shape priors to reason about the 3D object pose, where dense shape and wireframe models are used respectively. In [24], a voxel pattern is employed to detect the 3D pose of objects with specific visibility patterns. Similarly to object proposal approaches in 2D detection [6], [1] generates 3D proposals by utilizing depth information calculated from stereo images, while [2] exploits the ground plane assumption and additional segmentation features to produce 3D candidates; an R-CNN is then used for candidate scoring and object recognition. Such high-dimensional features used for proposal generation or model fitting are computationally expensive for both training and deployment. Instead of directly generating 3D boxes, [25] regresses object orientation and dimensions in separate stages; the 2D-3D box geometry constraints are then used to calculate the 3D object pose, but purely depending on the instance 2D box limits its performance in object-truncated cases.

In this work, we study the pros and cons of existing works and propose an integrated perception solution for autonomous driving that makes full use of the instance semantic prior and precise spatial-temporal feature correspondences to achieve robust and continuous state estimation for both the ego-camera and static or dynamic objects in diverse environments.

3 Overview

Our semantic tracking system has three main modules, as illustrated in Fig. 2. The first module performs 2D object detection and viewpoint classification (Sect. 4), where the object poses are roughly inferred based on the constraints between 2D box edges and 3D box vertexes. The second module is feature extraction and matching (Sect. 5). It projects all the inferred 3D boxes onto the 2D image to obtain the object contours and occlusion masks. Guided feature matching is then applied to obtain robust feature associations for both stereo and temporal images. In the third module (Sect. 6), we integrate all the semantic information and feature measurements into a tightly-coupled optimization approach. A kinematics model is additionally applied to cars to obtain consistent motion estimation.

4 Viewpoint Classification and 3D Box Inference

Our semantic measurement includes the 2D object box and the classified viewpoint. Based on these, the object pose can be roughly inferred instantly in closed form.

4.1 Viewpoint Classification

2D object detection can be implemented by state-of-the-art object detectors such as Faster R-CNN [6], YOLO [26], etc. We use Faster R-CNN in our system since it performs well on small objects. Only the left images are used for object detection due to the real-time requirement. The network architecture is illustrated in Fig. 2(a). Instead of the pure object classification in the original implementation of [6], we add sub-category classification in the final FC layers, which denotes the object's horizontal and vertical discrete viewpoints. As shown in Fig. 3a, we divide the continuous object observation angle into eight horizontal and two vertical viewpoints. With a total of 16 combinations of horizontal and vertical viewpoint classes, we can generate associations between edges of the 2D box and vertexes of the 3D box based on the assumption that the reprojection of the 3D bounding box tightly fits the 2D bounding box. These associations provide the essential conditions to build the four edge-vertex constraints for 3D box inference (Sect. 4.2) and to formulate our semantic measurement model (Sect. 6.2).

Compared with direct 3D regression, the well-developed 2D detection and classification networks are more robust over diverse scenarios. The proposed viewpoint classification task is easy to train and has high accuracy, even for small and extremely occluded objects.
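To make the discretization concrete, the following sketch (our illustration, not the authors' code; the bin boundaries, the pitch threshold, and the function name are assumptions) maps a continuous observation angle to one of the 8 × 2 viewpoint classes, whose index then selects the edge-vertex association used in Sect. 4.2. Such a mapping could, for instance, be used to generate viewpoint labels from annotated poses when training the classifier.

```python
# Illustrative sketch: discretizing the observation angle into the 8 horizontal
# and 2 vertical viewpoint classes. Bin boundaries and thresholds are assumptions.
import numpy as np

def classify_viewpoint(obj_center_cam, obj_yaw_cam):
    """obj_center_cam: object center in the camera frame (x right, y down, z forward);
    obj_yaw_cam: horizontal object orientation w.r.t. the camera frame (rad)."""
    # Horizontal viewpoint: object yaw relative to the viewing ray, in 8 bins of 45 deg.
    ray = np.arctan2(obj_center_cam[0], obj_center_cam[2])
    rel = (obj_yaw_cam - ray) % (2.0 * np.pi)
    horizontal = int(rel // (np.pi / 4.0))
    # Vertical viewpoint: roughly level view vs. looking down at the object roof.
    pitch = np.arctan2(-obj_center_cam[1], obj_center_cam[2])
    vertical = 1 if pitch < -0.05 else 0           # threshold is an assumption
    return horizontal, vertical                    # 16 combinations in total
```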

Fig. 3.
figure 3

(a) All the horizontal and vertical viewpoints for our classification; their combinations are enough to cover all observation cases in autonomous driving scenarios. (b) The 3D car in a specific viewpoint, where the object frame and the four vertexes corresponding to the four 2D box edges are denoted respectively.

4.2 3D Box Inference Based on Viewpoint

Given the 2D box described by four edges in the normalized image plane \( [u_\mathrm{min}, v_\mathrm{min}, u_\mathrm{max}, v_\mathrm{max}]\) and the classified viewpoint, we aim to infer the object pose based on four constraints between the 2D box edges and the 3D box vertexes, which is inspired by [25]. A 3D bounding box can be represented by its center position \( \mathbf {p} = [p_x, p_y, p_z]^T\) and horizontal orientation \(\theta \) with respect to the camera frame, together with the dimension prior \(\mathbf {d} = [d_x, d_y, d_z]^T\). For example, in the viewpoint presented in Fig. 3b, one of the 16 combinations in Fig. 3a (denoted in red), four vertexes are projected onto the 2D edges, and the corresponding constraints can be formulated as:

$$\begin{aligned} \left\{ \begin{array}{lr} u_\mathrm{min} = \pi \left( \mathbf {p} + \mathbf {R}_{\theta } \mathbf {C}_1 \mathbf {d}\right) _u, \, u_\mathrm{max} = \pi \left( \mathbf {p} + \mathbf {R}_{\theta } \mathbf {C}_2 \mathbf {d}\right) _u, \\ v_\mathrm{min} = \pi \left( \mathbf {p} + \mathbf {R}_{\theta } \mathbf {C}_3 \mathbf {d}\right) _v,\, v_\mathrm{max} = \pi \left( \mathbf {p} + \mathbf {R}_{\theta } \mathbf {C}_4 \mathbf {d}\right) _v, \end{array} \right. \end{aligned}$$
(1)

where \(\pi \) is the 3D projection warp function defined as \(\pi (\mathbf {p}) = [p_x/p_z, p_y/p_z]^T\), and \((\cdot )_u\) represents the u coordinate on the normalized image plane. We use \(\mathbf {R}_{\theta }\) to denote the rotation parameterized by the horizontal orientation \(\theta \) from the object frame to the camera frame. \(\mathbf {C}_{1:4}\) are four diagonal selection matrixes describing the relations between the object center and the four selected vertexes, which can be determined without ambiguity once we have the classified viewpoint. From the object frame defined in Fig. 3b, it is easy to see that:

$$\begin{aligned} \mathbf {C}_{1:4}= \begin{bmatrix} 0.5&0&0 \\ 0&0.5&0\\ 0&0&0.5 \end{bmatrix}, \begin{bmatrix} -0.5&0&0 \\ 0&0.5&0\\ 0&0&-0.5 \end{bmatrix}, \begin{bmatrix} 0.5&0&0 \\ 0&-0.5&0\\ 0&0&-0.5 \end{bmatrix}, \begin{bmatrix} -0.5&0&0 \\ 0&0.5&0\\ 0&0&0.5 \end{bmatrix}. \end{aligned}$$
(2)

With these four equations, the 4 DoF object pose can be solved directly given the dimension prior. This solving process takes trivial time compared with [25], which exhaustively tests all the valid edge-vertex configurations.
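For concreteness, the sketch below (our illustration under assumptions, not the authors' implementation) solves the four constraints of Eq. (1) numerically for \([p_x, p_y, p_z, \theta]\); the horizontal rotation is assumed to be about the camera's vertical (y) axis, and the closed-form derivation is replaced by a generic root finder.

```python
# Minimal numerical sketch of Eq. (1): recover the 4-DoF pose [p_x, p_y, p_z, theta]
# from the 2D box edges, the dimension prior d, and the viewpoint-selected C_1..C_4.
import numpy as np
from scipy.optimize import fsolve

def rot_y(theta):
    # Rotation about the (assumed vertical) y-axis of the camera frame.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def infer_box(box2d, C, d, init=(0.0, 0.0, 10.0, 0.0)):
    """box2d = [u_min, v_min, u_max, v_max] on the normalized image plane,
    C = list of the four 3x3 selection matrices, d = dimension prior."""
    u_min, v_min, u_max, v_max = box2d

    def residual(x):
        p, theta = x[:3], x[3]
        verts = [p + rot_y(theta) @ (Ci @ d) for Ci in C]
        return [verts[0][0] / verts[0][2] - u_min,   # u of vertex 1 hits the left edge
                verts[1][0] / verts[1][2] - u_max,   # u of vertex 2 hits the right edge
                verts[2][1] / verts[2][2] - v_min,   # v of vertex 3 hits the top edge
                verts[3][1] / verts[3][2] - v_max]   # v of vertex 4 hits the bottom edge

    return fsolve(residual, np.array(init))
```

Given a class dimension prior (e.g., an assumed car prior of roughly 1.6 m × 1.5 m × 4 m) and the \(\mathbf{C}_i\) selected by the classified viewpoint, the returned pose is only used to initialize the optimization in Sect. 6.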

We convert the complex 3D object detection problem into 2D detection, viewpoint classification, and a straightforward closed-form calculation. Admittedly, the solved pose is an approximate estimate, conditioned on the instance “tightness” of the 2D bounding box and the object dimension prior. Also, for some top-view cases, the reprojection of the 3D box does not strictly fit the 2D box, which can be observed from the top edge in Fig. 3b. However, for the almost horizontal or slightly looking-down viewpoints in autonomous driving scenarios, this assumption holds reasonably well. Note that our instance pose inference is only used to generate the object projection contour and occlusion mask for feature extraction (Sect. 5) and serves as an initial value for the follow-up maximum-a-posteriori (MAP) estimation, where the 3D object trajectory is further optimized by sliding-window-based feature correlation and object point cloud alignment.

5 Feature Extraction and Matching

We project the inferred 3D object boxes (Sect. 4.2) onto the stereo images to generate valid 2D contours. As Fig. 2(b) illustrates, we use masks of different colors to represent the visible part of each object (gray for the background). For occluded objects, we mask the occluded part as invisible according to the objects' 2D overlap and 3D depth relations. For truncated objects, which have fewer than four valid edge measurements and thus cannot be handled by the method in Sect. 4.2, we directly project the 2D box detected in the left image onto the right image. We extract ORB features [27] from both the left and right images in the visible area of each object and the background.

Stereo matching is performed by epipolar line searching. The depth range of object features is known from the inferred object pose, so we limit the search to a small range to achieve robust feature matching. For temporal matching, we first associate objects across successive frames by 2D box similarity score voting. The similarity score is weighted by the center distance and the shape similarity of the 2D boxes between successive images after compensating for the camera rotation. An object is treated as lost if its maximum similarity score with all the objects in the previous frame is less than a threshold. We note that there are more sophisticated association schemes such as probabilistic data association [9], but they are more suitable for avoiding hard decisions when re-visiting static object scenes than for the highly dynamic and non-repetitive scenes in autonomous driving. We subsequently match ORB features of the associated objects and background with the previous frame. Outliers are rejected by RANSAC with a local fundamental matrix test for each object and the background independently.
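As a concrete reading of the association step, the sketch below scores 2D boxes by center distance and shape similarity and keeps the best match above a threshold. The exponential scoring form, the weights, and the greedy matching are our assumptions rather than the paper's exact scheme.

```python
# Sketch of 2D box similarity voting for temporal object association.
# Weights, score shape, and greedy matching are illustrative assumptions.
import numpy as np

def box_similarity(box_prev, box_curr, w_center=0.6, w_shape=0.4):
    """Boxes are [u_min, v_min, u_max, v_max]; box_prev is assumed to be already
    warped into the current frame to compensate for the camera rotation."""
    c_prev = np.array([(box_prev[0] + box_prev[2]) / 2, (box_prev[1] + box_prev[3]) / 2])
    c_curr = np.array([(box_curr[0] + box_curr[2]) / 2, (box_curr[1] + box_curr[3]) / 2])
    wh_prev = np.array([box_prev[2] - box_prev[0], box_prev[3] - box_prev[1]])
    wh_curr = np.array([box_curr[2] - box_curr[0], box_curr[3] - box_curr[1]])
    center_score = np.exp(-np.linalg.norm(c_prev - c_curr) / np.mean(wh_prev))
    shape_score = np.exp(-np.linalg.norm(wh_prev - wh_curr) / np.mean(wh_prev))
    return w_center * center_score + w_shape * shape_score

def associate(prev_boxes, curr_boxes, threshold=0.5):
    """Greedy association; returns current index -> previous index (-1 means new/lost)."""
    matches = {}
    for j, cb in enumerate(curr_boxes):
        scores = [box_similarity(pb, cb) for pb in prev_boxes]
        best = int(np.argmax(scores)) if scores else -1
        matches[j] = best if (best >= 0 and scores[best] > threshold) else -1
    return matches
```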

6 Ego-Motion and Object Tracking

Beginning with the notation definition, we use \(\mathbf {s}^t_\mathrm{k} = \{{\mathbf {b}}^t_{kl}, \mathbf {b}^t_{kr}, l_{k}, {\mathbf {C}}^t_{k1:4}\}\) to denote the semantic measurement of the \(k^{th}\) object at time t, where \({\mathbf {b}}^t_{kl}, \mathbf {b}^t_{kr}\) are the observed left-top and right-bottom coordinates of the 2D bounding box respectively, \(l_{k}\) is the object class label, and \({\mathbf {C}}^t_{k1:4}\) are the four selection matrixes defined in Sect. 4.2. For measurements of a sparse feature anchored to one object or the background, we use \(^n{\mathbf {z}}^t_{k} = \{{^n}\mathbf {z}^t_{kl}, {^n}\mathbf {z}^t_{kr}\}\) to denote the stereo observations of the \(n^{th}\) feature on the \(k^{th}\) object at time t (\(k = 0\) for the static background), where \({^n}\mathbf {z}^t_{kl}, {^n}\mathbf {z}^t_{kr}\) are the feature coordinates on the normalized left and right image planes respectively. The states of the ego-camera and the \(k^{th}\) object are represented as \(^w\mathbf {x}^t_c = \{{^w}\mathbf {p}^t_c, {^w}\mathbf {R}^t_c\}\) and \({^w}\mathbf {x}^t_{ok} = \{{^w}\mathbf {p}^t_{ok}, \mathbf {d}_k, {^w}{{\theta }}^t_{ok}, v^t_{ok}, \delta ^t_{ok} \}\) respectively, where we use \(^w(\cdot )\), \((\cdot )_c\) and \((\cdot )_o\) to denote the world, camera and object frames. \(^w\mathbf {p}\) represents the position in the world frame. For object orientation, we only model the horizontal rotation \(^w{\theta }^t_{ok}\), instead of the \(\mathbb {SO}(3)\) rotation \(^w{\mathbf {R}}^t_{c}\) used for the ego-camera. \(\mathbf {d}_{k}\) is the time-invariant dimension vector of the \(k^{th}\) object, and \(v^t_{ok}, \delta ^t_{ok}\) are the speed and steering angle, which are only estimated for cars. For conciseness, we visualize the measurements and states at time t in Fig. 4.

Fig. 4.

Notation visualization.

Considering a general autonomous driving scene, we aim to continuously estimate the ego-motion of the onboard camera from time 0 to T: \(^w\mathcal {X}_c = \{{^w\mathbf {x}^t_c}\}_{t=0:T}\), to track the \(K_t\) 3D objects: \(^w\mathcal {X}_o = \left\{ {^w\mathbf {x}_{ok}} \right\} _{k=1:K_t}, {^w\mathbf {x}_{ok}} = \{{^w\mathbf {x}^t_{ok}}\}_{t=0:T}\), and to recover the 3D positions of the dynamic sparse features: \(\mathcal {F} = \{^k\mathbf {f}\}_{k=0:K_t}, {^k}\mathbf {f} = \{^k{\mathbf {f}}_n\}_{n=0:N_k}\) (note that here we use \(^k(\cdot )\) to denote the \(k^{th}\) object frame, in which the features are relatively static; \(k=0\) denotes the background world frame, in which the features are globally static), given the semantic measurements: \(\mathcal {S} = \{\mathbf {s}_k\}_{k=1:K_t},\, \mathbf {s}_k = \{\mathbf {s}^t_k\}_{t=0:T}\), and the sparse feature observations anchored to the \(k^{th}\) object: \(\mathcal {Z} = \{\mathbf {z}_k\}_{k=0:K_t}, \,\mathbf {z}_k = \{{^n}\mathbf {z}_{k}\}_{n=0:N_k}, \, {^n}\mathbf {z}_{k} = \{{^n}{\mathbf {z}}^t_{k}\}_{t= 0:T}\). We formulate our semantic object and camera ego-motion tracking as a nonlinear optimization problem derived from the probabilistic model.

6.1 Ego-Motion Tracking

Given the static background feature observation, the ego-motion can be solved via maximum likelihood estimation (MLE):

$$\begin{aligned} ^w\mathcal {X}_c, {^0}\mathbf {f}&= \mathop {\mathrm{arg\,max}}\limits _{{^w}\mathcal {X}_c, {^0}\mathbf {f}} \prod ^{N_0}_{n=0}\prod ^T_{t=0} p({^n}\mathbf {z}^t_0|^w\mathbf {x}^t_c, {^0}{\mathbf {f}_n}, {^w}\mathbf {x}^{0}_c) \; \end{aligned}$$
(3)
$$\begin{aligned}&= \mathop {\mathrm{arg\,max}}\limits _{^w\mathcal {X}_c, {^0}\mathbf {f}} \sum ^{N_0}_{n=0}\sum ^T_{t=0} \log p({^n}\mathbf {z}^t_0|^w\mathbf {x}^t_c, {^0}{\mathbf {f}_n}, {^w}\mathbf {x}^{0}_c) \; \end{aligned}$$
(4)
$$\begin{aligned}&= \mathop {\mathrm{arg\,min}}\limits _{^w\mathcal {X}_c, {^0}\mathbf {f}} \sum ^{N_0}_{n=0}\sum ^T_{t=0} \left\| r_{\mathcal {Z}}(^n{\mathbf {z}}^{t}_{0},\, ^w\mathbf {x}^t_c , {^0}{\mathbf {f}_n}) \right\| _{^0\mathbf {\sum }^{t}_{n}}^2. \end{aligned}$$
(5)

This is the typical SLAM or SfM approach. The camera poses and the background point cloud are estimated conditionally on the first state. As Eq. 3 shows, the negative log probability of a measurement residual is proportional to its Mahalanobis norm \(\left\| \mathbf {r} \right\| _{\mathbf {\sum }}^2 = \mathbf {r}^T {\mathbf {\sum }}^{-1}{\mathbf {r}}\). The MLE is thus converted into a nonlinear least squares problem; this process is also known as bundle adjustment (BA).
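As an illustration of one term in Eq. (5), the sketch below computes the whitened stereo reprojection residual of a single background feature. Representing poses as rotation/position pairs and reducing the stereo extrinsics to a pure x-baseline are simplifying assumptions on our part.

```python
# Sketch of a single background stereo reprojection residual (one term of Eq. (5)).
import numpy as np

def project(p_cam):
    # pi(p) = [p_x / p_z, p_y / p_z] on the normalized image plane.
    return p_cam[:2] / p_cam[2]

def background_residual(R_wc, p_wc, f_w, z_left, z_right, baseline, sqrt_info):
    """R_wc, p_wc: camera rotation/position in the world frame; f_w: background point
    in the world frame; z_left/z_right: normalized stereo observations; baseline:
    assumed stereo baseline along x; sqrt_info: square-root information matrix."""
    f_c = R_wc.T @ (f_w - p_wc)                      # world -> left camera frame
    f_r = f_c - np.array([baseline, 0.0, 0.0])       # left -> right camera frame
    r = np.hstack([project(f_c) - z_left, project(f_r) - z_right])
    return sqrt_info @ r                             # whitened 4-vector residual
```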

6.2 Semantic Object Tracking

After the camera poses are solved, the object state at each time t can be solved based on the dimension prior and the instance semantic measurements (Sect. 4.2). We assume the object is a rigid body, which means the features anchored to it are fixed with respect to the object frame. Therefore, the temporal states of the object are correlated if we have continuous object feature observations. The states of different objects are conditionally independent given the camera poses, so we can track all the objects in parallel and independently. For the \(k^{th}\) object, we have a dimension prior distribution \(p(\mathbf {d}_{k})\) for each class label. We assume the detection results and feature measurements for each object at each time are independent and Gaussian distributed. According to Bayes’ rule, we have the following maximum-a-posteriori (MAP) estimation:

$$\begin{aligned} ^w\mathbf {x}_{ok}, {^k}\mathbf {f}&= \mathop {\mathrm{arg\,max}}\limits _{^w\mathbf {x}_{ok}, {^k}\mathbf {f}} p(^w\mathbf {x}_{ok}, {^k}\mathbf {f} \,|\, {^w}\mathbf {x}_{c}, \mathbf {z}_k, \mathbf {s}_k) \end{aligned}$$
(6)
$$\begin{aligned}&= \mathop {\mathrm{arg\,max}}\limits _{^w\mathbf {x}_{ok}, {^k}\mathbf {f}} p(\mathbf {z}_k, \mathbf {s}_k | {^w}\mathbf {x}_{c}, {^w}\mathbf {x}_{ok}, {^k}\mathbf {f}) p(\mathbf {d}_{k}) \; \end{aligned}$$
(7)
$$\begin{aligned}&= \mathop {\mathrm{arg\,max}}\limits _{^w\mathbf {x}_{ok}, {^k}\mathbf {f}} p(\mathbf {z}_k | {^w}\mathbf {x}_{c}, {^w}\mathbf {x}_{ok}, {^k}\mathbf {f}) p(\mathbf {s}_k | {^w}\mathbf {x}_{c}, {^w}\mathbf {x}_{ok}) p(\mathbf {d}_{k}) \end{aligned}$$
(8)
$$\begin{aligned} = \mathop {\mathrm{arg\,max}}\limits _{^w\mathbf {x}_{ok}, {^k}\mathbf {f}} \prod ^T_{t=0} \prod ^{N_k}_{n=0}&p({^n}{\mathbf {z}}^t_k | {^w}\mathbf {x}^t_{c}, {^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}}_n) p(\mathbf {s}^t_k | {^w}\mathbf {x}^t_{c}, {^w}\mathbf {x}^t_{ok}) p({^w}\mathbf {x}^{t-1}_{ok} | {^w}\mathbf {x}^{t}_{ok}) p(\mathbf {d}_{k}). \end{aligned}$$
(9)

Similarly to Eqs. 3-5, we convert the MAP estimation into a nonlinear optimization problem:

$$\begin{aligned} ^w\mathbf {x}_{ok}, {^k}\mathbf {f}&= \mathop {\mathrm{arg\,min}}\limits _{^w\mathbf {x}_{ok}, {^k}\mathbf {f}} \left\{ \sum ^T_{t=0} \sum ^{N_k}_{n=0} \left\| r_{\mathcal {Z}}(^n{\mathbf {z}}^{t}_{k},\, {^w}\mathbf {x}^t_c ,{^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}_n}) \right\| _{^k\mathbf {\sum }^{t}_{n}}^2 \right. + \left\| r_{\mathcal {P}}(d^l_k, \mathbf {d}_{k}) \right\| _{\mathbf {\sum }^{l}}^2 \nonumber \\&\,\,+ \sum ^{T}_{t=1} \left\| r_{\mathcal {M}}({^w}\mathbf {x}^{t}_{ok}, {^w}\mathbf {x}^{t-1}_{ok}) \right\| _{\mathbf {\sum }^{t}_{k}}^2 \left. +\sum ^T_{t=0} \left\| r_{\mathcal {S}}({\mathbf {s}}^{t}_{k},\, {^w}\mathbf {x}^t_c ,{^w}\mathbf {x}^t_{ok}) \right\| _{\mathbf {\sum }^{t}_{k}}^2 \right\} , \end{aligned}$$
(10)

where we use \(r_\mathcal {Z}\), \(r_\mathcal {P}\), \(r_\mathcal {M}\), and \(r_\mathcal {S}\) to denote the residuals of the feature reprojection, dimension prior, object motion model, and semantic bounding box reprojection respectively. \(\mathbf {\sum }\) is the corresponding covariance matrix of each measurement. We thus formulate our 3D object tracking problem as a dynamic object BA approach which fully exploits the object dimension and motion priors and enforces temporal consistency. The maximum-a-posteriori estimate is obtained by minimizing the sum of the Mahalanobis norms of all the residuals.
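The structure of Eq. (10) maps directly onto a sliding-window least-squares solver. The sketch below is purely organizational (the state packing and the residual callables are placeholders we introduce for illustration, not the paper's solver); each callable is expected to return a residual already whitened by the square root of its inverse covariance.

```python
# Organizational sketch: stacking the four residual blocks of Eq. (10) for one object.
import numpy as np
from scipy.optimize import least_squares

def object_cost(x, unpack, cam_poses, feat_obs, boxes, dim_prior, fns):
    """x stacks the per-frame object states and anchored feature positions;
    unpack(x) -> (states, features); fns holds 'feature', 'semantic',
    'motion' and 'prior' residual callables (all whitened)."""
    states, features = unpack(x)
    r = []
    for t, (cam, box) in enumerate(zip(cam_poses, boxes)):
        for n, z in enumerate(feat_obs[t]):                    # r_Z terms
            r.append(fns['feature'](z, cam, states[t], features[n]))
        r.append(fns['semantic'](box, cam, states[t]))         # r_S terms
        if t > 0:                                              # r_M terms
            r.append(fns['motion'](states[t], states[t - 1]))
    r.append(fns['prior'](states, dim_prior))                  # r_P term
    return np.hstack(r)

# result = least_squares(object_cost, x0,
#                        args=(unpack, cams, feats, boxes, prior, fns))
```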

Sparse Feature Observation. We extend the projective geometry between static features and camera poses to dynamic features and object poses. Since the anchored features are relatively static with respect to the object frame, the object poses that share feature observations can be connected in a factor graph. For each feature observation, the residual is the reprojection error between the predicted feature position and the actual feature observations on the left and right images:

$$\begin{aligned}&r_{\mathcal {Z}}(^n\mathbf {z}^{t}_{k},\, {^w}\mathbf {x}^t_c ,{^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}_n}) \end{aligned}$$
(11)
$$\begin{aligned} =&\begin{bmatrix} \pi \left( h^{-1}({^w}\mathbf {x}^t_c,\, h({^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}_n}))\right) - {^n}{\mathbf {z}}^{t}_{kl} \\ \pi \left( h({^{r}}\mathbf {x}_l, h^{-1}({^w}\mathbf {x}^t_c,\, h({^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}_n})))\right) - {^n}{\mathbf {z}}^{t}_{kr} \end{bmatrix}, \end{aligned}$$
(12)

where we use \(h(\mathbf {x}, p)\) to denote applying a 3D rigid body transform \(\mathbf {x}\) to a point p. For example, \(h\left( {^w}\mathbf {x}^t_{ok}, {^k}{\mathbf {f}_n}\right) \) transforms the \(n^{th}\) feature point \({^k}{\mathbf {f}_n}\) from the object frame to the world frame, and \(h^{-1}(\mathbf {x}, p)\) is the corresponding inverse transform. \({^r}\mathbf {x}_l\) denotes the extrinsic transform of the stereo camera, which is calibrated offline.
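A minimal sketch of Eq. (12) follows; representing each pose as a rotation/position pair is an implementation choice on our part, not the paper's parameterization.

```python
# Sketch of the dynamic-feature residual of Eq. (12): an object-frame feature is
# mapped to the world via the object pose, then into the left and right cameras.
import numpy as np

def transform(pose, point):          # h(x, p): local frame -> parent frame
    R, p = pose
    return R @ point + p

def inv_transform(pose, point):      # h^{-1}(x, p): parent frame -> local frame
    R, p = pose
    return R.T @ (point - p)

def feature_residual(z_left, z_right, cam_pose_w, obj_pose_w, f_obj, T_rl):
    """T_rl: left-to-right stereo extrinsic transform as an (R, p) pair."""
    f_w = transform(obj_pose_w, f_obj)            # object frame -> world
    f_l = inv_transform(cam_pose_w, f_w)          # world -> left camera
    f_r = transform(T_rl, f_l)                    # left camera -> right camera
    pi = lambda p: p[:2] / p[2]
    return np.hstack([pi(f_l) - z_left, pi(f_r) - z_right])
```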

Semantic 3D Object Measurement. Benefiting from the viewpoint classification, we know the relations between the edges of the 2D bounding box and the vertexes of the 3D bounding box. Assuming the 2D bounding box tightly fits the object boundary, each edge intersects the reprojection of one 3D box vertex. These relations are encoded as the four selection matrixes, one per 2D edge. The semantic residual is then the reprojection error between the predicted 3D box vertexes and the detected 2D box edges:

$$\begin{aligned}&r_{\mathcal {S}}({\mathbf {s}}^{t}_{k},\, {^w}\mathbf {x}^t_c ,{^w}\mathbf {x}^t_{ok}, {\mathbf {d}_{k}}) = \begin{bmatrix} \pi \left( h_{\mathbf {C}_1}\right) _u - ({\mathbf {b}}^{t}_{kl})_u \\ \pi \left( h_{\mathbf {C}_2}\right) _u - ({\mathbf {b}}^{t}_{kr})_u \\ \pi \left( h_{\mathbf {C}_3}\right) _v - ({\mathbf {b}}^{t}_{kl})_v \\ \pi \left( h_{\mathbf {C}_4}\right) _v - ({\mathbf {b}}^{t}_{kr})_v \end{bmatrix}, \end{aligned}$$
(13)
$$\begin{aligned}&h_{\mathbf {C}_i} = h^{-1}({^w}\mathbf {x}^t_c,\, h({^w}\mathbf {x}^t_{ok}, \mathbf {C}_i{\mathbf {d}^l_k})), \end{aligned}$$
(14)

where we use \(h_{\mathbf {C}_i}\) to denote the vertex specified by the corresponding selection matrix \(\mathbf {C}_i\) of the 3D bounding box, expressed in the camera frame before projection. This factor directly connects the object pose with its dimensions at each instant. Note that we only perform 2D detection on the left image due to the real-time requirement.
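A minimal sketch of Eqs. (13)-(14), again with the assumed rotation/position pose representation used in the previous sketch:

```python
# Sketch of the semantic residual of Eq. (13): project the four selected 3D box
# vertexes into the left camera and compare them against the 2D box edges.
import numpy as np

def semantic_residual(box_lt, box_rb, cam_pose_w, obj_pose_w, dims, C):
    """box_lt/box_rb: left-top and right-bottom 2D box corners (normalized plane);
    cam_pose_w/obj_pose_w: (R, p) pairs; dims: dimension prior; C: selection matrices."""
    Rc, pc = cam_pose_w
    Ro, po = obj_pose_w
    pi = lambda p: p[:2] / p[2]
    # Object-frame vertex -> world -> left camera, then project.
    verts = [pi(Rc.T @ ((Ro @ (Ci @ dims) + po) - pc)) for Ci in C]
    return np.array([verts[0][0] - box_lt[0],    # u of vertex 1 vs left edge
                     verts[1][0] - box_rb[0],    # u of vertex 2 vs right edge
                     verts[2][1] - box_lt[1],    # v of vertex 3 vs top edge
                     verts[3][1] - box_rb[1]])   # v of vertex 4 vs bottom edge
```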

Vehicle Motion Model. To achieve consistent estimation of motion and orientation for the vehicle class, we employ the kinematics model introduced in [28]. The vehicle state at time t can be predicted with the state at \({t-1}\):

$$\begin{aligned} {^w}\hat{\mathbf {x}}^{t}_{ok} = \begin{bmatrix} {^w}\mathbf {p}^t_{ok} \\ {^w}{{\theta }}^t_{ok} \\ \delta ^{t}_{ok}\\ v^{t}_{ok} \end{bmatrix}&= \begin{bmatrix} \mathbf {I}_{3 \times 3} & \mathbf {0} & \mathbf {0} & \varvec{\varLambda } \\ \mathbf {0} & 1 & 0 & \frac{\tan (\delta ) \Delta t}{L} \\ \mathbf {0} & 0 & 1 & 0 \\ \mathbf {0} & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} {^w}\mathbf {p}^{t-1}_{ok} \\ {^w}{{\theta }}^{t-1}_{ok} \\ \delta ^{t-1}_{ok} \\ v^{t-1}_{ok} \end{bmatrix}, \quad \varvec{\varLambda } = \begin{bmatrix} \cos (\theta )\Delta t \\ \sin (\theta )\Delta t\\ 0 \end{bmatrix}, \end{aligned}$$
(15)
$$\begin{aligned}&r_{\mathcal {M}}({^w}{\mathbf {x}}^{t}_{ok},{^w}\mathbf {x}^{t-1}_{ok}) = {^w}{\mathbf {x}}^{t}_{ok} - {^w}\hat{\mathbf {x}}^{t}_{ok}, \end{aligned}$$
(16)

where L is the length of the wheelbase, which can be parameterized by the object dimensions. The orientation of the car is always parallel to its moving direction. We refer readers to [28] for further derivations. Thanks to this kinematics model, we can track the vehicle velocity and orientation continuously, which provides rich information for behavior and path planning in autonomous driving. For other classes such as pedestrians, we directly use a simple constant-velocity model to enhance smoothness.
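A minimal sketch of the prediction in Eq. (15) and the residual in Eq. (16); the wheelbase-to-length ratio and the planar layout of the world frame are assumptions on our part.

```python
# Sketch of the bicycle-model prediction (Eq. (15)) and motion residual (Eq. (16)).
import numpy as np

def predict_state(p, theta, delta, v, dims, dt):
    """p: 3D position; theta: heading; delta: steering angle; v: speed."""
    L = 0.6 * dims[2]                      # wheelbase from object length (assumed ratio)
    p_pred = p + v * dt * np.array([np.cos(theta), np.sin(theta), 0.0])
    theta_pred = theta + v * dt * np.tan(delta) / L
    return p_pred, theta_pred, delta, v    # steering angle and speed kept constant

def motion_residual(state_t, state_tm1, dims, dt):
    p, theta, delta, v = state_t
    pred = predict_state(*state_tm1, dims, dt)
    return np.hstack([p - pred[0], theta - pred[1], delta - pred[2], v - pred[3]])
```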

Fig. 5.

Point cloud alignment.

Point Cloud Alignment.

After minimizing all the residuals, we obtain the MAP estimate of the object pose based on the dimension prior. However, the pose estimate might be biased due to the difference between the prior and the actual object size (see Fig. 5). We therefore align the 3D box to the recovered point cloud, which is unbiased thanks to the accurate stereo extrinsic calibration. We minimize the distance between all 3D points and their anchored 3D box surfaces:

$$\begin{aligned} ^w{\mathbf {x}}^{t}_{ok} = \mathop {\mathrm{arg\,min}}\limits _{^w\mathbf {x}_{ok}} \sum ^{N_k}_{n=0} d({^w\mathbf {x}}^{t}_{ok}, ^k\mathbf {f}_n), \end{aligned}$$
(17)

where \(d({^w\mathbf {x}}^{t}_{ok}, ^k\mathbf {f}_n)\) denotes the distance of the \(n^{th}\) feature to its corresponding observed surface. After all the above information is tightly fused together, we obtain consistent and accurate pose estimates for both static and dynamic objects.
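As a rough illustration of Eq. (17), the sketch below refines the box position so that the recovered object-frame points lie on (or inside) the box; penalizing only points that fall outside the box is our simplification of the surface distance \(d(\cdot)\), not the paper's exact formulation.

```python
# Sketch of point cloud alignment (Eq. (17)): shift the box so the recovered
# object-frame feature points agree with the box surfaces.
import numpy as np
from scipy.optimize import least_squares

def box_surface_residual(offset, points_obj, dims):
    """offset: 3D correction of the box center in the object frame;
    points_obj: Nx3 recovered feature points in the object frame."""
    shifted = points_obj - offset
    # Distance outside the axis-aligned box of half-size dims/2 (zero if inside).
    return np.maximum(np.abs(shifted) - dims / 2.0, 0.0).ravel()

def align_box(points_obj, dims):
    result = least_squares(box_surface_residual, np.zeros(3), args=(points_obj, dims))
    return result.x    # correction applied to the estimated object position
```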

7 Experimental Results

We evaluate the performance of the proposed system on both the KITTI [29, 30] and Cityscapes [31] datasets over diverse scenarios. The mature 2D detection and classification module generalizes well to unseen scenes, and the follow-up nonlinear optimization is data-independent. Our system therefore produces consistent results on different datasets. The quantitative evaluation shows that our semantic 3D object and ego-motion tracking system outperforms the isolated state-of-the-art solutions.

Fig. 6.

Continuous tracking results over long trajectories. (a) shows a roughly 700 m closed-loop trajectory including both static and dynamic cars. The right top and right bottom are enlarged start and end views respectively. The car in the blue circle is tracked over 200 m; its trajectory can be found in the left top view. (b) shows a scenario which mainly contains dynamic and truncated cars. The estimated trajectory, velocity, and reprojected 2D image are presented on the left and right respectively. Note that the LiDAR point cloud is only for reference in all the top views.

7.1 Qualitative Results over Diverse Scenarios

Firstly, we test the system on long challenging trajectories in the KITTI dataset, which contains 1240 \(\times \) 376 stereo color and grayscale images captured at 10 Hz. We perform 2D detection on the left color images and extract 500 (for the background) and 100 (for each object) ORB features [27] on both the left and right grayscale images. Figure 6a shows a 700 m closed-loop trajectory which includes both static and dynamic cars. We use a red cone and line to represent the camera pose and trajectory, and CAD models and lines of various colors to represent different cars and their trajectories; all the observed cars are visualized in the top view. Currently, our system performs object tracking in a memoryless manner, so a re-observed object is treated as a new one, which can also be seen in the enlarged start and end views in Fig. 6a. In Fig. 6b, the black car is continuously truncated over a long time, which is an unobservable case for instance 3D box inference (Sect. 4.2). However, we can still track its pose accurately due to the temporal feature constraints and dynamic point cloud alignment.

We also demonstrate the system performance on different datasets over more scenarios which include concentrated cars, crossroads, and dynamic roads. All the reprojected images and the corresponding top views are shown in Fig. 7.

Fig. 7.

Qualitative examples over diverse scenarios. From left to right: concentrated cars; crossroads including both cars and pedestrians (note that we do not solve orientation for pedestrians); dynamic cars. The top two rows are results on the KITTI dataset, and the bottom two rows show results on the Cityscapes dataset.

7.2 Quantitative Evaluation

Since there are currently no available integrated academic solutions for both ego-motion and dynamic object tracking, we conduct quantitative evaluations of the camera and object poses by comparing with the isolated state-of-the-art works: ORB-SLAM2 [4] and 3DOP [1].

Fig. 8.

(a) RPE comparison. The left and right plots are the results on the 0929_0004 and 1003_0047 sequences from the KITTI raw dataset respectively. (b) RMSE of ATE comparison on ten long KITTI raw sequences.

Camera Pose Evaluation. Benefiting from the semantic prior, our system can perform robust camera estimation in dynamic environments. We evaluate the accuracy of the camera odometry by comparing the relative pose error (RPE) [29] and the RMSE of the absolute trajectory error (ATE) [32] with ORB-SLAM2 [4] in its stereo setting. Two sequences from the KITTI raw dataset, 0929_0004 and 1003_0047, which include dynamic objects, are used for the RPE comparison. The relative translation and rotation errors are presented in Fig. 8(a). Ten long sequences of the KITTI raw dataset are additionally used to evaluate the RMSE of the ATE, as detailed in Fig. 8(b). It can be seen that our estimation shows almost the same accuracy as [4] in less dynamic scenarios (0926_0051, etc.) due to the similar bundle adjustment approaches. However, our system still works well in highly dynamic environments, while ORB-SLAM2 shows non-trivial errors due to the introduction of many outliers (1003_0047, 0929_0004, etc.). This experiment shows that the semantic-aided object-aware property is essential for camera pose estimation, especially in dynamic autonomous driving scenarios.

Object Localization Evaluation. We evaluate the car localization performance on the KITTI tracking dataset since it provides sequential stereo images with labeled 3D object boxes. According to the occlusion level and 2D box height, we divide all the detected objects into three regimes: easy, moderate, and hard, and then evaluate them separately. To evaluate the localization accuracy of the proposed estimator, we collect statistics of the objects' average position error. By setting a series of Intersection-over-Union (IoU) thresholds from 0 to 1, we calculate the true positive (TP) rate and the average error between the estimated positions of the TPs and the ground truth at each instance frame for each threshold. The average position error (in %) vs. TP rate curves are shown in Fig. 9, where we use blue, red, and yellow lines to represent the statistics for easy, moderate, and hard objects. It can be seen that the average error for half of the true positive objects is below 5%. Over all the estimated results, the average position errors are 5.9%, 6.1%, and 6.3% for easy, moderate, and hard objects respectively.

Fig. 9.

Average position error vs TP rate results. We set 40 discrete IoU thresholds from 0 to 1, then count the TP rate and the average position error for the true positives for each IoU threshold.
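The curve in Fig. 9 can be reproduced with a short sweep over IoU thresholds. The sketch below is our reading of the protocol rather than the authors' evaluation script; the hypothetical inputs are per-detection IoUs with the matched ground truth and per-detection relative position errors.

```python
# Sketch of the average-position-error vs. TP-rate curve of Fig. 9.
import numpy as np

def tp_curve(ious, pos_errors, num_thresholds=40):
    """ious: per-detection IoU with the matched ground truth; pos_errors:
    per-detection relative position error (e.g. |p_est - p_gt| / |p_gt|)."""
    ious, pos_errors = np.asarray(ious), np.asarray(pos_errors)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    tp_rate, avg_err = [], []
    for th in thresholds:
        tp = ious > th                          # detections counted as true positives
        tp_rate.append(tp.mean())
        avg_err.append(pos_errors[tp].mean() if tp.any() else np.nan)
    return thresholds, np.array(tp_rate), np.array(avg_err)
```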

To compare with baselines, we evaluate the average precision (AP) of bird's eye view boxes and 3D boxes against 3DOP [1], the state-of-the-art stereo-based 3D object detection method. We set the IoU thresholds to 0.25 and 0.5 for both bird's eye view and 3D boxes. Note that we use the oriented box overlap, so the object orientation is also implicitly evaluated in these two metrics. We use S, M, F, P to represent the semantic measurement, motion model, feature observation, and point cloud alignment respectively. As listed in Table 1, the semantic measurement serves as the basis of the 3D object estimation. Adding the feature observation noticeably increases the AP for easy (near) objects due to the large feature extraction area (the same holds for adding point cloud alignment), while adding the motion model helps the hard (far) objects since it “smooths” the non-trivial 2D detection noise for small objects. After integrating all these cues, we obtain accurate 3D box estimates for both near and far objects. It can be seen that our integrated method shows more accurate results for all the APs in bird's eye view and 3D box at the 0.25 IoU threshold. Since we do not regress the object size, our performance is slightly worse than 3DOP in the 3D box comparison at 0.5 IoU. However, we stress that our method can efficiently track both static and dynamic 3D objects with temporal smoothness and motion consistency, which is essential for continuous perception and planning in autonomous driving.

Table 1. Average precision (in %) of bird’s eye view and 3D boxes comparison.

8 Conclusions and Future Work

In this paper, we propose a 3D object and ego-motion tracking system for autonomous driving. We integrate the instance semantic prior, sparse feature measurements, and a kinematics motion model into a tightly-coupled optimization framework. Our system can robustly estimate the camera pose without being affected by dynamic objects, and continuously track the states of each observed object while recovering its dynamic sparse features. Demonstrations over diverse scenarios and different datasets illustrate the practicability of the proposed system. Quantitative comparisons with state-of-the-art approaches demonstrate the accuracy of both our camera estimation and object localization.

In the future, we plan to improve the object temporal correlation by fully exploiting the dense visual information. Currently, the camera and object tracking are performed sequentially in our system. We also plan to model them in a fully-integrated optimization framework so that the estimates of the camera and the dynamic objects can benefit from each other.