Improving the Accuracy of Stereo Visual Odometry Using Visual Illumination Estimation

In the absence of reliable and accurate GPS, visual odometry (VO) has emerged as an effective means of estimating the egomotion of robotic vehicles. Like any dead-reckoning technique, VO suffers from unbounded accumulation of drift error over time, but this accumulation can be limited by incorporating absolute orientation information from, for example, a sun sensor. In this paper, we leverage recent work on visual outdoor illumination estimation to show that estimation error in a stereo VO pipeline can be reduced by inferring the sun position from the same image stream used to compute VO, thereby gaining the benefits of sun sensing without requiring a dedicated sun sensor or the sun to be visible to the camera. We compare sun estimation methods based on hand-crafted visual cues and Convolutional Neural Networks (CNNs) and demonstrate our approach on a combined 7.8 km of urban driving from the popular KITTI dataset, achieving up to a 43% reduction in translational average root mean squared error (ARMSE) and a 59% reduction in final translational drift error compared to pure VO alone.


Motivation, Problem Statement, and Related Work
In the absence of reliable and accurate GPS, visual odometry (VO) has emerged as an effective means of estimating the egomotion of robotic vehicles as they navigate through their environment. While VO is generally less prone to drift than other dead-reckoning techniques such as wheel odometry, any dead-reckoning algorithm will inevitably accumulate drift over time due to the compounding of small estimation errors. Indeed, VO suffers from superlinear growth of drift error with distance travelled, mainly due to error in the orientation estimates [14]. Fortunately, the addition of absolute orientation information from, for example, a sun sensor can restrict this growth to be linear [14].
The sun is an appealing source of absolute orientation information since it is readily detectable and its apparent motion through the sky is well characterized in ephemeris tables. The benefits of deriving orientation information from a sun sensor have been successfully demonstrated in planetary analogue environments [6,11] as well as on board the Mars Exploration Rovers (MERs) [3,13].
In particular, Lambert et al. [11] showed that by incorporating sun sensor and inclinometer data directly in a stereo VO pipeline, the accumulated drift error can be greatly reduced compared to pure VO alone.
In this work, we seek to answer the question of whether similar reductions in stereo VO drift can be obtained solely from the image stream already being used to compute VO. The main idea here is that by reasoning over more than just the geometric information available from a standard RGB camera, we can improve existing VO techniques without needing to rely on a dedicated sun sensor or specially oriented camera. Recently, Lalonde et al. [10] demonstrated that the likely direction of the sun can be estimated from a single RGB image using a combination of weak visual cues such as shadows and a model of the sky [15]. We improve the accuracy and reliability of this technique by incorporating information from the VO estimate itself, and combine it with a modified version of the sun-sensor-augmented stereo VO pipeline developed by Lambert et al. [11] to show that VO drift error can be reduced in this way. We also investigate the use of a recent machine learning approach to sun direction estimation, which makes use of a Convolutional Neural Network (CNN) to predict the azimuth angle of the sun [12]. We present experimental results demonstrating our approach on a combined 7.8 km of urban driving from the popular KITTI dataset [7], achieving up to a 59% reduction in final translational drift error and a 43% reduction in translational average root mean squared error (ARMSE) compared to pure VO.

Technical Approach
We adopt a sliding window stereo VO technique that has been used in a number of successful mobile robotics applications [2,5,8,9]. While this technique is not the absolute state of the art, it serves as an easily implementable baseline system against which to evaluate our use of visual illumination estimation in the VO pipeline. We stress that our main idea is not tied to any specific VO technique and could be used in any VO system where RGB images are available.
Our goal is to estimate a window of SE(3) poses $\{\mathbf{T}_{k+1,b}, \ldots, \mathbf{T}_{k+N,b}\}$ expressed in a base coordinate frame $\underrightarrow{\mathcal{F}}_b$, which we choose to be the first pose in each window. Our VO pipeline tracks keypoints across pairs of stereo images and computes an initial guess for each pose in the window using frame-to-frame point cloud alignment, which it then refines using a local bundle adjustment over the window. Finally, the estimated camera trajectory can be transformed into a desired world coordinate frame $\underrightarrow{\mathcal{F}}_w$ given the transformation $\mathbf{T}_{b,w}$, which can be obtained from the bundle adjustment solution of the previous window. As we discuss in Section 2.3, we select the initial pose $\mathbf{T}_{1,w}$ to be the first GPS ground truth pose such that $\underrightarrow{\mathcal{F}}_w$ is a local East-North-Up (ENU) coordinate system.
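The change of frame described above can be sketched in a few lines of NumPy. This is only an illustration under assumed conventions: poses are 4x4 homogeneous matrices, $\mathbf{T}_{k,b}$ maps points from the base frame into camera frame $k$, and the helper names (`make_pose`, `rot_z`, `window_to_world`) and numerical values are hypothetical.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def rot_z(angle_rad):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def window_to_world(T_kb_list, T_bw):
    """Compound each window pose with the base pose: T_{k,w} = T_{k,b} T_{b,w}."""
    return [T_kb @ T_bw for T_kb in T_kb_list]

# Illustrative values only: a base pose and a three-pose window.
T_bw = make_pose(rot_z(np.pi / 2), [1.0, 0.0, 0.0])
T_kb_list = [make_pose(rot_z(0.1 * k), [0.5 * k, 0.0, 0.0]) for k in range(1, 4)]
T_kw_list = window_to_world(T_kb_list, T_bw)
```

Because each window is expressed relative to its own first pose, only $\mathbf{T}_{b,w}$ needs to be carried forward from one window to the next.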

Observation Model
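We assume that our stereo images have been de-warped and rectified in a preprocessing step, and model the stereo camera as a pair of perfect pinhole cameras with focal lengths $f_u$, $f_v$ and principal points $(c_u, c_v)$, separated by a fixed and known baseline $\ell$. If we take $\mathbf{p}^j_b$ to be the homogeneous 3D coordinates of keypoint $j$, expressed in our chosen base frame $\underrightarrow{\mathcal{F}}_b$, we can transform the keypoint into the camera frame at pose $k$ to obtain $\mathbf{p}^j_k = \mathbf{T}_{k,b}\,\mathbf{p}^j_b = [\,p^j_{k,x} \;\; p^j_{k,y} \;\; p^j_{k,z} \;\; 1\,]^T$. Our observation model $\mathbf{g}(\cdot)$ can then be formulated as

$$\mathbf{y}_{k,j} = \mathbf{g}\left(\mathbf{p}^j_k\right) = \begin{bmatrix} u \\ v \\ d \end{bmatrix} = \begin{bmatrix} f_u \, p^j_{k,x} / p^j_{k,z} + c_u \\ f_v \, p^j_{k,y} / p^j_{k,z} + c_v \\ f_u \, \ell / p^j_{k,z} \end{bmatrix}, \tag{1}$$

where $(u, v)$ are the pixel coordinates in the left image and $d$ is the disparity.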
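As a concrete illustration of this rectified stereo pinhole model, the following sketch maps a homogeneous camera-frame point to left-image pixel coordinates and disparity. The function name `stereo_observation` and the intrinsic values are assumptions chosen for the example, not calibration values from the experiments.

```python
import numpy as np

def stereo_observation(p_k, f_u, f_v, c_u, c_v, baseline):
    """Map a homogeneous camera-frame point [x, y, z, 1] to [u, v, d]."""
    x, y, z = p_k[:3]
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = f_u * x / z + c_u      # horizontal pixel coordinate in the left image
    v = f_v * y / z + c_v      # vertical pixel coordinate in the left image
    d = f_u * baseline / z     # disparity between left and right images
    return np.array([u, v, d])

# Hypothetical intrinsics (pixels) and baseline (metres), for illustration only.
y_kj = stereo_observation(np.array([1.0, 0.5, 10.0, 1.0]),
                          f_u=700.0, f_v=700.0, c_u=600.0, c_v=180.0,
                          baseline=0.54)
print(y_kj)  # approximately [670.0, 215.0, 37.8]
```

Note that disparity is inversely proportional to depth, so distant keypoints constrain translation only weakly.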

Sliding-window Visual Odometry
We use the open-source libviso2 package [8] to detect and track keypoints between stereo image pairs. Based on these keypoint tracks, a three-point Random Sample Consensus (RANSAC) algorithm [4] generates an initial guess of the interframe motion and rejects outlier keypoint tracks by thresholding their reprojection error. We compound these pose-to-pose transformation estimates through our chosen window and refine them using a local bundle adjustment, which we solve using the nonlinear least-squares solver Ceres [1]. The objective function to be minimized can be written as

$$\mathcal{J} = \sum_{k} \sum_{j} \mathbf{e}_{\mathbf{y}_{k,j}}^T \, \mathbf{R}_{\mathbf{y}_{k,j}}^{-1} \, \mathbf{e}_{\mathbf{y}_{k,j}}, \tag{2}$$

where $\mathbf{e}_{\mathbf{y}_{k,j}} = \hat{\mathbf{y}}_{k,j} - \mathbf{y}_{k,j}$ is the reprojection error of keypoint $j$ for camera pose $k$, $\mathbf{R}_{\mathbf{y}_{k,j}}$ is the covariance of the errors, and the outer sum runs over the chosen window of poses. The predicted measurements are given by $\hat{\mathbf{y}}_{k,j} = \mathbf{g}(\hat{\mathbf{T}}_{k,b}\,\hat{\mathbf{p}}^j_b)$, where $\hat{\mathbf{T}}_{k,b}$ and $\hat{\mathbf{p}}^j_b$ are the estimated poses and keypoint positions in base frame $\underrightarrow{\mathcal{F}}_b$, which we choose to be the first camera frame in the window.
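To make the structure of this covariance-weighted reprojection cost concrete, the sketch below evaluates it for a toy window. It only evaluates the cost and does not perform the minimization, which the pipeline delegates to Ceres. The function names, the shared covariance `R_y`, and all numerical values are assumptions made for illustration.

```python
import numpy as np

def project(p_k, f_u, f_v, c_u, c_v, baseline):
    """Stereo pinhole model: homogeneous camera-frame point -> [u, v, d]."""
    x, y, z = p_k[:3]
    return np.array([f_u * x / z + c_u, f_v * y / z + c_v, f_u * baseline / z])

def reprojection_cost(T_kb, points_b, observations, R_y, intrinsics):
    """Sum of e^T R^{-1} e over all (pose k, keypoint j) observations."""
    R_inv = np.linalg.inv(R_y)
    cost = 0.0
    for (k, j), y_meas in observations.items():
        y_hat = project(T_kb[k] @ points_b[j], *intrinsics)  # predicted measurement
        e = y_hat - y_meas                                    # reprojection error
        cost += float(e @ R_inv @ e)
    return cost

# Toy two-pose window with two keypoints (hypothetical values).
intrinsics = (700.0, 700.0, 600.0, 180.0, 0.54)
T_kb = [np.eye(4), np.eye(4)]
T_kb[1][0, 3] = -0.5  # second pose shifts points by -0.5 m along camera x
points_b = [np.array([1.0, 0.5, 10.0, 1.0]), np.array([-2.0, 0.2, 15.0, 1.0])]
observations = {(k, j): project(T_kb[k] @ points_b[j], *intrinsics)
                for k in range(2) for j in range(2)}

cost_exact = reprojection_cost(T_kb, points_b, observations, np.eye(3), intrinsics)

# Perturb one measurement by one pixel in u; with identity covariance the cost is 1.
observations[(1, 0)] = observations[(1, 0)] + np.array([1.0, 0.0, 0.0])
cost_perturbed = reprojection_cost(T_kb, points_b, observations, np.eye(3), intrinsics)
```

In a real bundle adjustment, both the poses and the keypoint positions would be free variables, and each residual could carry its own covariance.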

Orientation Correction
In order to combat drift in the VO estimate produced by accumulated orientation error, we adopt the technique of Lambert et al. [11] to incorporate absolute orientation information from the sun directly into the estimation problem. We assume the initial camera pose and its timestamp are available from GPS and use them to determine the global direction of the sun $\mathbf{s}_w$, expressed as a 3D unit vector, from ephemeris data, where we have defined the world frame $\underrightarrow{\mathcal{F}}_w$ to be a local ENU coordinate frame. For simplicity, we assume that the full trajectory of the camera is sufficiently short so that the sun is effectively static, although it would be straightforward to obtain the global sun direction at each timestep for longer trajectories where the apparent motion of the sun is significant. By transforming the global sun direction into each camera frame $\underrightarrow{\mathcal{F}}_k$ in the window, we obtain predicted sun directions $\hat{\mathbf{s}}_k = \hat{\mathbf{T}}_{k,b}\,\mathbf{T}_{b,w}\,\mathbf{s}_w$, where $\hat{\mathbf{T}}_{k,b}$ is the current estimate of camera pose $k$ in the base frame, and $\mathbf{T}_{b,w}$ is the fixed, previously estimated transformation from the world frame to the base frame. We compare the predicted and estimated sun directions to introduce an additional error term into the bundle adjustment cost function (cf. Equation (2)):

$$\mathcal{J} = \sum_{k} \sum_{j} \mathbf{e}_{\mathbf{y}_{k,j}}^T \, \mathbf{R}_{\mathbf{y}_{k,j}}^{-1} \, \mathbf{e}_{\mathbf{y}_{k,j}} + \sum_{k} \mathbf{e}_{\mathbf{s}_k}^T \, \mathbf{R}_{\mathbf{s}_k}^{-1} \, \mathbf{e}_{\mathbf{s}_k}, \tag{3}$$

where $\mathbf{e}_{\mathbf{s}_k} = \hat{\mathbf{s}}_k - \mathbf{s}_k$ is the error in the predicted sun direction, and $\mathbf{R}_{\mathbf{s}_k}$ is the covariance of the errors. This additional term constrains the orientation of the camera, which helps limit drift in the VO result due to orientation error [11]. In contrast to [11], we operate directly on the 3D unit sun vectors rather than the underlying two angular degrees of freedom. While we could also use cosine distance as the error term in our cost function, in our Ceres-based implementation we found that using a Euclidean error term improved the problem's convergence properties. This is likely because the distribution of cosine distances is not well described by a zero-mean Gaussian distribution (see Figure 2).
In principle, Equations (2) and (3) could include an additional term to account for uncertainty in the transformation $\mathbf{T}_{b,w}$, which was previously an estimated quantity. Although the omission of this term means that our estimator may be overconfident in the sun measurements for certain segments of the trajectory, we found that a well chosen static covariance on the sun measurements nevertheless produced good results in practice. We therefore defer an investigation of this more principled uncertainty propagation to future work.
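The sun-direction error term can be illustrated with the following sketch, which predicts the sun direction in a camera frame and evaluates the Euclidean error and its weighted cost. Several details are assumptions made for the example: the azimuth is taken clockwise from north in the ENU frame, only the rotation part of each transform is applied because the sun vector is a direction, and the function names and covariance values are hypothetical.

```python
import numpy as np

def sun_direction_enu(azimuth_deg, elevation_deg):
    """Unit sun vector in ENU from azimuth (clockwise from north) and elevation."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.sin(az) * np.cos(el),   # east
                     np.cos(az) * np.cos(el),   # north
                     np.sin(el)])               # up

def rot_z(angle_rad):
    """4x4 homogeneous rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def sun_error(T_kb, T_bw, s_w, s_k_meas):
    """Euclidean error between predicted and measured sun directions."""
    C_kw = (T_kb @ T_bw)[:3, :3]   # directions are unaffected by translation
    return C_kw @ s_w - s_k_meas

def sun_cost(e_s, R_s):
    """Covariance-weighted squared error, e^T R^{-1} e."""
    return float(e_s @ np.linalg.inv(R_s) @ e_s)

s_w = sun_direction_enu(180.0, 45.0)          # sun due south, 45 degrees high
T_bw = np.eye(4)
T_kb_true = rot_z(np.radians(90.0))
s_k_meas = T_kb_true[:3, :3] @ s_w            # noise-free measurement

R_s = 0.05 ** 2 * np.eye(3)                   # hypothetical static covariance
cost_true = sun_cost(sun_error(T_kb_true, T_bw, s_w, s_k_meas), R_s)
T_kb_drift = rot_z(np.radians(95.0))          # pose estimate with 5 degrees of yaw drift
cost_drift = sun_cost(sun_error(T_kb_drift, T_bw, s_w, s_k_meas), R_s)
```

A pose estimate with accumulated yaw drift yields a nonzero cost, which is what allows the optimizer to pull the orientation back towards consistency with the sun measurement.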

Visual Illumination Estimation
While Lambert et al. [11] make use of a hardware sun sensor to estimate the direction of the sun relative to the vehicle, in our approach we wish to use the existing RGB image stream to compute this illumination information in addition to the motion of the camera. We examine three techniques for estimating the sun direction in a single outdoor RGB image: the technique of Lalonde et al. [10], which estimates the sun direction based on a combination of weak visual cues; an improved version of [10] that makes use of a novel VO-informed prior term to improve its accuracy and reliability; and Sun-CNN, a recent technique for estimating the sun direction using a Convolutional Neural Network (CNN) [12].
"Lalonde" [10] estimates the maximum likelihood azimuth-zenith sun direction in a single RGB image by combining relatively weak information from a physically based sky model [15], shadow detection, pedestrian detection, and vertical surface detection routines, as well as a data-driven prior term that captures the distribution of typical sun zeniths in photographs. An implementation of this technique is freely available as open-source software. For our purposes, we use only a subset of these visual cues since the others tended to produce erroneous or null results in our experiments. Specifically, we use the sky model, shadow detection, and prior term described in [10]. Figure 1a shows an example of the results we obtained using this method. Note that in this case the algorithm produced an incorrect sun detection due to the bimodal ambiguity in the shadow cue and the symmetry of the sky model and prior term.
Fig. 1: (a) An ambiguous detection resulting in an incorrect maximum likelihood solution, using the prior term of [10]. (b) The ambiguity is resolved using a VO-informed prior, which constrains the distribution over sun positions.
Since [10] tends to fail in the presence of ambiguous shadows and saturated sky pixels, we reject obvious outliers in our VO pipeline by thresholding the cosine distance between the observed and predicted sun directions based on the current pose estimate. In practice, we found a cosine distance threshold of 0.3 to be a reasonable choice. However, as shown in Figure 2a, the distribution of zenith errors is skewed. This is due to the bias introduced by the prior term of [10], which fails to correctly capture the distribution of sun zeniths in the KITTI dataset. We resolve this issue by thresholding the zenith error (or, equivalently, the y-component error in the camera frame) to exclude the skewed portion of the distribution, yielding a more Gaussian-like distribution over zenith errors.
"Lalonde-VO" is a modified version of [10] where we have replaced the original zenith-only prior term with a novel prior term that incorporates the expected sun direction based on the current VO estimate. The motivation for incorporating this information is twofold. First, in cases where the sky cue fails, the shadow cue's bimodal probability distribution forces the algorithm to choose one of the two possible solutions at random, leading to a high proportion of erroneous measurements (Figure 1a). By incorporating a weak prior based on the estimated camera pose, we can resolve the ambiguity in the two solutions (Figure 1b). Second, ambiguous shadow cues often result in an incorrect pair of maximum likelihood sun azimuths, yet there is typically a secondary pair of local maxima with lower probability that are in fact correct. The sky cue alone is not generally strong enough to bias the result towards the correct direction in these cases, but our new VO-informed prior term allows the algorithm to ignore incorrect shadow orientations and incorporate information from the weaker pair of maxima.
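The cosine-distance outlier test used for the "Lalonde" measurements can be sketched as follows. We assume here that cosine distance means one minus the cosine of the angle between the two directions, under which a threshold of 0.3 corresponds to an angular error of roughly 46 degrees; the function names are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """One minus the cosine of the angle between two 3D vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_inlier(s_observed, s_predicted, threshold=0.3):
    """Accept a sun detection only if it roughly agrees with the VO prediction."""
    return cosine_distance(s_observed, s_predicted) < threshold

s_predicted = np.array([0.0, 0.0, 1.0])
angle = np.radians(30.0)
s_close = np.array([np.sin(angle), 0.0, np.cos(angle)])   # 30 degrees away
s_far = np.array([1.0, 0.0, 0.0])                         # 90 degrees away

print(is_inlier(s_close, s_predicted))  # True  (distance is about 0.134)
print(is_inlier(s_far, s_predicted))    # False (distance is 1.0)
```

Because the test depends on the current pose estimate, it is intended only to remove gross failures such as a detection in the wrong half of the sky, not to filter small errors.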
We define our VO-informed prior term as a Gaussian distribution over azimuth and zenith angles whose mean is the expected sun direction, and choose the covariance of this distribution such that the 3σ bounds on the azimuth prior span 360°, while the 3σ bounds on the zenith prior span 90°. In this way, we account for uncertainty in the camera poses and avoid excessively biasing the sun detection; we need only bias the result towards the correct 'half' of the sky.
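A minimal sketch of such a prior is given below. It assumes that "3σ bounds spanning 360°" means ±3σ = ±180° (so σ = 60° in azimuth) and likewise ±3σ = ±45° (σ = 15°) in zenith, that azimuth and zenith are treated as independent, and that azimuth differences are wrapped to [-180°, 180°). These interpretations and the function names are ours, for illustration only.

```python
import numpy as np

SIGMA_AZ_DEG = 60.0    # assumed: +/-3 sigma covers 360 degrees of azimuth
SIGMA_ZEN_DEG = 15.0   # assumed: +/-3 sigma covers 90 degrees of zenith

def wrap_deg(angle_deg):
    """Wrap an angle difference to the interval [-180, 180) degrees."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def vo_log_prior(az_deg, zen_deg, az_expected_deg, zen_expected_deg):
    """Unnormalized log of an independent Gaussian prior over azimuth and zenith."""
    d_az = wrap_deg(az_deg - az_expected_deg)
    d_zen = zen_deg - zen_expected_deg
    return -0.5 * ((d_az / SIGMA_AZ_DEG) ** 2 + (d_zen / SIGMA_ZEN_DEG) ** 2)

# A bimodal shadow cue proposes two azimuths 180 degrees apart with equal likelihood.
candidates_az = [40.0, 220.0]
az_expected, zen_expected = 50.0, 45.0   # expected sun direction from the VO estimate
scores = [vo_log_prior(az, 45.0, az_expected, zen_expected) for az in candidates_az]
best_az = candidates_az[int(np.argmax(scores))]
print(best_az)  # 40.0
```

With a prior this broad, two candidates on the same side of the sky receive similar weight, so the visual cues still determine the final estimate; the prior mainly separates the two modes of the shadow cue.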
"Sun-CNN" [12] uses a Convolutional Neural Network (CNN) trained on sequences from the KITTI dataset [7] annotated with ground truth sun directions to estimate the likely azimuth angle of the sun from a single RGB image. Ma et al. [12] show that Sun-CNN substantially outperforms [10] in terms of azimuth estimation accuracy on the KITTI odometry benchmark, but since it does not estimate the zenith angle of the sun, it is best suited to planar navigation tasks such as autonomous driving of land vehicles. Since our sun-corrected VO pipeline requires the full 3D direction of the sun relative to the camera, we assign a value of zero and a large covariance to the vertical component of the Sun-CNN estimate so that the unknown component of the sun direction is effectively ignored.
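One way to realize this azimuth-only measurement is sketched below. The camera-frame convention (x right, y down, z forward, with azimuth measured from the forward axis towards the right) and the standard deviations are assumptions made for the example, as is the function name `sun_cnn_measurement`.

```python
import numpy as np

def sun_cnn_measurement(azimuth_deg, sigma_horizontal=0.1, sigma_vertical=1e3):
    """Build a 3D sun measurement and covariance from an azimuth-only estimate.

    The vertical (camera y) component is set to zero and given a very large
    variance so that it carries essentially no weight in the cost function.
    """
    az = np.radians(azimuth_deg)
    s_k = np.array([np.sin(az), 0.0, np.cos(az)])
    R_s = np.diag([sigma_horizontal ** 2, sigma_vertical ** 2, sigma_horizontal ** 2])
    return s_k, R_s

s_k, R_s = sun_cnn_measurement(90.0)

# A purely vertical error of 0.5 contributes almost nothing to the weighted cost,
# whereas the same error in a horizontal component is heavily penalized.
e_vertical = np.array([0.0, 0.5, 0.0])
e_horizontal = np.array([0.5, 0.0, 0.0])
R_inv = np.linalg.inv(R_s)
cost_vertical = float(e_vertical @ R_inv @ e_vertical)        # 0.25 / 1e6
cost_horizontal = float(e_horizontal @ R_inv @ e_horizontal)  # 0.25 / 0.01
```

The resulting measurement constrains heading only, which is consistent with the planar navigation setting described above.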

Results
We present results for a combined 7.8 km of urban driving from the popular KITTI dataset [7] using a two-frame sliding window and estimated sun directions from each algorithm for every fifth image. Figure 3 shows sample frames from five sequences in the KITTI raw dataset, ranging in length from 300 m to 3.7 km, which we selected mainly for their strong shadows and unsaturated sky pixels. We evaluate the translational and rotational average root mean squared error (ARMSE) and the final translational drift error of our VO algorithm, both with and without the sun-based orientation correction.
We processed each sequence using the same set of stereo feature tracks obtained from libviso2 [8], first using pure VO alone, then by incorporating measurements from each sun detection method in turn. The covariances associated with each sun detection algorithm were individually tuned to reflect the measurement error distribution of each algorithm, and we made a bona fide effort to present the best performance of each algorithm on each sequence. Figure 4 shows the estimated and ground truth trajectories for the 2.2 km sequence 2011_09_30_drive_0018. With the exception of the "Lalonde" method, the sun-aided VO trajectories are noticeably closer to ground truth than the pure VO trajectory. The "Lalonde" method appears to have had minimal impact on this sequence due to the relatively low number of inlier sun detections.
Fig. 3: Sample frames from five sequences from the KITTI raw dataset [7], ranging in length from 300 m to 3.7 km. These sequences contain strong shadows and mostly unsaturated skies, which are amenable to visual sun direction estimation.
Table 1 quantifies the differences between the results by reporting their translational and rotational ARMSE, as well as the final translational drift error, relative to ground truth. We see that including the sun-based orientation correction can yield a substantial reduction in estimation error compared to pure VO, particularly on the longer sequences, which contain several sharp turns. This is especially apparent in the case of sequence 2011_09_30_drive_0018, which enjoys a 43% reduction in translational ARMSE (62% in-plane), and a 59% reduction in final translational drift error (64% in-plane) using the Sun-CNN method [12]. We stress that this improvement is purely due to information already available in the existing image stream: no additional sensors are required.
On the other hand, short straight sequences such as 2011_09_26_drive_0019 and 2011_09_26_drive_0039 do not benefit significantly from sun measurements since the accumulated orientation error in the VO estimate is already small.
Overall, the "Sun-CNN" and "Lalonde-VO" methods outperform the "Lalonde" method in terms of reducing estimation error in our stereo VO pipeline. This is to be expected since the "Lalonde-VO" method incorporates additional information about the temporal consistency of the images, while Ma et al. [12] have already shown that Sun-CNN is both more accurate and more reliable than [10] on single images in the KITTI dataset. While "Sun-CNN" and "Lalonde-VO" yield the minimum estimation error in similar numbers of cases, in the cases where "Sun-CNN" performs better, it does so by a wide margin. Furthermore, Sun-CNN is faster to evaluate than the other two algorithms while simultaneously avoiding hand-crafted features and approximate models of hand-picked cues. This suggests that high level scene understanding using machine learning may be a promising tool for improving robot localization accuracy in addition to providing semantic information about the environment.

Conclusions and Main Experimental Insights
In this work we have shown that estimation error in stereo visual odometry (VO) can be reduced by exploiting global illumination information available from the same image stream used to compute VO. The main insight here is that there is much to be gained in visual navigation by reasoning over more than just geometry. In particular, the notion of embracing illumination as a tool for localization is one that has not been widely adopted, yet is a promising direction for future research. Convolutional Neural Networks (CNNs) in particular appear to be excellent tools for extracting illumination information in a form amenable to conventional VO techniques. Future work might focus on developing these tools further to yield even greater gains in localization accuracy and robustness.