Monocular visual-IMU odometry using multi-channel image patch exemplars

In this paper, we propose three sets of multi-channel image patch features for monocular visual-IMU (Inertial Measurement Unit) odometry. The proposed feature sets extract image patch exemplars from multiple feature maps of an image. We also modify an existing visual-IMU odometry framework by using different salient point detectors and feature sets and replacing the inlier selection approach with a self-adaptive scheme. The modified framework is used to examine the proposed feature sets. In addition to the Root Mean Square Error (RMSE) metric, we use the Hausdorff distance to measure the inconsistency between the estimated and ground-truth trajectories. Compared to the point-wise comparison used by RMSE, the Hausdorff distance takes the shape inconsistency of two trajectories into account and is hence more perceptually consistent. Experimental results show that the multi-channel feature sets outperform, or perform comparably to, the single gray level channel feature sets examined in this study. In particular, the multi-channel feature set that uses integral channels, i.e., ICIMGP (Integral Channel Image Patches), outperforms two state-of-the-art feature sets: SIFT (Scale Invariant Feature Transform) and SURF (Speeded-Up Robust Features). Moreover, ICIMGP performs better than the two multi-channel feature sets designed based on derivative channels and gradient channels respectively. These promising results are attributed to the fact that the multi-channel features encode richer image characteristics than their single gray level channel counterparts.


Introduction
Accurate and reliable ego-motion estimation in dynamic and unknown surroundings plays an important role in autonomous robot navigation and localization tasks. In contrast to the steady and invariant indoor environment, it is more challenging for robots to estimate their current location in highly dynamic outdoor environments [23]. In the literature, the Inertial Measurement Unit (IMU) is one of the most commonly used sensors for addressing this issue.

Related Work
In this section, we review previous work on visual(-IMU) odometry, the salient point detectors and local features used in visual odometry systems, and multi-channel image features.

Visual (-IMU) Odometry
Motion trajectory estimation is a well-studied topic in the fields of computer vision and robotics. As an extensively applied technique for motion trajectory estimation, Visual Odometry (VO) incrementally estimates the motion trajectory of the camera using consecutive image frames [30]. According to the camera configuration involved, VO can be divided into two categories: monocular and stereo [30]. In this study, we focus on the monocular VO application because of its simplicity and effectiveness.
Since IMUs (Inertial Measurement Units) and cameras are complementary, combining them improves both the reliability and precision of motion trajectory estimation [22] [25]. In the literature, tightly-coupled fusion solutions have received much attention. Inspired by the study of Davison et al. [8], Pinies et al. [27] proposed an Extended Kalman Filter (EKF) based real-time fusion framework using monocular vision. Instead of the constant velocity motion model [8], the IMU motion model was utilized in this framework. On the basis of the work of Mourikis and Roumeliotis [25], Hu and Chen [20] further introduced a sliding-window monocular visual-IMU odometry framework and achieved a tradeoff between computational cost and precision. In addition, Shen et al. [31] presented a monocular vision based visual-inertial fusion framework in order to estimate the flight trajectories of micro aerial vehicles (MAVs).
In contrast, loosely-coupled fusion solutions treat the IMU and the visual module as two separate parts. For instance, Sirtkaya et al. [32] fused the inertial navigation method with the relative pose estimation between consecutive image frames. However, regardless of whether a tightly-coupled or a loosely-coupled system is used, robust image feature sets are required in order to represent images sufficiently.

Salient Point Detectors and Local Features Used in Visual Odometry Systems
In Visual Odometry systems, the motion trajectory of vehicles is normally estimated by matching consecutive frames recorded by cameras [30]. In order to enhance the image matching speed, a set of salient points is usually selected from the images. Hence, salient point detection and feature extraction are two key components of those systems. In vision based motion trajectory estimation applications, the most commonly used salient point detectors include the Harris [18] [22] and FAST [29] [3] corner detectors, which are efficient but less distinctive and less reliably redetected.
In the computer vision community, two blob detectors, Difference of Gaussians (DoG) [24] and Fast Hessian [2], were proposed along with two local feature sets, Scale Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), respectively. Compared with the corner detectors and associated image features, the SIFT and SURF feature sets are rotation-invariant. Both feature sets have been used in visual(-IMU) odometry systems [20] [25] [28]. In order to obtain both blob and corner points, Geiger et al. [16] introduced a blob and corner detector and a set of features based on Sobel [33] filtered image patches. In addition, gray level image patch features have been used in visual(-IMU) odometry applications [3] [6] [26].
However, the local feature sets mentioned above are normally extracted from the single gray level channel of images. In contrast, multi-channel image representations provide richer image characteristics and possess stronger discriminatory power.

Multi-channel Image Features
Multi-channel image features normally exploit different image feature channels and can be regarded as extensions of gray level features. Barros et al. [1] used the gray level image and the corresponding Sobel [33] filtered images computed in the horizontal and vertical directions for hand posture recognition. Considering that gradient data is associated with image structure, Dong et al. [11] learned textons from the joint distributions of the gradient magnitude and gradient direction data computed using the Canny detector [4].
Recently, RGB-D images have been applied in computer vision applications. Dollár et al. [10] computed depth and normal gradient data from RGB-D images in order to improve the performance of edge detection. Furthermore, Gupta et al. [17] exploited the combination of the height above ground, the angle with gravity and the horizontal disparity derived from RGB-D images. In addition, Dollár et al. [9] proposed a set of integral channel features. The integral channels are obtained from linear and non-linear transformations of the original image.
Those features have been successfully applied to object detection and contour detection [9].
Inspired by the studies mentioned above, we investigate the application of multi-channel features to monocular visual-IMU odometry systems in this study. Our hypothesis is that multi-channel features outperform their counterparts computed from the gray level channel. Due to the efficiency and effectiveness of image patch features [34], we first extract local image patch exemplars from each channel separately and then concatenate them into a single feature vector.

Multi-channel Image Patch Features
It has been shown that multi-channel image features provide richer information than gray level features [1,9,10,11,17]. In this section, we introduce three sets of multi-channel image patch features, which will be used in the odometry framework described in Section 5. We first describe three different types of multi-channel data and then introduce the three multi-channel image patch feature sets.

Fig. 1 A gray level image frame contained in the KITTI dataset [15].

Multiple Derivative Channels
Image derivative maps can be computed using a variety of convolution filters, such as the Canny [4] and Sobel [33] operators. Considering the tradeoff between accuracy and computational cost, we use the Canny operator [4] in this study. Given an input gray level image $I(x, y)$ (see Fig. 1 for an example), the derivative maps $I_x(x, y)$ and $I_y(x, y)$ in the $x$ and $y$ directions can be computed as:

$$I_x(x, y) = G_x(x, y, \sigma) * I(x, y), \quad (1)$$

$$I_y(x, y) = G_y(x, y, \sigma) * I(x, y), \quad (2)$$

where $G(x, y, \sigma)$ is a Gaussian function, $G_x(x, y, \sigma)$ and $G_y(x, y, \sigma)$ are its derivative functions in the $x$ and $y$ directions respectively, and $*$ denotes convolution. The derivative maps obtained from the image shown in Fig. 1 are displayed in Fig. 2.
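As a hedged illustration, Equations (1) and (2) can be computed by convolving the image with first-order Gaussian derivative filters; the sketch below uses SciPy's Gaussian filtering rather than the paper's exact Canny implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def derivative_maps(image, sigma=1.0):
    """Compute the x- and y-derivative maps of a gray level image by
    convolving it with first-order Gaussian derivative filters."""
    img = image.astype(np.float64)
    # order=(0, 1): differentiate along the x (column) axis
    ix = gaussian_filter(img, sigma=sigma, order=(0, 1))
    # order=(1, 0): differentiate along the y (row) axis
    iy = gaussian_filter(img, sigma=sigma, order=(1, 0))
    return ix, iy
```

On a linear intensity ramp, the x-derivative map recovers the ramp slope while the y-derivative map is zero, which is a quick sanity check of the filter orientation.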

Multiple Gradient Channels
Both image gradient magnitudes and gradient orientations can be used for image representation because they encode different structural characteristics of images. In essence, gradient magnitudes denote how quickly an image is changing, while gradient orientations indicate the direction in which the image color or intensity changes fastest [11]. In addition, due to the influence of lighting conditions or camera properties, two images of the same scene may show different appearances. In this situation, the extracted features may fail to match the corresponding images. However, gradient information has the merit of being less sensitive to lighting and camera changes. Therefore, image features extracted from the gradient data of the original images are both useful and robust.
In this study, the gradient maps are computed using the Canny operator [4]. Given the derivative maps $I_x(x, y)$ and $I_y(x, y)$ computed using Equations (1) and (2) respectively, the gradient magnitude map $M(x, y)$ can be calculated as:

$$M(x, y) = \sqrt{I_x(x, y)^2 + I_y(x, y)^2}. \quad (3)$$

Correspondingly, the gradient orientation map $\Theta(x, y)$ is computed as:

$$\Theta(x, y) = \arctan\left(\frac{I_y(x, y)}{I_x(x, y)}\right). \quad (4)$$

In terms of the image shown in Fig. 1, the computed gradient magnitude and orientation maps are shown in Fig. 3.
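The gradient magnitude and orientation maps above follow directly from the derivative maps; a minimal sketch (using `arctan2` to keep the orientation well-defined when $I_x = 0$):

```python
import numpy as np

def gradient_maps(ix, iy):
    """Compute gradient magnitude and orientation maps from the
    x- and y-derivative maps of an image."""
    magnitude = np.sqrt(ix ** 2 + iy ** 2)
    # arctan2 handles ix == 0 gracefully; result is in radians, (-pi, pi]
    orientation = np.arctan2(iy, ix)
    return magnitude, orientation
```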

Fast Hessian Salient Point Detector
In this study, we used the Fast Hessian salient point detector [2] to find a set of salient points in each gray level image frame. This detector was introduced as a part of the Speeded-Up Robust Features (SURF) [2] method.
The detector was designed based on the determinant of the Hessian matrix. The Hessian matrix $\mathcal{H}(\mathbf{x}, \sigma)$ at point $\mathbf{x}$ and scale $\sigma$ is expressed as [2]:

$$\mathcal{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}, \quad (5)$$

where $L_{xx}(\mathbf{x}, \sigma)$, $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$ are the convolutions of the second order Gaussian derivative functions with the input image $I$ at point $\mathbf{x}$.
Due to the heavy computational cost of high scale Gaussian convolutions, Bay et al. [2] approximated the Gaussian second derivative functions using box filters. As a result, the speed of the Fast Hessian detector is dramatically increased. An accurate approximation of the Hessian determinant based on the approximated Gaussian functions is given as:

$$\det(\mathcal{H}_{\mathrm{approx}}) = D_{xx} D_{yy} - (0.9 D_{xy})^2, \quad (6)$$

where $D_{xx}$, $D_{yy}$ and $D_{xy}$ are the box filter approximations in the $x$, $y$ and mixed $xy$ directions respectively.
The detection of the local maxima of Equation (6) over image locations and scales generates the salient points of the input image. In our experiments, the Fast Hessian detector was applied to each gray level image frame and a set of salient points was obtained.
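As a hedged illustration of the detector's response, the determinant of Equation (6) can be computed with exact Gaussian second derivatives in place of Fast Hessian's box filters (the 0.9 weighting follows Bay et al. [2]; this is a single-scale sketch, not the full multi-scale detector):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_determinant(image, sigma):
    """Approximate determinant-of-Hessian response at one scale, using
    exact Gaussian second derivatives instead of box filters."""
    img = image.astype(np.float64)
    lxx = gaussian_filter(img, sigma, order=(0, 2))  # d^2/dx^2
    lyy = gaussian_filter(img, sigma, order=(2, 0))  # d^2/dy^2
    lxy = gaussian_filter(img, sigma, order=(1, 1))  # d^2/dxdy
    # 0.9 compensates for the box-filter approximation in SURF [2]
    return lxx * lyy - (0.9 * lxy) ** 2
```

By symmetry, the response of an isotropic blob peaks at the blob center, which is exactly the kind of structure the detector selects.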

Multi-channel Image Patch Feature Sets
The straightforward representation of a pixel can be obtained as the image patch surrounding it. Compared to other image features, e.g., SIFT [24] and SURF [2], image patch based features retain original image characteristics while being more computationally efficient. Since local image patches often experience less distortion than global images, the similarity measurement between two local patches is more robust. In the literature, it has been shown that even small image patches can provide strong discriminatory power for texture images [34].
Considering the merits of multi-channel representations, we therefore build a feature vector from the image patches extracted from a series of different feature channels of an image. The features are only extracted at the locations of the salient points detected using Fast Hessian [2] or other salient point detectors. Sample image patches extracted from the three types of multi-channel data described in the previous subsections are illustrated in Figures 5, 6 and 7 respectively. In terms of each salient point, the image patches extracted at this point from the different channels are concatenated into a single feature vector. We will test the three feature sets together with several state-of-the-art feature sets using a modified monocular visual-IMU odometry framework [20] in Section 6.
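The patch extraction and concatenation step can be sketched as follows; the `(row, column)` point convention and the border handling are our assumptions, not the paper's specification:

```python
import numpy as np

def multichannel_patch_feature(channels, point, patch_size=11):
    """Extract a (patch_size x patch_size) patch centered on a salient
    point from every channel map and concatenate the flattened patches
    into one feature vector."""
    r = patch_size // 2
    y, x = point  # assumed (row, column); point must be >= r from the border
    parts = []
    for ch in channels:
        patch = ch[y - r:y + r + 1, x - r:x + r + 1]
        parts.append(patch.ravel())
    return np.concatenate(parts)
```

With eight integral channels and 11×11 patches, for example, each salient point yields an 8 × 121 = 968-dimensional vector.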

Performance Measures
The Root Mean Square Error (RMSE) metric is normally used as a performance measure for visual odometry systems. However, this measure only considers the accumulation of the error between the ground-truth position and the estimated position without taking into account the dissimilarity of the shapes of trajectories. Therefore, in addition to RMSE, we applied a shape matching algorithm, namely, the Hausdorff distance [12], to measure the inconsistency between the estimated and ground-truth trajectories. In essence, the Hausdorff distance compares two trajectories based on both local and global similarity and is more perceptually-consistent than RMSE.

Fig. 10 The pipeline of the extraction process of the "ICIMGP" feature set.

Root Mean Squared Error (RMSE)
The RMSE measure is computed as the square root of the mean of the squared differences between the estimated data and the ground-truth data. When two sets of 2D coordinate data, the ground-truth trajectory $P$ and the estimated trajectory $Q$, are used, RMSE is defined as:

$$RMSE(P, Q) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]}, \quad (7)$$

where $(x_i, y_i)$ belongs to the ground-truth trajectory $P$, $(\hat{x}_i, \hat{y}_i)$ is the corresponding member of the estimated trajectory $Q$, and $N$ is the number of points.
As shown in Equation (7), RMSE only takes the accumulation of point-wise errors into account while ignoring the global dissimilarity of two trajectories. According to the definition, the RMSE measure is symmetric (i.e., $RMSE(P, Q) = RMSE(Q, P)$) because it computes the error between two points with the same index on two different trajectories. Fig. 11 illustrates the point-to-point comparison procedure used by RMSE. It can be seen that the matching from trajectory $P$ to $Q$ is identical to that from trajectory $Q$ to $P$.
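The point-wise RMSE of Equation (7) can be sketched as follows; note that swapping the two trajectories leaves the value unchanged, which is the symmetry discussed above:

```python
import numpy as np

def trajectory_rmse(ground_truth, estimate):
    """Point-wise RMSE between two 2D trajectories of equal length."""
    gt = np.asarray(ground_truth, dtype=float)
    est = np.asarray(estimate, dtype=float)
    # squared Euclidean distance between points sharing the same index
    sq = np.sum((gt - est) ** 2, axis=1)
    return np.sqrt(np.mean(sq))
```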

The Hausdorff Distance
In mathematics, the Hausdorff distance [12] is calculated as the maximum of the distances between a point in one set and its nearest point in another set. It measures the distance between two subsets of a metric space. The Hausdorff distance has been used for shape matching in computer vision. In this study, we used the modified Hausdorff distance introduced in [12]. Given two point sets $P$ and $Q$, the directed modified Hausdorff distance between them is computed as:

$$h(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} d(p, q), \quad (8)$$

where $p$ and $q$ are the points in $P$ and $Q$ respectively, and $d(p, q)$ is the direct distance between $p$ and $q$ (usually the Euclidean distance). It can be seen from Equation (8) that this distance is asymmetric, i.e., $h(P, Q) \neq h(Q, P)$. Fig. 12 shows the difference between the matching computations from trajectory $P$ to $Q$ and from trajectory $Q$ to $P$. Therefore, a more general, symmetric definition of the Hausdorff distance is expressed as:

$$H(P, Q) = \max\{h(P, Q), h(Q, P)\}. \quad (9)$$
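Equation (8) and its symmetric variant can be sketched as follows, assuming the mean-of-nearest-distances form of the modified Hausdorff distance [12]:

```python
import numpy as np

def directed_mhd(a, b):
    """Directed modified Hausdorff distance h(A, B): the mean, over the
    points of A, of the Euclidean distance to the nearest point in B."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # pairwise distance matrix between every point of A and every point of B
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).mean()

def hausdorff_distance(a, b):
    """Symmetric distance H(A, B) = max{h(A, B), h(B, A)}."""
    return max(directed_mhd(a, b), directed_mhd(b, a))
```

The two directed values generally differ, which is why the symmetric maximum is taken.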

Comparison of RMSE and the Hausdorff Distance
Three different paths are shown in Fig. 13, including the ground-truth path, Path 1 and Path 2. It can be observed that Path 2 is more similar to the ground-truth path than Path 1. However, the RMSE value (0.118) computed between the ground-truth path and Path 1 is the same as that calculated between the ground-truth path and Path 2. In contrast, the Hausdorff distance between Path 1 and the ground-truth path is 0.091 while that computed between Path 2 and the ground-truth path is 0.063. This result suggests that the Hausdorff distance is more consistent with human perceptual judgements than the RMSE metric. Therefore, we used both RMSE and the Hausdorff distance as performance measures in this study.
Fig. 13 The ground-truth path and its two different estimations.

Experimental Setup
Hu and Chen [20] proposed a visual-IMU odometry system which combines a monocular camera with an IMU.
In this study, we modified this system in two aspects. First, we revised the salient point detection and feature extraction module so that different salient point detectors and/or feature sets can be used in the system.
Second, we replaced the feature matching and inlier selection module with a self-adaptive scheme in order to prevent the system from crashing when insufficient inliers are returned. In addition, three paths selected from a publicly available dataset [15] were used along with the modified system.

The Modified Visual-IMU Odometry System
Inspired by the multi-state constraint Kalman filter [25], Hu and Chen [20] proposed a monocular visual-IMU odometry system (see Fig. 14 for its pipeline) in which the trifocal tensor geometry relationship [19] between three images is used as the camera measurement. This design removes the requirement to estimate the 3D positions of feature points. The use of a moving-window scheme further accelerates the computation while retaining proper accuracy. The odometry system also applies the Random Sample Consensus (RANSAC) [13] method in order to reject mismatched feature points or feature points located on independently moving objects.

Fig. 14 The pipeline of the visual-IMU odometry system proposed in [20].

Filter State Initialization
The filter state vector consists of the IMU state and the last two camera poses. The IMU state vector comprises the position, orientation, velocity, and the IMU accelerometer and gyroscope biases.
The filter state includes two parts: the nominal state and the error state. In this study, we used the same parameters as Hu and Chen [20].

5.1.2 Filter Propagation

In the filter propagation step, the filter nominal state is predicted from the IMU nominal state using 4th-order Runge-Kutta integration [5]. The propagation of the error state can be described as:

$$\delta\dot{\mathbf{x}} = \mathbf{F}_c\, \delta\mathbf{x} + \mathbf{G}_c\, \mathbf{n}, \quad (10)$$

where $\mathbf{F}_c$ is the continuous-time state transition matrix, $\mathbf{G}_c$ is the Jacobian matrix of the body error state with respect to the noise, and $\mathbf{n}$ is the system noise. The discrete-time state transition matrix $\mathbf{\Phi}_d$ can be obtained by substituting $\mathbf{F}_c$ into the Taylor series of the matrix exponential:

$$\mathbf{\Phi}_d = \exp(\mathbf{F}_c \Delta t) = \mathbf{I} + \mathbf{F}_c \Delta t + \frac{(\mathbf{F}_c \Delta t)^2}{2!} + \cdots \quad (11)$$

The continuous-time system noise covariance matrix is defined as $\mathbf{Q}_c = E[\mathbf{n}\mathbf{n}^T]$ (12), while its discrete form can be obtained as:

$$\mathbf{Q}_d = \int_{0}^{\Delta t} \mathbf{\Phi}(\tau)\, \mathbf{G}_c \mathbf{Q}_c \mathbf{G}_c^T\, \mathbf{\Phi}(\tau)^T\, d\tau. \quad (13)$$

On the basis of $\mathbf{\Phi}_d$ and $\mathbf{Q}_d$, the predicted error state covariance matrix $\mathbf{P}_{k+1|k}$ can be calculated as:

$$\mathbf{P}_{k+1|k} = \mathbf{\Phi}_d \mathbf{P}_{k|k} \mathbf{\Phi}_d^T + \mathbf{Q}_d. \quad (14)$$
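The discretization and covariance propagation described above can be sketched as follows; this is a minimal version using a first-order approximation of the discrete noise covariance, and the symbol names (`f_c`, `g_c`, `q_c`) are ours rather than the paper's:

```python
import numpy as np
from scipy.linalg import expm

def propagate_covariance(p, f_c, g_c, q_c, dt):
    """One EKF propagation step: discretize the continuous-time error
    model and propagate the error state covariance.
    p: current error covariance; f_c: continuous-time state transition
    matrix; g_c: noise Jacobian; q_c: continuous noise covariance."""
    phi_d = expm(f_c * dt)  # discrete-time state transition matrix
    # first-order approximation of the discrete noise covariance integral
    q_d = phi_d @ g_c @ q_c @ g_c.T @ phi_d.T * dt
    return phi_d @ p @ phi_d.T + q_d
```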

Salient Point Detection and Feature Extraction
When a new image is recorded by the camera, the system conducts salient point detection and feature extraction. The reason for using a set of salient points rather than all the pixels in the image is to reduce the computational complexity. In this study, salient points were detected from gray level images. After salient point detection is complete, image features are extracted at the detected salient points. We modified the original framework [20] in order to incorporate different salient point detectors and image feature sets into it.

5.1.4 Self-Adaptive Feature Matching and Inlier Selection

The trifocal geometry relationship [19] between three consecutive image frames is used as the camera measurement data. In order to map the matched feature points between the first two frames into the third frame, the trifocal tensor [19] is utilized. In this study, we used the feature matching algorithm introduced by Lowe [24].
A "bucketing" method [21] is further used to choose a certain number of matched feature points for the purpose of minimizing the computational cost.
Since the dataset [15] used in this study was collected in an outdoor environment and contains various moving objects, these objects may lead to mismatches of points between consecutive frames or to matches between independently moving objects. Hu and Chen [20] used the RANSAC method [13] to select inliers in order to address this problem. However, this method may return an insufficient or even empty inlier set, especially when the feature matching operation is conducted strictly. Thus, we introduce a simple self-adaptive feature matching and inlier selection scheme. Specifically, the system automatically adjusts the threshold used by the feature matching algorithm and re-conducts the feature matching operation whenever the RANSAC-based inlier selection fails. This process is repeated until a proper number of inliers is returned. The self-adaptive scheme guarantees that the odometry system will not crash unexpectedly when inlier selection fails.
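The self-adaptive scheme can be sketched as the retry loop below; `match_fn`, `ransac_fn` and every threshold value are illustrative placeholders, not the system's actual matching stage or parameters:

```python
def self_adaptive_inliers(match_fn, ransac_fn, init_threshold=0.6,
                          step=0.05, max_threshold=0.95, min_inliers=8):
    """If RANSAC returns too few inliers, relax the matching threshold
    and retry until enough inliers are found or the threshold limit is
    reached.  match_fn(threshold) -> candidate matches;
    ransac_fn(matches) -> inlier subset."""
    inliers = []
    threshold = init_threshold
    while threshold <= max_threshold:
        matches = match_fn(threshold)
        inliers = ransac_fn(matches)
        if len(inliers) >= min_inliers:
            return inliers
        threshold += step  # relax the matching criterion and retry
    return inliers  # best effort; caller may fall back to IMU-only prediction
```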

5.1.5 Filter State Update

Once inliers are obtained, they are used to update the error state and error covariance of the filter based on the epipolar geometry model and the trifocal tensor model [19]. The nominal state of the filter is also corrected. Since only three poses are required in the filter state vector, the oldest state is replaced with the current state and the error covariance is modified accordingly.

Dataset
Along with the odometry system described above, we used three paths (Path 1, Path 2 and Path 3) that Hu and Chen [20] selected from a publicly available and challenging dataset: the KITTI dataset [15]. Sample images of these paths are illustrated in Fig. 15. The data was captured in residential areas or on the highway using a recording platform. This platform was equipped with multiple sensors, including two high-resolution stereo cameras (gray level and color), a Velodyne laser scanner, and an OXTS RT 3003 GPS/IMU localization unit with RTK correction signals. All sensors contained in this data recording system had been calibrated and synchronized.
The resolution of the images used in this paper is 1241×376 pixels (Paths 1 and 2) or 1226×370 pixels (Path 3).

Experimental Design
In our study, the modified visual-IMU odometry system was utilized in four experiments. The three paths contained in the KITTI dataset [15] were used in each experiment and only the synchronized gray level images were utilized. Regarding each path, the GPS/IMU localization unit data was used as ground-truth while the pure IMU method was used as a baseline. Both RMSE and the Hausdorff distance [12] were used as position error measures. In the first experiment, we tested the three proposed multi-channel feature sets, DERP+IMGP, GRADP+IMGP and ICIMGP, together with the Fast Hessian salient point detector [2]. In addition to the IMU data, three state-of-the-art feature sets, image patches (IMGP) [34], SIFT [24] and SURF [2], were used as baselines. In the second experiment, we investigated the effect of the size of image patches on the best multi-channel feature set found in the first experiment. We further examined the effect of salient point detectors on the best multi-channel feature set in the third experiment. Finally, in the fourth experiment we applied Principal Component Analysis (PCA) [35] to the best multi-channel feature set in order to reduce the dimensionality of feature vectors, and tested the compressed feature set together with the Fast Hessian detector [2].
It can be seen that: (1) IMU often suffers from error accumulation when the path is long and complicated (see Figs. 16 (a) and (c)), but performs properly when the path is straight and the speed is high (see Fig. 16 (b)); this should be attributed to the fact that IMU is suited to high speed, linear motion; (2) the combination of a monocular camera and IMU reduces the drift issue; (3) the three multi-channel feature sets perform better than the gray level image patch feature set, and outperform, or perform comparably to, SIFT [24] and SURF [2]; (4) the multi-channel feature set using integral channels, i.e., ICIMGP, outperforms all its counterparts tested in this experiment regardless of whether the experiment is conducted in the residential area or on the highway; (5) no matter which performance measure is used, the performances of all the methods obtained on the different paths are similar.
In addition, we compare the results obtained in this experiment with those derived using the pure monocular visual odometry method [16] (referred to as "VO") by Hu and Chen [20]. The position RMSE values obtained using this method on the three paths are 33.9685 m, 596.3744 m and 211.2474 m respectively [20] (note that the Hausdorff distance data is not available in this case). As illustrated in Table 1, the position RMSE values derived using the VIO methods based on the three proposed multi-channel feature sets are much smaller than these values.
Since the multi-channel feature set based on integral channels performs better than the other multi-channel feature sets, we only examine it in the following experiments.
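The PCA compression used in the fourth experiment can be sketched as follows; this is a minimal eigen-decomposition version, and the exact PCA configuration of [35] is not specified in the text:

```python
import numpy as np

def pca_compress(features, n_components):
    """Reduce the dimensionality of a feature matrix (one descriptor per
    row) by projecting onto the top principal directions."""
    x = np.asarray(features, dtype=float)
    xc = x - x.mean(axis=0)               # center the data
    cov = xc.T @ xc / (len(x) - 1)        # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)  # eigenvalues in ascending order
    basis = eigvec[:, ::-1][:, :n_components]  # top principal directions
    return xc @ basis
```

Lower-dimensional descriptors speed up feature matching, which is consistent with the acceleration reported for the compressed ICIMGP features.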

The Effect of the Size of Image Patches
The advantage of the proposed integral channel image patch (ICIMGP) feature set over its counterparts was shown in the previous experiment. In this experiment, we investigated the effect of the size of image patches on the ICIMGP feature set. The size of image patches was set to 7×7, 9×9, 11×11, 13×13 and 15×15 pixels. The IMU data was used as a baseline.
The ground-truth trajectory and the trajectories obtained using IMU and the ICIMGP feature set with different sizes of image patches are shown in Fig. 17. In addition, the overall position RMSE and Hausdorff distance values are reported in Table 2. It can be observed that: (1) the ICIMGP feature set performs better than, or comparably to, the IMU method when different sizes of image patches are used; (2) the ICIMGP feature set performs better with 11×11 image patches than with the other patch sizes; and (3) no matter whether the experiment is performed in the residential area or on the highway, the ICIMGP feature set performs properly with the different sizes of image patches. It should be noted that the performance of ICIMGP is not improved when the size of image patches exceeds 11×11 pixels. This finding is similar to that observed by Gauglitz et al. [14].

Conclusions
In this paper, we proposed three different sets of multi-channel image patch features for monocular visual-IMU odometry. Compared to the single gray level channel image feature sets, the proposed feature sets are able to exploit richer image characteristics. We modified the monocular visual-IMU odometry system that Hu and Chen [20] proposed so that different salient point detectors or image feature sets can be incorporated into it. A self-adaptive inlier selection scheme was also used to replace the original scheme. This new scheme prevents the system from crashing when insufficient inliers are selected. In addition to the RMSE metric, we used the Hausdorff distance [12] to measure the inconsistency between the estimated and ground-truth trajectories. This distance measure compares two trajectories based on both local and global similarity and is more perceptually-consistent than RMSE. The modified odometry system was applied to three different paths [15] in order to test the proposed multi-channel feature sets. The GPS/IMU localization unit data with regard to each path was used as ground-truth. The pure IMU method and the gray level image patch [34], SIFT [24] and SURF [2] feature sets were used as baselines.
Experimental results show the advantages of the multi-channel image feature sets over single gray level channel feature sets. In particular, the multi-channel feature set based on the 11×11 integral channel [9] image patches outperformed all its counterparts tested in this study, including two state-of-the-art feature sets, SIFT [24] and SURF [2], when it was used together with the Fast Hessian salient point detector [2]. It is noteworthy that the performance was slightly enhanced while the feature matching speed was dramatically accelerated when PCA [35] was used to reduce the dimensionality of the ICIMGP feature vectors. We attribute these promising results to the fact that the multi-channel feature sets encode more diverse image characteristics than their single gray level channel counterparts.

Fig. 2
Fig. 2 Two derivative maps: (a) the derivative map computed in the x direction and (b) the derivative map calculated in the y direction.

Fig. 3
Fig. 3 Gradient channel maps: (a) the gradient magnitude map and (b) the gradient orientation map.

3.3 Integral Channels
Dollár et al. [9] introduced a set of integral channel features based on linear and non-linear transformations of images. Compared with traditional gray level or color features, the integral channel features provide more diverse but heterogeneous information. The integral channel method contains the normal channel(s), the gradient magnitude channel and six gradient histogram channels without smoothing. If the input image $I$ is a gray level image, the normal channel is the same as this image. In contrast, when a color image is considered, each color channel is treated as an individual normal channel. In addition to the normal channel(s), the other channels are computed from linear or non-linear transformations of $I$ [9]. The eight integral channel maps of the gray level image shown in Fig. 1 are displayed in Fig. 4.
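A minimal sketch of the eight channel maps for a gray level input, assuming the common magnitude-weighted, unsmoothed gradient-histogram construction of [9]; the derivative filter and bin boundaries here are our choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def integral_channels(image, n_bins=6, sigma=1.0):
    """Compute eight channel maps for a gray level image: the gray level
    channel, the gradient magnitude channel and n_bins gradient
    histogram channels (magnitude-weighted orientation bins)."""
    img = image.astype(np.float64)
    ix = gaussian_filter(img, sigma, order=(0, 1))
    iy = gaussian_filter(img, sigma, order=(1, 0))
    mag = np.sqrt(ix ** 2 + iy ** 2)
    ori = np.arctan2(iy, ix) % np.pi  # fold orientation into [0, pi)
    channels = [img, mag]
    for b in range(n_bins):  # quantize orientation into n_bins bins
        lo, hi = b * np.pi / n_bins, (b + 1) * np.pi / n_bins
        mask = (ori >= lo) & (ori < hi)
        # each histogram channel holds the magnitude of pixels in its bin
        channels.append(np.where(mask, mag, 0.0))
    return channels
```

Because every pixel falls into exactly one orientation bin, the six histogram channels sum back to the gradient magnitude channel, a useful consistency check.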

Fig. 4
Fig. 4 Eight integral channel maps, including (a) the gray level channel, (b) the gradient magnitude channel and (c-h) six gradient histogram channels.


Fig. 5 Image patch exemplars (11×11) extracted from the two derivative channel maps computed in: (a) the x direction and (b) the y direction.

Fig. 6 Image patch exemplars (11×11) extracted from the two gradient channel maps: (a) the gradient magnitude map and (b) the gradient orientation map.

Fig. 7 Image patch exemplars (11×11) extracted from the eight integral channel maps shown in Fig. 4.

Fig. 8
Fig. 8 The pipeline of the extraction process of the "DERP+IMGP" feature set.

Fig. 9
Fig. 9 The pipeline of the extraction process of the "GRADP+IMGP" feature set.

Fig. 11
Fig. 11 Illustration of the point-wise matching used by RMSE: (a) the matching from trajectory P to Q and (b) the matching from trajectory Q to P.

Fig. 12
Fig. 12 Illustration of the Hausdorff distance computation: (a) the computation from trajectory P to Q and (b) the computation from trajectory Q to P.