
1 Introduction

Simultaneous localization and mapping (SLAM) [1, 2, 3] is widely used in autonomous driving, drones, AR/VR and other fields. SLAM methods fall mainly into two categories: lidar SLAM based on laser scanners and visual SLAM based on cameras. Lidar is not affected by changes in illumination and can accurately measure the depth of the environment, so it has been widely adopted in autonomous driving in recent years. However, high-precision lidar is expensive, its point clouds are prone to distortion during high-speed motion, and it tends to degrade in structurally repetitive scenes such as indoor corridors. In contrast, cameras cost far less than lidar, can make full use of environmental texture information, and perform better than lidar in such scenes. The difficulty of visual SLAM lies in minimizing feature mismatches and errors in the pose estimation process.

The camera and the inertial measurement unit (IMU) are highly complementary. IMU measurements are not affected by environmental characteristics, and their accurate short-term estimation of fast motion can compensate for the camera's weakness in estimating fast motion, while camera measurements can correct the accumulated drift of the IMU. The IMU can also be used to correct the point cloud distortion caused by high-speed lidar motion. Therefore, SLAM that fuses vision or lidar with inertial measurements has become a research hotspot in recent years. There are usually two schemes for fusing inertial sensors: one is a filtering scheme based on the extended Kalman filter (EKF) or the error-state Kalman filter (ESKF) [4, 5], such as MSCKF [6, 7] and FAST-LIO [8]; the other is an optimization-based scheme [9, 10], represented by VINS-Mono [11] and LIO-SAM [12].

However, current SLAM schemes still suffer from several common problems.

  • In scenes where the environmental texture is sparse or concentrated in local regions, feature extraction is difficult or the extracted feature points cluster locally, which reduces the operating efficiency of the system.

  • Feature mismatch affects system stability.

  • The cumulative error of the IMU degrades system performance.

In response to the above problems, this paper proposes a tightly coupled stereo visual-inertial SLAM scheme built on the VINS-Fusion [13, 14] framework. The main contributions of this paper are as follows:

  • A uniform feature extraction mechanism is proposed to ensure an even distribution of feature points and effectively improve the accuracy of the system.

  • An improved RANSAC-based method for eliminating feature mismatches is proposed to improve the rejection of mismatched features.

  • For the back-end IMU error term, the Huber kernel function is used to keep the IMU error within an appropriate range.

The proposed method was tested on a public dataset, verifying that the system achieves good accuracy and robustness in various scenarios. A real vehicle platform was also built for testing, demonstrating the effectiveness of the proposed method.

2 Overview

The structure of the stereo visual-inertial fusion SLAM system proposed in this paper is shown in Fig. 1. The system starts with front-end data preprocessing, including feature extraction and tracking, triangulation of feature points to obtain depth information, and pre-integration of the IMU measurements between two consecutive camera frames. In the back-end nonlinear optimization, the observations extracted by the front end are used to construct error constraints, including the visual re-projection error, the IMU pre-integration error, and the prior error from sliding-window marginalization. Loop detection is based on the DBoW2 [15] bag-of-words model.

Fig. 1
Frame diagram of stereo vision and inertial fusion system

2.1 Visual Processing

For each new image, FAST corner points are extracted and existing features are tracked by the KLT [16] sparse optical flow method; a minimum pixel distance between two adjacent features is enforced to achieve a uniform distribution of feature points. An improved RANSAC [17] method is used to eliminate feature mismatches: first, cross-validation is used for coarse rejection, and then a two-point RANSAC is used for fine rejection to obtain correctly matched feature points.
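A minimal Python/OpenCV sketch of this front end is given below. It assumes grayscale frames and feature points in OpenCV's (N, 1, 2) float32 layout, uses a standard fundamental-matrix RANSAC as a stand-in for the two-point RANSAC described above, and the values of MIN_DIST and MAX_FEATURES are illustrative rather than those of the actual implementation.

```python
import cv2
import numpy as np

MIN_DIST = 30        # minimum pixel spacing between features (assumed value)
MAX_FEATURES = 150   # feature budget per frame (assumed value)

def track_and_replenish(prev_img, curr_img, prev_pts):
    """prev_pts: existing features as a float32 array of shape (N, 1, 2)."""
    # 1. KLT sparse optical flow tracking of the existing features
    curr_pts, st, _ = cv2.calcOpticalFlowPyrLK(prev_img, curr_img, prev_pts, None)
    # coarse rejection by cross-validation: track back and keep consistent pairs
    back_pts, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_img, prev_img, curr_pts, None)
    fb_err = np.linalg.norm(prev_pts - back_pts, axis=2).ravel()
    good = (st.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < 1.0)
    p0, p1 = prev_pts[good], curr_pts[good]

    # 2. fine rejection with RANSAC on the epipolar constraint
    #    (a stand-in for the two-point RANSAC used in the paper)
    if len(p0) >= 8:
        _, inliers = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
        if inliers is not None:
            p0, p1 = p0[inliers.ravel() == 1], p1[inliers.ravel() == 1]

    # 3. uniform replenishment: mask out a MIN_DIST neighborhood around each
    #    surviving feature, then detect new FAST corners only in free regions
    mask = np.full(curr_img.shape[:2], 255, dtype=np.uint8)
    for x, y in p1.reshape(-1, 2):
        cv2.circle(mask, (int(x), int(y)), MIN_DIST, 0, -1)
    need = MAX_FEATURES - len(p1)
    if need > 0:
        fast = cv2.FastFeatureDetector_create(threshold=20)
        kps = sorted(fast.detect(curr_img, mask), key=lambda k: -k.response)[:need]
        new_pts = np.array([k.pt for k in kps], np.float32).reshape(-1, 1, 2)
        if len(new_pts):
            p1 = np.vstack([p1, new_pts])
    return p1
```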

Another task in this step is the selection of keyframes in the sliding window, for which there are two criteria. The first is whether the average disparity between the current frame and the previous keyframe exceeds a set threshold; if so, the current frame is selected as a new keyframe. The second is whether the number of features tracked in the current frame falls below a certain threshold; if so, the frame is also treated as a keyframe. This second criterion avoids the complete loss of feature tracks.
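The two criteria can be summarized in a few lines; the thresholds MIN_PARALLAX and MIN_TRACKED below are placeholders rather than the values used in our system.

```python
import numpy as np

MIN_PARALLAX = 10.0   # average disparity threshold in pixels (assumed value)
MIN_TRACKED = 50      # minimum number of tracked features (assumed value)

def is_keyframe(pts_prev_kf, pts_curr, num_tracked):
    """pts_prev_kf, pts_curr: (N, 2) pixel coordinates of features co-visible
    in the previous keyframe and the current frame; num_tracked: number of
    features successfully tracked in the current frame."""
    # criterion 1: average disparity w.r.t. the previous keyframe
    if len(pts_curr) > 0:
        avg_disparity = np.mean(np.linalg.norm(pts_curr - pts_prev_kf, axis=1))
        if avg_disparity > MIN_PARALLAX:
            return True
    # criterion 2: too few tracked features -> avoid losing the feature track
    return num_tracked < MIN_TRACKED
```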

2.2 IMU Preintegration

The acceleration \({\widehat{a}}_{t}\) and angular velocity \({\widehat{\omega }}_{t}\) measured by the IMU contain noise and are modeled as follows.

$$ \left\{ {\begin{array}{*{20}l} {\hat{a}_{t} = a_{t} + b_{{a_{t} }} + n_{a} + R_{w}^{t} g^{w} } \hfill \\ {\hat{\omega }_{t} = \omega_{t} + b_{{\omega_{t} }} + n_{\omega } } \hfill \\ \end{array} } \right. $$
(1)

where \({a}_{t}\) and \({\omega }_{t}\) are the true values of acceleration and angular velocity. \({n}_{a}\) and \({n}_{\omega }\) are the Gaussian white noise of the accelerometer and gyroscope measurements, \({n}_{a}\sim N(0,{\sigma }_{a}^{2})\), \({n}_{\omega }\sim N(0,{\sigma }_{\omega }^{2})\). \({b}_{{a}_{t}}\) and \({b}_{{\omega }_{t}}\) are the accelerometer and gyroscope biases, modeled as random walks whose derivatives are Gaussian white noise. \({R}_{w}^{t}\) is the rotation matrix from the world coordinate system to the current IMU coordinate system. \({{\text{g}}}^{w}\) is the gravity vector in the world coordinate system.
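To make the model concrete, the following sketch generates synthetic measurements according to Eq. (1); the noise and random-walk parameters are illustrative assumptions, not calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA_A, SIGMA_W = 0.02, 0.001    # measurement noise std deviations (assumed)
WALK_A, WALK_W = 1e-4, 1e-5       # bias random-walk step sizes (assumed)
g_w = np.array([0.0, 0.0, 9.81])  # gravity in the world frame

def measure(a_t, w_t, b_a, b_w, R_w_t):
    """a_t, w_t: true acceleration and angular velocity; b_a, b_w: current
    biases; R_w_t: 3x3 rotation from the world frame to the IMU frame."""
    a_hat = a_t + b_a + rng.normal(0.0, SIGMA_A, 3) + R_w_t @ g_w
    w_hat = w_t + b_w + rng.normal(0.0, SIGMA_W, 3)
    # the biases evolve as random walks: their increments are white noise
    b_a = b_a + rng.normal(0.0, WALK_A, 3)
    b_w = b_w + rng.normal(0.0, WALK_W, 3)
    return a_hat, w_hat, b_a, b_w
```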

Taking image frame \({b}_{k}\) as the reference, the position, velocity and rotation obtained from the IMU measurements at frame \({b}_{k+1}\) are modeled as follows [18].

$$ \left\{ {\begin{array}{*{20}l} {p_{{b_{k + 1} }}^{{b_{k} }} = R_{w}^{{b_{k} }} \left( {p_{{b_{k} }}^{w} + v_{{b_{k} }}^{w} \Delta t_{k} - \frac{1}{2}g^{w} \Delta t_{k}^{2} } \right) + \alpha_{{b_{k + 1} }}^{{b_{k} }} } \hfill \\ {v_{{b_{k + 1} }}^{{b_{k} }} = R_{w}^{{b_{k} }} \left( {v_{{b_{k} }}^{w} - g^{w} \Delta t_{k} } \right) + \beta_{{b_{k + 1} }}^{{b_{k} }} } \hfill \\ {q_{{b_{k + 1} }}^{{b_{k} }} = \gamma_{{b_{k + 1} }}^{{b_{k} }} } \hfill \\ \end{array} } \right. $$
(2)

where

$$ \left\{ {\begin{array}{*{20}l} {\alpha_{{b_{k + 1} }}^{{b_{k} }} = \iint_{{t \in \left[ {t_{k} ,t_{k + 1} } \right]}} {R_{t}^{{b_{k} }} \left( {\hat{a}_{t} - b_{{a_{t} }} } \right)}\,dt^{2} } \hfill \\ {\beta_{{b_{k + 1} }}^{{b_{k} }} = \int_{{t \in \left[ {t_{k} ,t_{k + 1} } \right]}} {R_{t}^{{b_{k} }} \left( {\hat{a}_{t} - b_{{a_{t} }} } \right)\,dt} } \hfill \\ {\gamma_{{b_{k + 1} }}^{{b_{k} }} = \int_{{t \in \left[ {t_{k} ,t_{k + 1} } \right]}} {\frac{1}{2}q_{t}^{{b_{k} }} \otimes \left( {\hat{\omega }_{t} - b_{{\omega_{t} }} } \right)\,dt} } \hfill \\ \end{array} } \right. $$
(3)

Since the IMU measurement frequency is higher than that of the camera, there are multiple inertial measurements within the time interval \(\Delta {t}_{k}\in \left[{t}_{k},{t}_{k+1}\right]\), so the position \({p}_{{b}_{k+1}}^{{b}_{k}}\), velocity \({v}_{{b}_{k+1}}^{{b}_{k}}\) and rotation \({q}_{{b}_{k+1}}^{{b}_{k}}\) of the IMU over this interval are preintegrated, where \({\alpha }_{{b}_{k+1}}^{{b}_{k}}\) is the position preintegration term, \({\beta }_{{b}_{k+1}}^{{b}_{k}}\) is the velocity preintegration term, and \({\gamma }_{{b}_{k+1}}^{{b}_{k}}\) is the rotation preintegration term. Through the above formulas, the continuously changing state can be expressed with frame \({b}_{k}\) as the reference.
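For illustration, a simplified Euler-integration sketch of Eq. (3) over the raw IMU samples between two frames is shown below; the actual implementation in VINS-Fusion uses midpoint integration and also propagates the covariance and bias Jacobians, which are omitted here. Quaternions follow scipy's (x, y, z, w) convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate(acc, gyr, dt, ba, bw):
    """acc, gyr: (M, 3) raw IMU samples between frames b_k and b_{k+1};
    dt: (M,) sample intervals; ba, bw: current bias estimates."""
    alpha = np.zeros(3)          # position preintegration term
    beta = np.zeros(3)           # velocity preintegration term
    gamma = R.identity()         # rotation preintegration term
    for a_m, w_m, dti in zip(acc, gyr, dt):
        a = gamma.apply(a_m - ba)              # rotate sample into frame b_k
        alpha += beta * dti + 0.5 * a * dti ** 2
        beta += a * dti
        gamma = gamma * R.from_rotvec((w_m - bw) * dti)
    return alpha, beta, gamma.as_quat()        # quaternion in (x, y, z, w) order
```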

When the estimate of the IMU bias changes by a small amount, the preintegration measurements are corrected with the following first-order approximation to obtain more accurate values. This saves considerable computing resources for the optimization-based algorithm, because the IMU measurements do not need to be repeatedly re-propagated.

$$ \left\{ \begin{gathered} \alpha_{{b_{k + 1} }}^{{b_{k} }} \approx \hat{\alpha }_{{b_{k + 1} }}^{{b_{k} }} + J_{{b_{a} }}^{\alpha } \delta b_{{a_{k} }} + J_{{b_{\omega } }}^{\alpha } \delta b_{{\omega_{k} }} \hfill \\ \beta_{{b_{k + 1} }}^{{b_{k} }} \approx \hat{\beta }_{{b_{k + 1} }}^{{b_{k} }} + J_{{b_{a} }}^{\beta } \delta b_{{a_{k} }} + J_{{b_{\omega } }}^{\beta } \delta b_{{\omega_{k} }} \hfill \\ \gamma_{{b_{k + 1} }}^{{b_{k} }} \approx \hat{\gamma }_{{b_{k + 1} }}^{{b_{k} }} \otimes \left[ \begin{gathered} 1 \\ \frac{1}{2}J_{{b_{\omega } }}^{\gamma } \delta b_{{\omega_{k} }} \\ \end{gathered} \right] \hfill \\ \end{gathered} \right. $$
(4)

where \(J\) is the Jacobian matrix of the corresponding preintegration term with respect to the bias.
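A sketch of the correction in Eq. (4), assuming the Jacobians have already been accumulated during preintegration and using a small-angle rotation update for the \(\gamma \) term:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def correct_preintegration(alpha_hat, beta_hat, gamma_hat,
                           J_a_ba, J_a_bw, J_b_ba, J_b_bw, J_g_bw,
                           delta_ba, delta_bw):
    """gamma_hat: quaternion in (x, y, z, w) order; J_*: 3x3 Jacobians of the
    preintegration terms w.r.t. the biases; delta_*: small bias changes."""
    alpha = alpha_hat + J_a_ba @ delta_ba + J_a_bw @ delta_bw
    beta = beta_hat + J_b_ba @ delta_ba + J_b_bw @ delta_bw
    # the quaternion [1, 0.5 * J * delta_bw] of Eq. (4) is the small-angle
    # rotation whose rotation vector is J_g_bw @ delta_bw
    gamma = R.from_quat(gamma_hat) * R.from_rotvec(J_g_bw @ delta_bw)
    return alpha, beta, gamma.as_quat()
```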

3 Tightly Coupled Stereo VIO

Once the visual and IMU measurements have been preprocessed and the system has been initialized successfully, the tightly coupled sliding-window nonlinear optimization can be performed, as shown in Fig. 2.

Fig. 2
Diagram of a tightly coupled sliding window

3.1 Optimization Variables

All variables \(\chi \) to be optimized in the sliding window are defined as follows.

$$ \left\{ \begin{gathered} \chi = \left[ {x_{0} ,x_{1} , \cdots x_{n} ,\lambda_{0} ,\lambda_{1} , \cdots \lambda_{m} } \right] \hfill \\ x_{k} = \left[ {p_{{b_{k} }}^{w} ,v_{{b_{k} }}^{w} ,q_{{b_{k} }}^{w} ,b_{a} ,b_{\omega } } \right],k \in \left[ {0,n} \right] \hfill \\ \end{gathered} \right. $$
(5)

where \({x}_{k}\) is the IMU state at the kth camera frame, including the position, velocity and rotation of the IMU in the world coordinate system at that time, as well as the accelerometer bias and gyroscope bias; n is the total number of keyframes in the sliding window; m is the number of feature points in the sliding window; \({\lambda }_{i}\) is the inverse of the depth at which the ith feature point was first observed.
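The layout of Eq. (5) can be summarized as the following data structure; the names are illustrative, and in a typical Ceres-based implementation these would simply be raw parameter blocks.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameState:          # x_k in Eq. (5)
    p: np.ndarray          # position p^w_{b_k}, shape (3,)
    v: np.ndarray          # velocity v^w_{b_k}, shape (3,)
    q: np.ndarray          # orientation q^w_{b_k}, quaternion, shape (4,)
    b_a: np.ndarray        # accelerometer bias, shape (3,)
    b_w: np.ndarray        # gyroscope bias, shape (3,)

@dataclass
class SlidingWindowState:  # chi in Eq. (5)
    frames: list           # [x_0, ..., x_n]
    inv_depths: list       # [lambda_0, ..., lambda_m]
```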

The state variables are optimized by constructing an overall optimization model within the sliding window and minimizing the sum of all residuals.

$$ \mathop {\min }\limits_{\chi } \left\{ {\left\| {r_{p} - H_{p} \chi } \right\|^{2} + \sum\limits_{{k \in {\rm B}}} {\rho \left\| {r_{\rm B} \left( {\hat{z}_{{b_{k + 1} }}^{{b_{k} }} ,\chi } \right)} \right\|_{{P_{{b_{k + 1} }}^{{b_{k} }} }}^{2} } + \sum\nolimits_{{\left( {l,j} \right) \in C}} {\rho \left\| {r_{C} \left( {\hat{z}_{{b_{l} }}^{{C_{j} }} ,\chi } \right)} \right\|_{{P_{l}^{{C_{j} }} }}^{2} } } \right\} $$
(6)

where \(\rho \) is the Huber norm.

$$ \rho \left( e \right) = \left\{ {\begin{array}{*{20}l} {e,} \hfill & {\left| e \right| \le 1} \hfill \\ {2\sqrt e - 1,} \hfill & {\left| e \right| > 1} \hfill \\ \end{array} } \right. $$
(7)

\({r}_{B}({\widehat{z}}_{{b}_{k+1}}^{{b}_{k}},\chi )\) and \({r}_{C}({\widehat{z}}_{{b}_{l}}^{{C}_{j}},\chi )\) are the residual terms of the IMU and visual measurements, respectively. B is the set of all IMU measurements, and C is the set of feature points within the current sliding window. \(\left\{{r}_{p},{H}_{p}\right\}\) is the prior information from marginalization, which is used to construct the prior residual term. The Ceres solver [19] is used to solve the entire nonlinear optimization problem.
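For reference, Eq. (7) can be written directly as a small function; in a Ceres-based implementation the same role is played by a robust loss such as ceres::HuberLoss rather than by hand-written code.

```python
import numpy as np

def huber(e):
    """Huber norm rho(e) of Eq. (7), applied to a non-negative squared residual."""
    e = np.asarray(e, dtype=float)
    return np.where(e <= 1.0, e, 2.0 * np.sqrt(e) - 1.0)
```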

3.2 IMU Residual Block

The residual term of the IMU measurements between two consecutive frames \({b}_{k}\) and \({b}_{k+1}\) within the sliding window is constructed from the IMU preintegration.

$$ r_{B} \left( {\hat{z}_{{b_{k + 1} }}^{{b_{k} }} ,\chi } \right) = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {R_{w}^{{b_{k} }} \left( {p_{{b_{k + 1} }}^{w} - p_{{b_{k} }}^{w} - v_{{b_{k} }}^{w} \Delta t_{k} + \frac{1}{2}g^{w} \Delta t_{k}^{2} } \right) - \hat{\alpha }_{{b_{k + 1} }}^{{b_{k} }} } \\ {R_{w}^{{b_{k} }} \left( {v_{{b_{k + 1} }}^{w} - v_{{b_{k} }}^{w} + g^{w} \Delta t_{k} } \right) - \hat{\beta }_{{b_{k + 1} }}^{{b_{k} }} } \\ \end{array} } \\ {2\left[ {q_{{b_{k + 1} }}^{w} \otimes \left( {q_{{b_{k} }}^{w} } \right)^{ - 1} \otimes \left( {\hat{\gamma }_{{b_{k + 1} }}^{{b_{k} }} } \right)^{ - 1} } \right]_{{xy{\text{z}}}} } \\ {b_{{a_{k + 1} }} - b_{{a_{k} }} } \\ {b_{{\omega_{k + 1} }} - b_{{\omega_{k} }} } \\ \end{array} } \right] $$
(8)

where [·]xyz means to extract the vector part of the quaternion.
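The residual of Eq. (8) can be evaluated directly from the window states and the preintegration terms. The numpy sketch below assumes quaternions in scipy's (x, y, z, w) order and gravity along +z, and omits the weighting by the covariance \({P}_{{b}_{k+1}}^{{b}_{k}}\).

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

GRAVITY = np.array([0.0, 0.0, 9.81])   # g^w assumed along +z

def imu_residual(p_k, v_k, q_k, ba_k, bw_k,        # state at frame b_k
                 p_k1, v_k1, q_k1, ba_k1, bw_k1,   # state at frame b_{k+1}
                 alpha_hat, beta_hat, gamma_hat, dt):
    """Quaternions q_k, q_k1, gamma_hat are in (x, y, z, w) order."""
    R_kw = R.from_quat(q_k).inv()                  # R^{b_k}_w
    r_p = R_kw.apply(p_k1 - p_k - v_k * dt + 0.5 * GRAVITY * dt ** 2) - alpha_hat
    r_v = R_kw.apply(v_k1 - v_k + GRAVITY * dt) - beta_hat
    # rotation residual: twice the vector part of
    # q^w_{b_{k+1}} (x) (q^w_{b_k})^(-1) (x) (gamma_hat)^(-1), as in Eq. (8)
    dq = R.from_quat(q_k1) * R.from_quat(q_k).inv() * R.from_quat(gamma_hat).inv()
    r_q = 2.0 * dq.as_quat()[:3]
    return np.concatenate([r_p, r_v, r_q, ba_k1 - ba_k, bw_k1 - bw_k])
```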

3.3 Visual Residual Block

Assuming that the observations of feature point l on the normalized planes of frame i and frame j are \({p}_{l}^{{C}_{i}}({u}_{{C}_{i}},{v}_{{C}_{i}})\) and \({p}_{l}^{{C}_{j}}({u}_{{C}_{j}},{v}_{{C}_{j}})\), respectively, the visual reprojection error is constructed by back-projecting \({p}_{l}^{{C}_{i}}\) into frame j.

$$ \left\{ \begin{gathered} r_{C} \left( {\hat{z}_{l}^{{C_{j} }} ,\chi } \right) = \overline{p}_{l}^{{C_{j} }} - p_{l}^{{C_{j} }} \hfill \\ \overline{p}_{l}^{{C_{j} }} = T_{b}^{C} T_{w}^{{b_{j} }} T_{{b_{i} }}^{w} T_{C}^{b} \frac{1}{{\lambda_{l} }}p_{l}^{{C_{i} }} \hfill \\ \end{gathered} \right. $$
(9)

where \({\overline{p} }_{l}^{{C}_{j}}\) is the point obtained by back-projecting \({p}_{l}^{{C}_{i}}\) into the jth frame, T denotes the transformation matrices between the different coordinate systems, and \({\lambda }_{l}\) is the inverse depth of feature point l at its first observation.
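A sketch of Eq. (9) with homogeneous 4 × 4 transforms, where T_w_bi and T_w_bj denote body-to-world poses and T_b_c the camera-to-body extrinsic (the names are illustrative, not the actual interface).

```python
import numpy as np

def visual_residual(p_l_ci, p_l_cj, inv_depth, T_b_c, T_w_bi, T_w_bj):
    """p_l_ci, p_l_cj: normalized-plane observations (u, v) of feature l in
    frames i and j; inv_depth: lambda_l; T_b_c: camera-to-body extrinsic;
    T_w_bi, T_w_bj: body-to-world poses of frames i and j (4x4 matrices)."""
    # back-project the frame-i observation to 3-D in camera i, then transform
    # camera_i -> body_i -> world -> body_j -> camera_j, as in Eq. (9)
    P_ci = np.array([p_l_ci[0], p_l_ci[1], 1.0]) / inv_depth
    T = np.linalg.inv(T_b_c) @ np.linalg.inv(T_w_bj) @ T_w_bi @ T_b_c
    P_cj = (T @ np.append(P_ci, 1.0))[:3]
    p_bar = P_cj[:2] / P_cj[2]          # project back onto the normalized plane
    return p_bar - np.asarray(p_l_cj)
```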

3.4 Marginalization

To reduce the computational complexity of the sliding-window-based VIO optimization, a marginalization strategy is adopted, and the marginalized measurements are converted into prior information. This keeps the number of keyframes in the sliding window at the set value.

As shown in Fig. 3, when a new frame arrives, if the latest frame in the sliding window is a keyframe, the oldest frame in the sliding window and its corresponding measurements are marginalized. Otherwise, the latest frame in the current sliding window is marginalized but its IMU measurements are retained, and the current frame enters the sliding window as the new latest frame.

Fig. 3
Illustration of marginalization strategy

Marginalization is implemented using the Schur complement [20]. Prior residuals are constructed from all marginalized measurements to further constrain the optimization of the state variables.
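A dense numpy sketch of this Schur-complement step is shown below; after factorization, H_prior and b_prior yield the prior term \(\left\{{r}_{p},{H}_{p}\right\}\) in Eq. (6). The index sets are assumptions about how the states are stacked in the Hessian.

```python
import numpy as np

def marginalize(H, b, m_idx, r_idx):
    """H, b: Gauss-Newton system H dx = b over all stacked states;
    m_idx, r_idx: index arrays of the marginalized and remaining states."""
    H_mm = H[np.ix_(m_idx, m_idx)]
    H_mr = H[np.ix_(m_idx, r_idx)]
    H_rm = H[np.ix_(r_idx, m_idx)]
    H_rr = H[np.ix_(r_idx, r_idx)]
    H_mm_inv = np.linalg.pinv(H_mm)           # pseudo-inverse for robustness
    H_prior = H_rr - H_rm @ H_mm_inv @ H_mr   # Schur complement
    b_prior = b[r_idx] - H_rm @ H_mm_inv @ b[m_idx]
    return H_prior, b_prior
```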

4 Experimental Results

To evaluate the accuracy and robustness of the proposed system, the algorithm was tested and verified on a public dataset and on a self-built real vehicle platform. The EuRoC public dataset [21] was selected to compare the proposed algorithm with the stereo visual-inertial fusion algorithm VINS-Fusion, and the self-built real vehicle platform was used to verify the effectiveness of the proposed algorithm.

4.1 Dataset Experiment

To verify the localization accuracy of the proposed algorithm, the four sequences MH_04_difficult, MH_05_difficult, V1_03_difficult and V2_03_difficult of the EuRoC visual-inertial dataset were used for testing. The data in this dataset were collected with sensors mounted on a UAV and include stereo images, IMU measurements and ground-truth data. In this experiment, the proposed algorithm is compared with the VINS-Fusion algorithm, and the comparison results are shown in Table 1.

Table 1 Algorithm error comparison in dataset experiment

It can be seen from Table 1 that the relative pose error (RPE) of the proposed algorithm is smaller than that of VINS-Fusion on all four selected sequences, and the root mean square error (RMSE) is reduced by 12.41% on average, which demonstrates the effectiveness and robustness of the proposed algorithm.
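For readers unfamiliar with the metric, the translational RPE RMSE reported in Table 1 can be computed roughly as follows; this is a generic sketch of the standard definition over time-aligned pose pairs, not the exact evaluation script used here.

```python
import numpy as np

def rpe_rmse(gt, est, delta=1):
    """gt, est: time-aligned lists of 4x4 homogeneous poses; returns the RMSE
    of the translational relative pose error with frame step `delta`."""
    errs = []
    for i in range(len(gt) - delta):
        dQ = np.linalg.inv(gt[i]) @ gt[i + delta]     # ground-truth motion
        dP = np.linalg.inv(est[i]) @ est[i + delta]   # estimated motion
        E = np.linalg.inv(dQ) @ dP                    # relative error
        errs.append(np.linalg.norm(E[:3, 3]))
    return np.sqrt(np.mean(np.square(errs)))
```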

Figure 4 compares the trajectories of the proposed algorithm and VINS-Fusion on the above four sequences. In the figure, ground truth is the real motion trajectory provided by the EuRoC dataset, and Ours denotes the proposed method. It can be seen from Fig. 4 that the trajectory of the proposed algorithm is closer to the real trajectory and its error is smaller.

Fig. 4
a Trajectory in V1_03_difficult. b Trajectory in V2_03_difficult. c Trajectory in MH_04_difficult. d Trajectory in MH_05_difficult

4.2 Real Scene Experiment

To verify the effectiveness of the proposed algorithm, experiments were carried out in a real scene. The experimental platform is a crawler chassis equipped with a ZED2i stereo camera and GNSS, as shown in Fig. 5. The camera, which has a built-in 6-axis IMU, is used to collect images and IMU data; GNSS data are recorded at the same time to produce ground truth for the real scene. To facilitate comparison and ensure that the two algorithms run on the same trajectory, the comparative experiments in the real scene were conducted on recorded datasets.

Fig. 5
Real vehicle test platform

The experimental environment, shown in Fig. 6, is a daytime scene with good lighting conditions. The crawler chassis test platform was driven along a random path in the open area of the electromechanical test building while the raw camera images, raw IMU data and GNSS positioning data were collected. The total length of the path is about 113 m.

Fig. 6
Test environment

Figure 7a compares the trajectory of the proposed method, the trajectory of VINS-Fusion, and the GNSS ground-truth trajectory on the recorded real-scene dataset. It can be seen that the overall positioning accuracy of the proposed algorithm is slightly better than that of VINS-Fusion and that its trajectory basically coincides with the real trajectory. Figure 7b, c shows the relative pose error of the improved method and of the original algorithm in the actual environment, and Table 2 compares the RPE of the two methods in the real scene. The error of the proposed algorithm is clearly lower than that of the original algorithm, which confirms the effectiveness of the proposed method.

Fig. 7
a Trajectory comparison chart in real scene. b The relative pose error of our method in real scenes. c The relative pose error of VINS-Fusion algorithm in real scene

Table 2 Algorithm error comparison in real scene experiment

5 Conclusion

Based on the VINS-Fusion system framework, this paper uses an improved RANSAC algorithm to reduce mismatches in the feature tracking step, iterating only over matched points whose Hamming distance is less than a set threshold. In the sliding-window-based nonlinear optimization, the Huber kernel function is applied to both the IMU residuals and the visual residuals to suppress excessively large residual terms. Comparative tests were carried out on the public EuRoC dataset. The experimental results show that the positioning root mean square error of the improved method is reduced by 12.41% on average compared with the original algorithm, which effectively improves the accuracy of the SLAM system; the experiment in the real scene also verified the effectiveness of the method.