Visual-LiDAR SLAM Based on Unsupervised Multi-channel Deep Neural Networks

Recently, deep learning techniques have been applied to solve visual or light detection and ranging (LiDAR) simultaneous localization and mapping (SLAM) problems. Supervised deep learning SLAM methods need ground truth data for training, but collecting such data is costly and labour-intensive. Unsupervised training strategies have been adopted by some visual or LiDAR SLAM methods. However, these methods only exploit the potential of single-sensor modalities, which do not take the complementary advantages of LiDAR and visual data. In this paper, we propose a novel unsupervised multi-channel visual-LiDAR SLAM method (MVL-SLAM) which can fuse visual and LiDAR data together. Our SLAM system consists of an unsupervised multi-channel visual-LiDAR odometry (MVLO) component, a deep learning–based loop closure detection component, and a 3D mapping component. The visual-LiDAR odometry component adopts a multi-channel recurrent convolutional neural network (RCNN). Its input consists of front, left, and right view depth images generated from 360∘\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$360^{\circ }$$\end{document} 3D LiDAR data and RGB images. We use the features from a deep convolutional neural network (CNN) for the loop closure detection component. Our SLAM method does not require ground truth data for training and can directly construct environmental 3D maps from the 3D mapping component. Experiments conducted on the KITTI odometry dataset have shown the rotation and translation errors are lower than some of the other unsupervised methods, including UnMono, SfmLearner, DeepSLAM, and UnDeepVO. Experimental results show that our methods have good performance. By fusing visual and LiDAR data, MVL-SLAM has higher accuracy and robustness of the pose estimation compared with other single-modal SLAM systems.


Introduction
Simultaneous localization and mapping (SLAM) plays a key role in many fields, such as autonomous robot navigation, localization [1,2], and self-driving systems. Traditional SLAM methods [3,4] use manual design features to carry out inter-frame matching and loop closure detection. However, the parameters of these methods have to be set in various scenarios to achieve better performance. Recently, with the advance of deep learning techniques, significant achievements of deep learning-based methods have been made [5,6]. Compared to traditional methods, deep learning methods can automatically extract features without manual designs, and they could perform better in some challenging scenes.
Supervised training methods are employed by some deep learning systems, which need ground truth data. However, the acquisition of ground truth data is difficult and expensive. Recently some researchers proposed unsupervised SLAM algorithms [7] to avoid the use of labelled data.
Unsupervised learning methods reduce the difficulty of acquiring training data, so it is easier to expand the training dataset. Given a larger dataset, the pose and depth estimation accuracy and robustness can be further increased.
Most unsupervised learning SLAM methods only use single-modal data like RGB images or light detection and ranging (LiDAR) data. Visual SLAM requires relatively stable lighting changes, and some of them only use monocular images, which cannot obtain the absolute scale directly. Compared to visual SLAM, LiDAR SLAM has higher accuracy. However, the collected 3D point cloud data are distorted due to the moving of LiDAR sensors. Moreover, the vertical resolution of LiDAR sensors is low.
In this paper, we propose a novel unsupervised visual-LiDAR SLAM method to compensate for the weakness of each sensor and explore the complementary advantages. Our SLAM consists of an unsupervised visual-LiDAR odometry component, a deep learning-based loop closure detection component, and a 3D mapping component. The core of our odometry component is a multi-channel recurrent convolutional neural network (RCNN). Its input includes monocular RGB images and multi-channel depth images generated from 3D LiDAR data. Our loop closure detection component adopts a convolutional neural network (CNN) to detect the loop closure. Then, the general graph optimization (g2o) [8] is used to conduct the global graph optimization. Finally, a global 3D map can be constructed from our 3D mapping component. Figure 1 shows the framework of our system.
Since the system adopts an unsupervised training method, no ground truth data is used. During the training process, consecutive RGB images and multi-channel depth images are fed into the network. The outputs of the network are 6D pose and 3D maps. Our experiments are based on the KITTI odometry dataset [9]. Results have shown that our SLAM and odometry are better than some of the state-of-the-art unsupervised visual odometry (VO) and SLAM methods in terms of translation and rotation accuracy.
Our main contributions are summarized as follows: • Our SLAM method fuses multi-modal data, including RGB images and LiDAR data, to estimate the pose and 3D map, improving the accuracy and robustness of pose estimation. • An unsupervised learning method is proposed for the multi-channel visual-LiDAR odometry component, which reduces the costs of training dataset collection. • Our loop closure detection is implemented with a CNN model to extract the loop closure information from RGB images and multi-channel depth maps.
The rest of the paper is organized as follows. "Related Works" describes the literature related to visual or LiDAR SLAM methods. "Our SLAM System" shows our SLAM method in detail. Experimental results are listed in "Experiments". And the last part is a summary and the future work.

Traditional Visual SLAM
Davison et al. [10] proposed a monocular SLAM method (MonoSLAM). MonoSLAM is the first real-time monocular SLAM, which is based on extended Kalman filter (EKF). MonoSLAM extracts Shi and Tomasi features [11] from monocular images to estimate the pose. Endres et al. [12] proposed RGB-D SLAM V2. Oriented fast and rotated brief (ORB) [13], features from scale-invariant (SIFT) [14], and speeded up robust features (SURF) [15] can be used in the feature extraction stage. Mur-Artal and Tardós [16] proposed ORB-SLAM2, expanding the previously proposed monocular SLAM to monocular, binocular, and RGB-D SLAM.

Supervised Visual SLAM
Since the significant achievement of deep learning methods in image recognition and classification tasks, many researchers have introduced deep learning methods into their visual SLAM methods. Compared with the traditional methods, deep learning-based methods automatically perform feature extractions, feature matching, or complex geometric operations, which makes the deep learning-based methods more attractive. Kendall et al. [17] firstly applied the CNN to VO. The PoseNet they proposed takes monocular images as the inputs, and the outputs are 6D poses. Handa et al.  [18] extended the spatial transform network [19] to deliver an RGB-D VO method. The network is inspired by the VGG-16 network [20], and it also can estimate depths from the monocular RGB images. Wang et al. [21] extended the CNN by adding a recurrent neural network (RNN) structure. The CNN extracts features, and the RNN performs feature matching. However, all the above methods need ground truth data.

Unsupervised Visual SLAM
The main advantage of unsupervised methods is no need for ground truth data which decreases the difficulty of acquiring training data. Zhou et al. [22] proposed an unsupervised system to recover poses and depth information from videos, paving the way for unsupervised visual VO methods. However, this system only uses RGB images, and the predicted results do not include an absolute scale. Mahjourian et al. [23] proposed an unsupervised system using 3D geometric constraints. Except for the pixel-wise or gradient-based information in small local neighborhoods, the system also considers the 3D geometry of a scene to enforce the consistency of the estimated 3D point clouds and ego-motion across consecutive frames. The principle is similar to iterated closest point (ICP) [24]. Liu et al. [25] proposed an unsupervised monocular VO method, but depth information is still needed for training. Li et al. [7] proposed an unsupervised monocular SLAM method (DeepSLAM). It requires stereo images for training and monocular images for testing. However, DeepSLAM does not use LiDAR data.

Traditional LiDAR SLAM
Compared with visual methods, LiDAR SLAM methods have higher accuracy. However, the acquisition frequency of LiDAR sensors is lower. Zhang and Singh [3] proposed a LiDAR odometry method (LOAM), which extracts and matches geometric features in Cartesian space and has a lower requirement on the cloud density. LOAM includes two algorithms. One algorithm performs odometry at a high frequency but with low accuracy. The other algorithm runs at a lower frequency for mapping. Then, they fused visual and LiDAR sensors and proposed a visual-LiDAR odometry method (VLOAM) [26] based on LOAM. Compared with LOAM, VLOAM has higher accuracy. VLOAM includes visual odometry to estimate ego-motion at a high frequency and LiDAR odometry to refine the motion estimation at a lower frequency. Deschaud [27] proposed a 3D LiDAR SLAM method based on a scan-to-model matching framework, which used the implicit moving least squares (IMLS) surface representing LiDAR sweeps.

Supervised LiDAR SLAM
Li et al. [28] proposed a supervised LiDAR odometry (LO-Net) method, which has the similar accuracy with LOAM. LO-Net can be trained in an end-to-end manner. LO-Net can learn feature representation from LiDAR data efficiently. Cho et al. [29] proposed DeepLO. DeepLO is based on a geometric constraint and incorporates the ICP algorithm into a deep learning framework. Li et al. [30] proposed a semantic SLAM method including a semantic segmentation network. The system can construct a 3D semantic map. Lu et al. [31] proposed a learning-based LiDAR localization system (L3-Net). L3-Net achieves centimeter-level localization accuracy. 3D convolutions are used to enhance the accuracy.

Unsupervised LiDAR SLAM
Yin et al. [32] proposed a LO method, which is based on an unsupervised convolutional auto-encoder structure (CAE-LO). However, CAE-LO is not an end-to-end method, which only uses a neural network to extract features from 3D LiDAR data. Cho et al. [33] proposed an unsupervised LiDAR odometry method based on geometric information (UnGLO). UnGLO can output the poses directly. UnGLO utilizes a 2D spherical projection for input representation and uses point-to-plane ICP to formulate the loss function.

Our SLAM System
Our SLAM consists of an unsupervised visual-LiDAR odometry component, a deep learning-based loop closure detection component, and a 3D mapping component. The loop closure detection is used to decrease the accumulated errors of our odometry. The 3D map is constructed by using an optimized global pose. Figure 1 shows the overview of our SLAM system.

Visual-LiDAR Odometry Component
The odometry needs to predict the pose from a consecutive input sequence. We propose an RCNN to be the core of our odometry. The network adopts an unsupervised training framework. The inputs are RGB images and multi-channel depth images generated from 3D LiDAR point clouds. The outputs are the 6D pose with an absolute scale.

Data Preparation
To ensure the unity of the network structure, our odometry does not use LiDAR data directly. We convert the 3D LiDAR data to multi-channel depth images. The multi-channel depth images include three depth images from left, right, and front views. The purpose of using multi-channel depth images is to get more information about the sensor motion. We project the LiDAR data onto the imaging plane to obtain the front depth images, which correspond to RGB images at the pixel level. I = {I t |1 ≤ t ≤ n} represents an RGB image sequence, and D f = {D f t |1 ≤ t ≤ n} denotes a front depth image sequence corresponding to I. Let q t = x t , y t , z t denote a 3D LiDAR point, and let p t = u t , v t be its projection on I t , and d t is the depth value of p t . The projecting process can be described as: where p t = u t , v t , 1 and q t = x t , y t , z t , 1 are the homogenous coordinates of p t and q t respectively. [R t] is the extrinsic parameter matrix (rotation matrix and translation vector) between the 3D LiDAR data and the camera. K is the intrinsic matrix of the camera. Through Eq. (1), d t can be solved. The corresponding depth value of I t namely D f t can be obtained. Due to the sparse distribution of point clouds, only a few pixel points have depth values. We use the barycentric interpolation method to fill the holes in the depth images.
For the left and right depth images, we build two virtual cameras generated from the real camera rotating 90 • clockwise and anticlockwise with itself as the center, and then we obtain the left and right cameras. Similarly, we use Eq.
(1) to project the LiDAR data onto imaging planes of the left and right cameras to get the left and right depth images. D l = {D l t |1 ≤ t ≤ n} and D r = {D r t |1 ≤ t ≤ n} represent consecutive left and right depth image sequences respectively. The LiDAR points that cannot be projected onto the three imaging planes (front, left, right cameras) are discarded.

Network Architecture
The network is a four-channel architecture. The inputs of the four channels are RGB images, front depth images, left depth images, and right depth images. The inputs of each channel are multiple consecutive images. We input five consecutive images to the network in training. The RGB images and the depth images are resized to 416 × 128 × 3 . The architecture of our odometry component is shown in Fig. 2.
The design of our CNN framework is based on Oxford's visual geometry group (VGG) [20]. We modify the VGG network. Each channel of the CNN starts with a 7 × 7 convolutional layer, followed by two 5 × 5 convolutional layers. Both are used to capture the basic features of images. The remaining structure is five 3 × 3 convolutional layers. We reduce the size of the convolutional layer to obtain more detailed features. In the proposed network, the strides used We also use zero padding to prevent the image size from decreasing too quickly. A rectified linear unit (ReLU) activation function follows each convolutional layer to introduce nonlinearity to the network. The convolutional layers are detailed in Table 1.
The CNN as a feature extractor converts images into feature vectors, while the RNN as a pose prediction module combines the feature vectors in consecutive time steps. The RNN is a long short-term memory (LSTM) structure. Compared with a standard RNN, the LSTM can avoid the problem of long-term dependence. The core of an LSTM is the cell state, which is controlled by three regulators, namely the forget, input, and output gates. The number of hidden units in an LSTM cell is 256. According to [34], we set the bias of the forget gate to 1.
The full connection layer follows the LSTM. The fully connected structure has two layers. The dimension of each layer is set to 256. The outputs of the full connection layer are the rotation and translation vectors, namely the 6D pose. The rotation vector is represented by 3 values using Euler angles, and the translation vector includes 3 values in the Euclidean space.

Loss Function
We use RGB images and multi-channel depth images to calculate the loss function. The loss function consists of two parts: 2D and 3D spatial loss. The 2D spatial loss is the difference of pixel values between a raw image and a reconstructed image. And the 3D spatial loss is the difference between a raw 3D point cloud and a reconstructed 3D point cloud. The illustrations of loss calculations are shown in Figs. 3 and 4 for the 2D and 3D spatial loss respectively. T t−>t+1 represents the predicted transformation matrix from the network.
• 2D spatial loss: We project the pixels of an RGB image from I t to I t+1 . Î t+1 is the reconstructed image, which is constructed by R t−>t+1 , t t−>t+1 , and I t . R t−>t+1 and t t−>t+1 represent the predicted rotation matrix and translation vector from the network. The reconstruction process of Î t+1 can be described as: where p t+1 is a pixel coordinate of Î t+1 , and the corresponding pixel value is same as p t . s is a scale factor. Since p t+1 is not a integer in most cases, it is projected to its four neighboring pixel points: top-left, top-right, bottom-left, and bottom-right as recommended in [19].
Similarly, we use p t to construct Î t . p t can be solved by Eq. (3). The process is described as: where p t+1 is the homogenous coordinate of p t+1 . • 3D spatial loss: We convert the depth images to the 3D space to generate a point cloud, that is, a pixel of depth images is converted to a 3D point. C = {C t |1 ≤ t ≤ n} is the point cloud generated from D f , D l , and D r in the camera coordinate. c t = x t , y t , z t is a 3D point of C t . Translating c t to the coordinate of C t+1 can be expressed by the following equation: The point cloud Ĉ t+1 is constructed by ĉ t+1 . The difference between Ĉ t+1 and C t+1 is used to calculate the 3D spatial loss. We can also use Eq. (4) backwards to reconstruct C t . The inverse equation is described as follows: where ĉ t can be used to construct the point cloud Ĉ t . The 2D spatial loss is described as: The 3D spatial loss is described as: We introduce two independent weights 2D and 3D to balance the 2D and 3D spatial loss. The total loss L is described as:

Loop Closure Detection Component
To reduce the accumulated errors of odometry, we use a loop closure detection component to find loop closures in the trajectory and improve the pose accuracy. The architecture of the loop closure detection component is shown in Fig. 5.
We adopt a pre-trained CNN for loop closure detection. The inputs of the pre-trained CNN are RGB images and multi-channel depth images, which are converted into feature vectors. We choose VGG19 to extract the feature vectors from the images.
At the time step t, we can obtain four images ( I t , D f t , D l t , and D r t ). The four images are converted to four feature vectors. We concatenate the four feature vectors to generate a new vector v t (1 ≤ t ≤ n) to describe a scene. By comparing the cosine distance between two feature vectors v i and v j (i ≠ j) , we can judge if the trajectory has a loop closure. The calculation equation of cosine distance is described as: where d cos represents the similarity between i and j . If d cos is higher than a threshold d TH , it can be considered that a loop closure has been found.
We convert the image sequence, including RGB images and multi-channel depth images, to feature vectors. Every scene has a corresponding feature vector. These feature vectors form a vector sequence V = {v t |1 ≤ t ≤ n} . We use Eq. (9) to extracting key frames. The first frame is set as the current key frame. The cosine distance between the current key frame and the next key frame cannot exceed d KEY . Following the vector sequence V, the next key frame can be found. Then, we set the next key frame as the current key frame. This process is repeated until the vector sequence V is traversed. Extracting the key frames also can reduce the If the cosine distances of all feature vectors from key frames are lower than the d TH , it is considered that there is no loop closure in the sequence. In this case, g2o is not used to optimize the pose.
After the loop closure detection, we feed the loop closure information and the predicted pose into the pose optimization system. Then, we use g2o [8] to optimize the predicted pose.

Training
The network was trained on a desktop with an Intel Core E5-1650 v3 @3.50GHz CPU and a Nvidia GeForce GTX 1080Ti 11GB Memory GPU.
Our code was based on Tensorflow. We used the KITTI odometry dataset as the training data, including 22 sequences captured by cars in cities, suburbs, highways. Ground truth data are provided in sequences 00-10, while 11-21 are not. UnGLO used 3D LiDAR data for training. SfMLearner and UnMono used monocular RGB images for training. UndeepVO and DeepSLAM used binocular RGB images for training. Our system was trained by monocular RGB images and multi-channel depth images generated from 3D LiDAR data. The size of training images was set to 416 × 128 × 3 , which was used by SfMLearner, UnMono, UndeepVO, Deep-SLAM, our SLAM, and odometry. The input image sizes of VISO2-Mono and VIS2O-Stereo were 1242 × 376 . Due to the relatively small training dataset, we enhanced the training data to ensure the system with stronger robustness. We enhanced the training data in the following aspects: -Luminance: We randomly changed the light intensity on RGB images, and the adjustment range was [0.7, 1.3]. -Scale: We randomly changed the scale of RGB images and multi-channel depth images. The adjustment range of X and Y were [1.0, 1.2], and then they were clipped to 416 × 128. -Rotation: We randomly rotated RGB and multi-channel depth images in the range of r ∈ [−5, 5] degrees.
After completing the above steps, the processed data were fed into the network. The optimizer of our network was Adam [37]. We adopted the recommendations for the first and second attenuation indices: 1 = 0.9 and 2 = 0.999 . The network inputs were five consecutive RGB images and multi-channel depth images in one batch. The batch size of the training was 32. Within a total training length of 40 epochs, the loss function value tended to be stable around 15 epochs under normal conditions.

Performance Evaluation
Compared to our odometry, our SLAM adds a loop closure detection component. According to the loop closure information, SLAM uses g2o to optimize the pose from the odometry. DeepSLAM adopted 00-02, 08, 09, and 11-21 sequences for training. Other unsupervised methods used 00-08 sequences as the training dataset. In the training process, UnMono, DeepSLAM, SfmLearner, and UnDeepVO are all trained by the enhanced training dataset. The experimental comparison is divided into four aspects: -Our odometry was compared with other odometry methods in 00-10 sequences. -Our SLAM was compared with our odometry in 00-10 sequences. -Our SLAM was compared with VISO2-Stereo and UnMono in 11-21 sequences. -The 3D map was constructed through the predicted global pose graph.
Since the KITTI odometry dataset is taken from an outdoor moving vehicle, only minor deviations were generated in the vertical direction. To clearly present the differences between trajectories, the y axis is omitted in trajectory diagrams. The initial point of each trajectory is set as (0,0). Intuitively, the closer the trajectory is to the ground truth (GT) curves, the higher the accuracy. The trajectories of odometry selected from 00 to 10 sequences are presented in Fig. 6. The trajectories of GT, our odometry, and UnMono are represented by black, blue, and green curves respectively. As can be observed, the trajectories of our odometry are closer to GT curves than UnMono. The accuracy of our odometry is higher than UnMono. It can be seen the importance of adding multi-channel depth data to the system, which can improve the predicted accuracy of trajectories.
The trajectories of our SLAM selected from 00 to 10 sequences are presented in Fig. 7. The trajectories of our SLAM are represented by red curves. As can be observed, the trajectories of our SLAM are closer to GT curves than our odometry. The accuracy of our SLAM is higher than our odometry. Due to the influence of accumulated errors, the prediction accuracy of turning angles plays a pivotal role. It can be seen, adding loop closure detection to our odometry, the deviation between prediction trajectories and GT trajectories was reduced.
The trajectories of our SLAM selected from 11 to 21 sequences are presented in Fig. 8. The trajectories of VISO2-Stereo are represented by black curves. Since there is no ground truth in 11-21 sequences, VISO2-Stereo is used as a reference. As can be observed, the trajectories of our SLAM are closer to VISO2-Stereo than UnMono, which again proves the importance of multi-channel depth data.
The detailed error analysis is shown in Tables 2 and 3. We computed the root mean square error (RMSE) of translation and rotation on the lengths of 100, 200,..., 800 meters. As can be observed, the average error of our odometry is less than UnMono. It can be seen the training and testing results are improved by adding the multi-channel depth images. Only the rotation errors of our odometry in sequences 04 and 05 are higher than UnMono. The pose transformation includes rotation and translation. When the sensors (the camera and LiDAR) go straight, the rotation angle is small. When the sensors turn, the rotation angle is large. The translation is a linear process, while rotation is a non-linear process. Therefore, rotation is more difficult to predict for the network. In the training process, UnMono adds many turning data to improve the rotation prediction accuracy. However, this may result in decreased translation accuracy. Therefore, the proposed method does not add too many turning data during the training process. The training data of turning and going straight are balanced, so the proposed odometry may have higher rotation errors than UnMono in some sequences.
Compared with other unsupervised visual odometry and SLAM methods, our SLAM system has higher accuracy. Compared with VISO2-Stereo, our SLAM and odometry methods automatically extract features without manual designs, and they could perform better in some challenging scenes. The data of sequence 01 are collected from the expressway, which is more difficult to estimate the pose than urban and rural streets. As can be observed, our SLAM and odometry methods perform better than VISO2-Stereo in sequence 01. It can be seen, our methods are more robust than VISO2-Stereo. Compared with UnGLO, our odometry has a lower translation error. Since UnGLO adopts LiDAR data directly, UnGLO has a lower rotation error. From the predicted results of sequences 09 and 10, which are not used for training, our odometry has higher accuracy. It can be seen our odometry is more robust than UnGLO, and by comparing the mean errors of our SLAM and odometry, it can be seen adding loop closure detection in the system can reduce translation and rotation errors. However, in some cases, the optimization of the posture may have some negative effects. For example, our SLAM has a higher rotation error than our odometry in sequence 07. The error of the odometry accumulates over time, the initial error is small, and as time increases, the error gradually increases. G2o can effectively reduce the gradually accumulated small errors. However, if the predicted pose of the odometry at a certain time produces an abrupt large error as shown in sequence 07, using g2o does not always have a good optimization result. Since g2o optimizes the global pose, in order to reduce this abrupt large errors, other poses will be over-corrected and new errors will be introduced. The 3D reconstruction on sequence 09 is shown in Fig. 9. The 3D LiDAR data was projected to the image plane, and the  RGB pixel points corresponding to the 3D points were found, that is, the RGB information of 3D points was obtained. We used the color point clouds to reconstruct the 3D scene. Finally, we analyze the time complexity of UnMono and our odometry. In the testing process, the batch size was set to 1. The time consumption of UnMono and our odometry is 25.63ms and 29.01ms respectively. Compared with UnMono, our odometry spends a longer running time due to the additional network structure.

Conclusions
In this paper, we proposed an unsupervised multi-channel visual-LiDAR SLAM method (MVL-SLAM), which consists of an unsupervised multi-channel visual-LiDAR odometry component (MVLO), a loop closure detection component, and a 3D mapping component. The estimation accuracy of poses is improved after adding multi-channel depth information generated from LiDAR data. The absolute scale can be obtained directly compared with other monocular systems. The loop closure detection uses a CNN model to extract features from the depth and RGB images. We combine the global poses, RGB images, and LiDAR data to construct 3D maps.
In the future, we will explore other options of combing LiDAR point clouds and RGB images together to improve its robustness under challenging circumstances.