FSD-SLAM: a fast semi-direct SLAM algorithm

Current visual simultaneous localization and mapping (SLAM) systems suffer from feature loss caused by fast motion and unstructured scenes in complex environments. To address this problem, a fast semi-direct SLAM algorithm is proposed in this paper. The main idea of this method is to combine the feature point method with the direct method in order to improve the robustness of the system in environments with scarce visual features and low texture. First, a subgraph-based feature enhancement module is developed to extract image feature points more stably. Second, an apparent-shape weighted fusion method is proposed for camera pose estimation, which still works robustly in the absence of feature points. Third, an incremental dynamic covariance scaling algorithm is studied to optimize the error of camera pose estimation. Finally, based on the optimized camera pose, a surface element (surfel) model is designed to estimate and fuse the point cloud pose and obtain an accurate three-dimensional point cloud map. The proposed algorithm has been tested extensively on the benchmark TUM dataset and in a real environment. The results show that the algorithm performs better than existing visual SLAM algorithms.


Introduction
Simultaneous localization and mapping (SLAM) is a key technology for mobile robots used for detection, navigation, and rescue in unknown environments. The robot uses the information acquired by its sensors to reconstruct the environment and estimates its own position in real time from the reconstructed part. With the development of depth cameras, many studies on visual SLAM (VSLAM) systems adopt RGB-D cameras as sensors [1].
Current VSLAM methods fall into two main directions: feature point methods based on feature matching, and direct methods that do not require feature points [2][3][4][5][6][7][8][9][10]. Because feature points must be extracted and descriptors computed, feature-based methods are not only time-consuming but also cannot be applied to environments with low texture or other missing features. Compared with the feature method, the direct method estimates the camera pose directly from the brightness differences of pixels. Since no feature points need to be computed, the direct method runs very fast. However, the direct method is so sensitive to the camera's internal parameters and exposure that the pose is easily lost when the camera moves quickly; as a result, the direct method is less accurate than the feature point method in camera pose estimation.
To address the above-mentioned problems, this paper proposes a fast semi-direct SLAM algorithm (FSD-SLAM). The contributions are as follows:
1. A feature enhancement module is designed for stable feature extraction. With this module, the subgraph, which is the image with feature enhancement, is used to improve the efficiency of feature extraction.
2. An apparent-shape weighted fusion method is proposed for accurate camera pose estimation. Based on this method, a camera pose estimation algorithm is designed that achieves accurate pose estimation in hard scenes with fast motion, no structure, and missing features.
3. An incremental dynamic covariance scaling algorithm is designed for efficient global optimization. To address the high computational load of traditional back-end optimization algorithms, this algorithm performs global optimization and improves the efficiency of the optimization.
4. A surface element model is proposed for robust point cloud fusion. During map construction, the superposition process often produces staggered layers that cannot be perfectly superimposed. To solve this problem, the surface element model is studied for point cloud fusion to improve the success rate of map construction.
The rest of this paper is organized as follows. The second section discusses the related works. The third section introduces the proposed SLAM algorithm in detail. The fourth section analyzes the results of the algorithm evaluation. Finally, the Conclusion section gives the conclusion and potential application scenarios.

The related works
In the field of SLAM, we summarize the related works in three directions: feature point-based method, direct method, and hybrid method.

The feature point-based method
Henry et al. [11] at the University of Washington first used an RGB-D camera to implement a SLAM algorithm. They used the SIFT algorithm [12,13] to extract features, used known depth data to generate 3D-3D feature point pairs, and adopted the ICP algorithm to iteratively solve for the optimal camera pose. Endres et al. [14] implemented the RGB-D SLAM system, which uses the SURF algorithm [13,15] for feature point detection and descriptor calculation, and an improved ICP method to estimate the camera pose. In 2014, Mur-Artal and Tardós [16] designed the ORB-SLAM system; later, ORB-SLAM2 [17] was proposed on the basis of ORB-SLAM. ORB-SLAM absorbed the advantages of PTAM [18] and adopted the ORB algorithm [19] to extract feature points and calculate descriptors. It is a typical representative of SLAM systems based on the feature point method. However, ORB-SLAM also has obvious shortcomings, such as a large amount of computation and unsatisfactory map reconstruction in areas where feature points are missing.

The direct method
Engel et al. [20] proposed LSD-SLAM, which performs small-baseline stereo matching over a large number of pixel pairs and achieves semi-dense map reconstruction on a CPU. However, in the camera pose optimization phase, feature points are still needed for loop closure detection, so this method does not completely remove the dependence on feature points. In terms of accuracy, LSD-SLAM still lags far behind feature point-based SLAM methods such as RGB-D SLAM and ORB-SLAM.

The hybrid method
Forster et al. proposed a semi-direct visual odometry system named SVO [21]. In SVO, a direct method is used to track sparse feature points to complete the camera motion estimation. SVO's high speed is its biggest advantage, putting it far ahead of other SLAM systems. However, SVO is designed for a UAV's downward-looking camera, cannot adapt to complex robot movements, and places high demands on the environment. Moreover, SVO uses a monocular camera, so it cannot recover the true pose information and scene scale. In addition, SVO is a visual odometry system and cannot perform loop closure to improve the consistency of positioning and mapping. Krombach et al. [22] also combined a direct-method SLAM (LSD-SLAM) with a feature-method VO (LIBVISO2) [23], using the feature method to initialize the camera pose for the direct method. This algorithm can obtain a semi-dense depth map, but the accuracy of its visual odometry needs to be improved.
Although previously proposed VSLAM algorithms have been successful, they may fail in hard environments. In our method, the feature enhancement module combines the feature point information and photometric information of the image, so stable and accurate camera pose estimation can be achieved in various complex scenes. More importantly, our system improves the camera pose optimization and the entire map reconstruction process to achieve better performance.

The proposed fast semi-direct SLAM
The main idea of the FSD-SLAM system is to combine the feature method and the direct method at the same time. The system framework is shown in Fig. 1. There are three main stages: initialization and feature enhancement, pose estimation and optimization, and 3D mapping. The following four subsections describe these three stages in detail.

Initialization and feature enhancement
First, the entire system is initialized. With the input of video frames, the image features of two adjacent frames are calculated and matched. When there are enough feature matching points, the depth of the map point associated with the feature and the 3D coordinates of the map point are solved to complete the initialization.
Second, extracting key points under motion blur and in textureless environments is a challenge for VSLAM systems. FAST key points [24] are a kind of feature point that is particularly fast to compute. The subgraph-based feature enhancement module can enhance the features of FAST key points. According to the reliability of robot ranging, it generates subgraphs by enhancing three-dimensional feature points, which makes it convenient for the system to extract image feature points stably. After system initialization, the video frames are put into a fixed timeline queue Q_t. The feature enhancement module of the subgraph acts on the head of Q_t, outputs the RGB frame with FAST feature point enhancement, and updates the output results back to the timeline queue Q_t. Figure 2 shows how this module works in an environment with inadequate features. It works by enhancing three-dimensional feature points in an environment with insufficient features. The enhanced 3D point l_i of the feature L in the ith subgraph is

l_i = Σ_{j=v}^{m} H_j F_GD v_j    (1)

Here, v_j represents the 3D feature point in the jth node, F_GD is the coordinate transformation from the camera to the robot, and H_j represents the transformation of the jth ranging. v and m represent the first and last node index of the subgraph, respectively.
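The accumulation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `(H_j, points)` node layout, and the composition order H_j ∘ F_GD are our assumptions.

```python
import numpy as np

def accumulate_subgraph_points(nodes, F_GD):
    """Sketch of the subgraph feature-enhancement step: 3D feature points v_j
    observed at each node j are mapped through the camera-to-robot transform
    F_GD and the node's ranging transform H_j, then collected into one
    enhanced point set for the subgraph.
    nodes: list of (H_j, points) with H_j a 4x4 pose and points an (N, 3) array.
    F_GD: 4x4 camera-to-robot transform (homogeneous coordinates)."""
    enhanced = []
    for H_j, pts in nodes:
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
        enhanced.append((H_j @ F_GD @ homo.T).T[:, :3])   # camera -> robot -> subgraph frame
    return np.vstack(enhanced)
```

With identity transforms the points pass through unchanged; a translated ranging transform H_j shifts every point of that node accordingly.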

Camera pose estimation based on weighted fusion
To calculate the camera pose more accurately, this paper proposes to fuse the reprojection information of matched feature point pairs with the corresponding pixel photometric information to estimate the camera pose. This method overcomes the problem that the feature point method has difficulty solving the camera pose accurately in environments lacking texture features. High-quality camera pose estimation directly improves the real-time performance and accuracy of the whole SLAM system. Here, c_13 is the new camera pose estimated by the motion model, c_12 is the actual camera position, and R and t, respectively, represent the rotation and translation of the current frame relative to the previous frame. The pixel coordinates of the spatial point P_1 on the two frames are p_11 and p_13, respectively. If the camera pose is accurate, the real positions p_12 and p_13 should be the same point. If the camera pose is not correct, there will be a reprojection error e between the projection points p_13 and p_12. c_13 is used to reproject the visible map points of the previous frame onto the new image, data association is carried out in a search window s_w around the projection points, and then the reprojection error e is minimized over the Euclidean transformation parameters to calculate the camera pose T to be estimated. Without loss of generality, for any spatial point P_1i, e is defined as

e_i = p_1i − (1/Z_1i) K T P_1i    (2)

Considering all the feature points, the total reprojection error is

E_p = Σ_{i=1}^{N} || p_1i − (1/Z_1i) K T P_1i ||^2    (3)

In Eqs. (2) and (3), p_1i is the two-dimensional pixel point corresponding to the spatial point P_1i on the second frame image; T represents the camera pose to be estimated; K represents the camera's internal parameters; Z_1i represents the Z-axis coordinate after P_1i is projected into the camera frame; and N represents the total number of feature matches.
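The per-point reprojection residual of Eq. (2) can be sketched as follows. This is an illustrative sketch under standard pinhole assumptions; the function name and array layout are ours, not the paper's.

```python
import numpy as np

def reprojection_error(p_obs, P_world, K, T):
    """Reprojection error e = p_obs - (1/Z) K (T P) for one matched point
    (sketch of Eq. (2)). T is a 4x4 camera pose, K the 3x3 intrinsic matrix,
    P_world a 3-vector, p_obs the observed 2D pixel in the second frame."""
    Pc = (T @ np.append(P_world, 1.0))[:3]   # point in camera frame
    proj = (K @ Pc) / Pc[2]                  # pinhole projection, divide by depth Z
    return p_obs - proj[:2]
```

Summing the squared norms of these residuals over all N matches gives the total reprojection error of Eq. (3), which a nonlinear least-squares solver then minimizes over T.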
To strengthen the optimization of the feature point errors, the photometric errors of matched feature point pairs are considered on top of the reprojection error optimization. The camera pose is optimized by fusing the reprojection and photometric errors. The photometric error is shown in Fig. 4.
Suppose that the spatial point corresponding to a pixel p_21 in the previous frame image is P_2, and the pixel position after projecting it onto the second frame image using the camera pose is p_22. If the camera pose is not accurate, there is a photometric error between the two positions. By iterative optimization, the photometric error between them can be continuously reduced, and finally the optimal camera pose is obtained. According to the above description, the photometric error of a pixel can be defined as

e_j = I_1(p_2j) − I_2( (1/Z_2j) K T P_2j )    (4)

Then, the sum of the photometric error functions of multiple pixel pairs is

E_g = Σ_{j=1}^{W} || I_1(p_2j) − I_2( (1/Z_2j) K T P_2j ) ||^2    (5)

In Eqs. (4) and (5), p_2j represents the pixel coordinates in the first frame; P_2j represents the spatial point corresponding to p_2j; Z_2j represents the Z-axis coordinate after P_2j is projected into the camera frame; I_1(·) and I_2(·) represent the photometric information of the corresponding pixels in the previous and current frame images; and W represents the total number of pixels.
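The photometric residual of Eq. (4) can be sketched for a single pixel with known depth. This is a simplified illustration: nearest-neighbour lookup replaces the bilinear interpolation a real direct method would use, and the function name and argument layout are our assumptions.

```python
import numpy as np

def photometric_error(I1, I2, p, Z, K, T):
    """Photometric error e = I1(p) - I2(project(K T P)) for one pixel
    (sketch of Eq. (4)). I1, I2: 2D grayscale arrays; p: integer pixel (u, v)
    in frame 1 with depth Z; K: 3x3 intrinsics; T: 4x4 relative camera pose."""
    u, v = p
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    P = np.array([(u - cx) * Z / fx, (v - cy) * Z / fy, Z, 1.0])  # back-project to 3D
    Pc = (T @ P)[:3]                                              # move into frame 2
    u2 = int(round(fx * Pc[0] / Pc[2] + cx))                      # reproject
    v2 = int(round(fy * Pc[1] / Pc[2] + cy))
    return float(I1[v, u]) - float(I2[v2, u2])
```

Summing the squares of this residual over W sampled pixels gives the photometric cost E_g of Eq. (5).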
By combining the reprojection error of matched feature points with the photometric error of pixel pairs, the camera pose estimate can be solved as

T* = argmin_T [ λ E_p + (1 − λ) E_g ]    (6)

In this formula, λ is a weight coefficient used to linearly adjust the roles of the reprojection error and the photometric error in camera pose optimization. This method improves the camera pose optimization by fusing and optimizing the matching errors of the sampled data under changes in baseline and illumination conditions, respectively. The weight λ takes the empirical value 0.5.
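The linear weighting can be written as a one-line helper; the λ/(1 − λ) split is our reading of "linearly adjust the role" of the two terms, and the function name is ours.

```python
def fused_cost(E_reproj, E_photo, lam=0.5):
    """Weighted fusion of the reprojection and photometric error terms
    (sketch of the combined objective); lam=0.5 is the empirical weight
    reported in the paper."""
    return lam * E_reproj + (1.0 - lam) * E_photo
```

Setting lam=1.0 recovers the pure feature point method and lam=0.0 the pure direct method, which is why the fusion degrades gracefully when either error source becomes unreliable.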

Global optimization of incremental dynamic covariance scaling algorithm
The system estimates the camera pose from images between adjacent frames and connects the poses at all adjacent moments to form the robot's motion trajectory. Since every camera pose estimate carries some error, a cumulative error forms as more images are input. To address this problem, two techniques are required: loop closure detection and global optimization. Loop closure detection adopts a fast detection method based on a visual bag-of-words to judge whether the robot has been to a place before. To obtain consistent trajectories and maps, the global optimization must jointly optimize the loop closure information and the camera pose estimation errors at different moments.
A representative global optimization algorithm is the SC (switchable constraint) algorithm [25]. The algorithm proposed in this paper is an improvement on the SC algorithm. The SC algorithm expresses the goal function of the SLAM optimization problem as

Y*, G* = argmin_{Y,G} Σ_i || y(a_i, a_{i+1}) − x_{i,i+1} ||^2_Σ + Σ_{ij} || Ψ(g_ij) (y(a_i, a_j) − x_ij) ||^2_Λ + Σ_{ij} || Ξ_ij − g_ij ||^2    (7)

Equation (7) shows that the solution of the SLAM optimization problem includes the node state variables Y* and the switch variables G*. The right side of the equation decomposes into three parts, expressed in turn as: pose estimation constraints, loop closure constraints with added switch variables, and switch variable constraints. Here, g_ij is the switch variable, Ψ(g_ij) is the scaling function, Ξ_ij is a switch prior, a_i is the robot pose at node i, x_ij represents the transformation between camera poses at adjacent moments, and y(a_i, a_j) is the measurement function that predicts the transformation between a_i and a_j. The covariances of the odometry and sensor measurements are Σ and Λ. It can be seen from Eq. (7) that the SC algorithm must jointly optimize the switch variables, poses, and loop closures at the same time, so the computational load is large and the real-time performance is weak. We improve on these problems.
To get a better solution, we consider adjacent nodes v and u and the edges between them. The error can be divided into two parts: the edges excluding uv, and only the edges containing uv. The objective function can then be redefined (Eq. (8)). Setting q = {Y*, G*} and rearranging Eq. (8) gives Eq. (9). Here, s(·) represents all the errors except those containing edge uv, and h(·) represents the errors containing edge uv. When the optimizer converges to a certain degree, the partial derivative of the entire function with respect to each variable is zero. Therefore, the partial derivative with respect to the switch variable g_uv in Eq. (9) can be obtained (Eq. (10)). Using the equality of both sides of the equation, formula (10) can be simplified (Eq. (11)). By substituting the obtained g into h(·), we get Eq. (12), where ĥ is the value obtained when the derivative of the function h is set to 0; finding the maximum of this function is equivalent to obtaining the upper limit of χ²_v, so as to obtain all possible solutions computed by the optimization program. The derivative with respect to ĥ is given in Eq. (13). It can be seen from Eq. (13) that when χ_v is 0, the derivative equals 0, and the function h takes its minimum at this point. Considering χ_v at 0 and as it tends to positive or negative infinity (Eq. (14)), it is easy to see that 0 ≤ h ≤ Φ. We must consider all the edges between nodes u and v; that is, for all switch variables, the constraint 0 ≤ h ≤ Φ is satisfied (Eq. (15)). From Eq. (15), we obtain Eq. (16). At this point, the optimal value range of the switch variable is found. In this way, we not only retain the advantages of the original SC algorithm, but also effectively reduce the value range of the switch variables and increase the accuracy of the global optimization.
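For intuition, the kind of closed-form scaling that such a derivation yields can be illustrated with the standard dynamic covariance scaling rule from the DCS literature. This is an assumption for illustration; the paper's own bound on the switch variable may differ.

```python
def dcs_scale(chi2, phi=1.0):
    """Standard dynamic covariance scaling factor for a loop-closure residual.
    chi2: squared Mahalanobis error of the constraint; phi: free parameter.
    Returns s in (0, 1]: s = 1 keeps the constraint fully active, while
    s -> 0 smoothly deactivates an inconsistent (likely false) loop closure."""
    return min(1.0, 2.0 * phi / (phi + chi2))
```

The key property shared with the derivation above is that the switch value is no longer a free variable to be jointly optimized: it is computed in closed form from the residual, which removes the extra optimization dimensions and restricts the switch to a bounded range.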

3D mapping
After obtaining the high-precision camera pose through the optimization process, the point cloud data are obtained by back-projecting each image pixel into three-dimensional space, and point cloud fusion is then performed to obtain a high-precision dense three-dimensional map.
In this paper, the surface element (surfel) model is used as the minimum processing unit. Each surface element stores its location (X, Y, Z), normal vector n, color (R, G, B), surfel radius r, weight w, and point acquisition time t. Global optimization is achieved by constructing a deformation graph, which is composed of many edges and nodes. Each node g_n stores position information g_g, a rotation matrix g_R, a translation vector g_t, and time information g_t0.
The rotation matrix and translation vector are the parameters to be optimized.
Any three-dimensional structure in space can be composed of surface elements, and any surface element M_S is controlled by the nodes of the deformation graph, C(M_S, g). When new position parameters are obtained by node optimization, they affect the surface element, changing both its normal vector and its position. After node optimization, the affected surface element location becomes Eq. (17), where M_S^P represents the coordinates before deformation, M̂_S^P represents the coordinates after deformation, and g_n^R and g_n^t are the rotation and translation of the node after optimization. The surface element normal vector becomes Eq. (18), where M_S^n represents the normal vector before deformation, M̂_S^n represents the normal vector after deformation, and U represents the transformation relationship between the surface element before and after deformation. w_n(M_S) represents the weight of node g_n's influence on the surface element model, and the weights of the nodes that influence the current surfel sum to 1 (Eq. (19)), where d_max represents the maximum Euclidean distance between the surface element and all nodes in its neighborhood.
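The node-weighted surfel update can be sketched as below. This is a hedged sketch: the distance-based weight shape is our assumption (only the normalization to 1 is stated in the text), and the per-node update form g_R (p − g_g) + g_g + g_t follows the usual deformation-graph convention.

```python
import numpy as np

def deform_surfel(p, nodes, d_max):
    """Apply a deformation-graph update to one surfel position p.
    nodes: list of (g_g, g_R, g_t) with node position g_g (3,), rotation
    g_R (3x3), translation g_t (3,). Each node contributes
    w_n * (g_R (p - g_g) + g_g + g_t), with Euclidean-distance weights
    normalized so that the weights of the influencing nodes sum to 1."""
    w = np.array([max(0.0, 1.0 - np.linalg.norm(p - g_g) / d_max)
                  for g_g, _, _ in nodes])
    w = w / w.sum()                      # weights over influencing nodes sum to 1
    new_p = np.zeros(3)
    for wn, (g_g, g_R, g_t) in zip(w, nodes):
        new_p += wn * (g_R @ (p - g_g) + g_g + g_t)
    return new_p
```

With a single identity-rotation node, the surfel simply inherits the node's translation, which matches the intent of Eq. (17).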
After the deformation graph is constructed, the pose parameters are further optimized through graph optimization. For the two point clouds to be fused, there are many corresponding point pairs, from which some pairs are uniformly sampled to establish constraints (Eq. (20)). Here, Q_P^d represents the position of the target point cloud, T_S represents the camera pose in the current frame, T_d represents the camera pose in which the loopback frame is detected, P(u, D_t^a) represents the point obtained by projecting this model through the pose of the previous frame, Q_P^S represents the position in the frame point cloud, and Q_Pt^d and Q_Pt^S represent the timestamps corresponding to Q_P^d and Q_P^S. The graph optimization optimizes the points at these loop closures using the following four cost functions.
First, the Frobenius norm is used to optimize the pose of each node to ensure that the rotation matrix R remains orthogonal; the corresponding squared error function is Eq. (21). Second, the nodes must be regularized so that the pose parameters of neighboring nodes vary smoothly; the corresponding squared error function is Eq. (22). Combining Eq. (17) to solve the deformed coordinates Q̂_P^d and Q̂_P^S of Q_P^d and Q_P^S, the point clouds of the two adjacent frames are aligned into the target point cloud, with the cost function of Eq. (23). Finally, because the target point cloud was generated previously, its coordinates must also be optimized; this squared error function is Eq. (24). Based on formulas (21)-(24), the final total optimization function can be constructed to solve the deformation graph; the total cost is a weighted sum of the four terms (Eq. (25)), where each weight takes an empirical value (Eq. (26)). The optimized camera pose is provided to the map construction program, and the point cloud is fused and optimized through the surface element model to obtain the final 3D reconstructed map.
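The first cost term, which keeps each node's rotation close to orthogonal, can be sketched directly from its definition. The usual form ||RᵀR − I||_F² is assumed here; the function name is ours.

```python
import numpy as np

def rot_cost(R):
    """Frobenius-norm penalty keeping a 3x3 node rotation close to orthogonal:
    ||R^T R - I||_F^2 (a sketch of the first graph-optimization cost term).
    Zero exactly when R is orthogonal."""
    d = R.T @ R - np.eye(3)
    return float(np.sum(d * d))
```

A true rotation matrix gives zero cost, while any scaling or shear introduced by the optimizer is penalized quadratically, which is why this term is summed over all deformation-graph nodes in the total objective.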

Experiments
We evaluated the proposed FSD-SLAM system on the public TUM dataset [26] of the Technical University of Munich and in a real environment. The experimental platform in the real environment is a TurtleBot 2 equipped with a Kinect v1 sensor. The operating system is Ubuntu Linux 18.04 LTS with ROS Melodic.

Evaluation standard
The indicators commonly used to evaluate VSLAM are the absolute trajectory error (ATE) [27] and the relative pose error (RPE) [28]. In addition, to provide a measure independent of the scene length, evaluations usually report the mean (Mean), median (Median), minimum (Min), maximum (Max) and root mean square error (RMSE). Given the estimated trajectory with translation components x̂ = [x̂_1, ..., x̂_n] and the ground-truth trajectory with translation components x = [x_1, ..., x_n], the RMSE is given by

RMSE = sqrt( (1/n) Σ_{i=1}^{n} || x̂_i − x_i ||^2 )    (27)
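The RMSE of Eq. (27) over the translation components can be computed as follows (a minimal sketch; it assumes the two trajectories are already time-aligned and associated, which the TUM tools normally handle):

```python
import math

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error over translation components
    (Eq. (27)). est, gt: equal-length lists of (x, y, z) positions,
    assumed already time-aligned frame by frame."""
    assert len(est) == len(gt) and len(est) > 0
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))  # squared Euclidean distance per frame
          for p, q in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))
```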

Evaluation on TUM RGB-D dataset
From the earliest Moravec, to Harris, to SIFT, SUSAN, and SURF, feature extraction algorithms have emerged in an endless stream. The features extracted by algorithms such as SIFT and SURF are excellent, but their time consumption is very large, which hurts real-time performance and degrades the system. After research, Edward Rosten and Tom Drummond proposed the FAST algorithm [24]. The FAST detector is very fast because it involves no complex operations such as scale or gradient computation.
Table 3 Translation error of the relative pose after optimization on the freiburg1_room sequence.
To verify the effect of the feature enhancement module in the system of this paper, for each image of the fr3_long_office sequence in the TUM dataset, the feature extraction algorithm mentioned above and FSD-SLAM are used for feature extraction, and the feature extraction efficiency of each image is calculated. Then we take Min, Max, Mean and RMSE for the feature extraction efficiency of each algorithm in the whole sequence, and Fig. 5 shows this result. It can be seen from the experimental result that FSD-SLAM can extract more feature points in a short time and has better feature extraction efficiency.
To qualitatively analyze the accuracy of FSD-SLAM for camera motion estimation in complex scenes, experiments were carried out on the TUM dataset with FSD-SLAM, FOVIS [29] (based on the feature point method), and DVO [30] (based on the direct method). Table 1 compares our method with the other two in terms of registration speed and accuracy; h represents the average feature extraction and registration processing time per frame. Figure 6 compares the pose estimation results of our method and the other two methods on the fr1/desk sequence. The fr1/desk sequence is image data recorded by KinectV2 at full speed (30 Hz); the sensor resolution is 640 × 480, and it was shot in a typical office environment.
Through comparative experiments, it can be seen that the three algorithms can correctly track the camera's trajectory.
However, it can be seen from Table 1 that FSD-SLAM is significantly better than DVO and FOVIS in terms of registration speed, and can reach more than ten frames per second.
The accuracy of the algorithm is evaluated using the optimized ATE and RPE as criteria. The freiburg1_room sequence in the TUM dataset is selected to test the effect of the optimization, compared with the representative RGB-D SLAM [14]. The ATE after trajectory optimization is shown in Fig. 7 and the RPE in Fig. 8. The results show that, on the same freiburg1_room sequence, the RPE and ATE of the FSD-SLAM algorithm are only about 10% of those of the RGB-D SLAM algorithm.
The quantitative evaluation of the optimized ATE is shown in Table 2. The quantitative evaluation of the optimized RPE is divided into translation errors and rotation errors; the results are shown in Tables 3 and 4, respectively. According to Tables 3 and 4, the positioning accuracy after FSD-SLAM optimization reaches 0.472%, while that after RGB-D SLAM optimization is 0.341%, which means the positioning accuracy after FSD-SLAM optimization is 38% higher than that of RGB-D SLAM.
Further, we verify the robustness of the system in environments with scarce visual features and low texture. Verification experiments are carried out on the TUM image sequences with and without texture and structure, compared with the ORB-SLAM2 [17] system. The evaluation standard is the number of times the system loses tracking over 20 runs. The experimental results are shown in Table 5; bold results indicate fewer tracking losses. It can be seen that the ORB-SLAM2 system loses tracking on the fr3_nostructure_notexture_far, fr3_nostructure_notexture_near, fr3_nostructure_texture_far and fr3_nostructure_texture_near sequences, while our system can track these sequences well. Overall, our system loses tracking less often than the ORB-SLAM2 system, which uses only the feature method, in most environments without texture and structure.
Another purpose of the proposed algorithm is to improve the real-time performance of the SLAM system. To evaluate this more effectively, the time to track a single frame on the TUM dataset is taken as the evaluation standard and compared with the ORB-SLAM2 system. Table 6 shows the single-frame tracking time of our system and the ORB-SLAM2 system under the same parameters.
In Table 6, MONO, STEREO and RGB-D, respectively, represent the three SLAM processing modes of monocular, stereo and RGB-D cameras [17]. Results in bold indicate faster tracking times. It can be seen that FSD-SLAM takes less time to track a single image in most cases, which improves the real-time performance of the SLAM system.

Evaluation in real environment
From the start of the robot to the end of the run, the duration of continuous movement is 48.90 s, the actual ground track length is 3.986 m, the average translation speed is 0.432 m/s, and the average angular speed is 29.882 rad/s. The test compares the ATE and RPE of the proposed method and several representative SLAM methods in camera pose estimation: RGB-D SLAM, HOOFR SLAM [3], ORB-SLAM, BAD SLAM [31] and DOOR-SLAM [32]. The results are shown in Table 7, where the bold item is the best. Figure 9 shows the initialized point cloud map and the global dense point cloud map obtained by the system. The experimental results in Table 7 show that this work outperforms the compared real-time SLAM methods and achieves a lower error rate. At the same time, Fig. 9 shows that our system can reconstruct a dense global point cloud map of the indoor environment.

Conclusion
The FSD-SLAM algorithm proposed in this paper uses the subgraph-based feature enhancement module in the initialization and feature enhancement stage, which improves the efficiency of feature extraction. Combining the feature point information and photometric information of the image in the camera pose estimation stage enables stable and accurate pose estimation in environments with scarce visual features and low texture. The incremental dynamic covariance scaling algorithm proposed in the optimization stage globally optimizes the estimated camera pose, which simplifies the optimization goal, avoids a large amount of computation, and improves positioning accuracy. Finally, a 3D point cloud map is constructed through map fusion. Both the TUM dataset experiments and the online experiments in real scenes show that FSD-SLAM maintains high accuracy and positioning precision. The dataset validation shows that FSD-SLAM rarely loses tracking in most untextured and unstructured environments and has high real-time performance.
With its advantages in precision, reliability, efficiency and robustness, the FSD-SLAM algorithm could be applied to the navigation of service robots in the noisy environments of large shopping malls, the navigation and positioning of AGVs in automated warehouses, automatic parking systems, and the rapid patrol of commercial buildings. Our next work is to incorporate semantic mapping so that the algorithm can be applied to both indoor and outdoor scenes.