1 Introduction

Navigation systems are critical components of most autonomous and non-autonomous robotic systems, and alternative localization systems are welcome, if not necessary, in environments where satellite navigation is unavailable. In this work, we present a novel visual SLAM system for underwater camera-only navigation in natural environments.

Underwater imaging is a complex matter, particularly in natural environments, as many factors degrade image quality, such as back-scattering from suspended particles and distortion introduced by light passing through different media (water, glass, air). Nevertheless, several positive aspects can be exploited: the smooth motion of the vehicle, a much lower vehicle speed compared to other kinds of robotic platforms, and the capability to employ more computational resources, owing to much looser restrictions on weight and available power than on popular robotic platforms like multi-rotors. Our focus is on reliability, continuity, and robustness of the pose estimation, while generating enough mapping information for navigation, including obstacle avoidance. Our system is built upon ORB-SLAM (Mur-Artal & Tardós, 2017a), currently one of the most successful state-of-the-art visual SLAM systems.

Fig. 1
figure 1

VSLAM estimate generated on sequence 4 of the Aqualoc dataset. No other tested monocular VSLAM state-of-the-art system is able to produce a meaningful result for this sequence

The main contributions of this paper are:

  • A three-view initialization procedure that does not require model selection

  • A fast, exact solution to the nearest-neighbor descriptor search used for triangulation, increasing tracking robustness

  • A motion model that predicts the camera pose in the absence of visual information while preserving map consistency

  • Partial synchronization between front-end and back-end

  • A procedure to detect station-keeping operations and use the freed computing power to perform a global bundle adjustment

  • The descriptor-matching solution used for triangulating new points is also applied to loop closure detection

  • A pruning procedure that enables lifelong operations

In summary, by strengthening the worst-performing elements of ORB-SLAM in underwater sequences and by adding new functionalities which specifically target the underwater scenario, we improved the overall performance of the system, reducing the median root mean square error (RMSE) by 34.6% and increasing the number of loop closures by 33.3%. As a result, we can produce SLAM estimates on sequences where it was not possible before; see Fig. 1.

In the following section, a literature review is presented, together with an evaluation of systems for underwater visual navigation. Subsequently, Sect. 2 gives an overview of the proposed system, and Sects. 3–9 are dedicated to the individual contributions. Finally, in Sect. 10, results are compared against ORB-SLAM and Direct Sparse Odometry with Loop Closure (LDSO) using in-air datasets and a newly published underwater dataset that targets visual odometry (VO) and visual SLAM.

1.1 Related work

1.1.1 Monocular visual odometry and SLAM

Recently, several works (Pumarola et al., 2017; Maity et al., 2017; Jose Tarrio & Pedre, 2015) utilized edge-based features within the VO/VSLAM pipeline, but pointed out that these methods outperform point-only methods only in artificial environments. In contrast, natural underwater environments rarely contain geometrically well-defined edges such as those found in an office or street scene, so using edge features in this context can introduce false correspondences.

Work in visual odometry includes direct methods like direct sparse odometry (DSO) (Engel et al., 2018) and fast semi-direct monocular visual odometry (SVO)-2 (Forster et al., 2017). These methods rely either partially or even completely on the change of pixel intensity values. Usually, these methods are not able to achieve the same estimation quality as geometry-based VO/VSLAM methods like ORB-SLAM, as they require photometric calibration to remove vignetting effects, as well as control and modeling of exposure, gain, and luminosity (amongst other camera-specific parameters). Direct methods operating in underwater environments would need to take into consideration that the intensity of pixels is a function of the observation distance (given that the attenuation of light in water is wavelength dependent). Challenges would also arise from the likely uneven active illumination of the scene and the presence of dynamic elements in the scene, like fish swimming in front of the camera.

Several artificial intelligence techniques that exploit convolutional neural networks (CNNs) have been proposed for performing VO/VSLAM (Li et al., 2018; Mohanty et al., 2016; Tateno et al., 2017). Also, the direct fusion of inertial measurement unit (IMU) measurements has been demonstrated to be possible and effective (Clark et al., 2017). These methods have the advantage of not requiring any knowledge about the camera and can provide a metric reconstruction with a monocular camera. Training the CNNs in such systems does require ground truth, but unfortunately, ground truth data for natural underwater environments are not available. Producing ground truth data for underwater positioning is very challenging, and for real image sequences it is in general of lower quality than what can be achieved with in-air motion capture systems or real-time kinematic (RTK-GPS) solutions. However, synthetic underwater datasets (Zwilgmeyer et al., 2021; Hodne et al., 2022) exist.

1.1.2 Underwater specific

Due to various challenges present in underwater vision, many underwater VO/VSLAM and photogrammetry-specific solutions have been proposed that make use of a stereo system (Salvi et al., 2008; Swirski et al., 2010; Wirth et al., 2013; Bonin-Font et al., 2015; Leonardi et al., 2017; Johnson-Roberson et al., 2017; Gracias et al., 2017), whereas monocular solutions are rarely available. However, reduced systems, such as visual pose estimation approaches for underwater scenarios, do exist (Duda et al., 2015; Schellewald et al., 2021).

Recently, a monocular geometry-based approach (Ferrera et al., 2019) that uses the Lucas-Kanade optical flow to identify keypoint matches has been published; while this approach requires low computational resources, its tracking system has limitations. Because of its local formulation, keypoint matches are restricted to similar locations in pixel coordinates. Even if a camera with a high frame rate is used and constant illumination is possible, in natural underwater environments it is very likely that a series of images is highly blurred or obstructed, further limiting the ability to correctly match keypoints. Without a motion model, tracking would be lost and a new initialization from scratch would be needed. In addition, (Ferrera et al., 2019) does not consider loop detection and correction.

Underwater feature-based visual SLAM has already been of interest in (Aulinas et al., 2011), where a scale-invariant feature transform (SIFT)/speeded-up robust features (SURF)-based extractor is proposed, preceded by a non-uniform light correction and a normalization. The authors do not describe a complete visual SLAM method for underwater applications; their focus lies on finding regions of interest (RoI) that can be utilized as landmarks for loop detection and correction. The idea of using only salient color RoIs strongly limits the capability of (Aulinas et al., 2011) to close loops: as shown in our work, features present on sand can be used to safely close loops, thanks to co-visibility keyframe analysis. Deep learning techniques, specifically convolutional autoencoders, have been investigated in order to detect and close loops in underwater applications (Leonardi & Stahl, 2018; Maldonado-Ramírez et al., 2019).

The use of monocular ORB-SLAM in an underwater environment has been investigated in (Hidalgo et al., 2018), where the behavior of ORB-SLAM is analyzed on challenging sequences captured with a monocular camera mounted on a remotely operated vehicle (ROV). The authors suggest enhancing ORB-SLAM with sub-mapping capabilities, which is also presented in this paper, with the addition of a motion model that updates the current pose (see Sect. 6). Keypoints detected by ORB-SLAM in underwater scenes could belong to various elements that are not suitable for ego-motion tracking and mapping; for this reason, a deep learning-based approach has been developed to reject such keypoints (Leonardi et al., 2020).

ORB-SLAM has been found to be the best-performing visual SLAM by the authors of the Aqualoc dataset (Ferrera et al., 2018), justifying our goal of improving upon it for underwater monocular VSLAM applications.

2 System overview

In this section, a brief overview of the UVS system is given (a graphical overview of the system is shown in Fig. 2); details are explained in the following sections.

When a new frame is recorded, it is inserted into a first-in-first-out (FIFO) queue. In the initialization phase, a three-view initialization is attempted utilizing the frames in the queue. If initialization is successful, the system continues processing this queue as long as the triangulation of new points is completed. This queue prevents frames from being dropped, trading the real-time constraint for tracking robustness; a discussion of this trade-off is presented in Sect. 5 on the partial synchronization between front-end and back-end. The queue runs on an independent thread and performs the ORB feature extraction and descriptor calculation. The ORB features are extracted homogeneously over the frame utilizing a grid, in the same way as in ORB-SLAM.
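As an illustration, the following is a minimal sketch of such grid-based, homogeneous ORB extraction using OpenCV; the cell size, FAST threshold, and per-cell keypoint budget are illustrative values, not the exact parameters used in ORB-SLAM or UVS.

// Sketch: homogeneous ORB extraction over a fixed grid (OpenCV).
// Cell size, FAST threshold and per-cell budget are illustrative.
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>

std::vector<cv::KeyPoint> extractGridORB(const cv::Mat& gray, cv::Mat& descriptors,
                                         int cell = 64, int perCell = 10) {
    std::vector<cv::KeyPoint> keypoints;
    for (int y = 0; y < gray.rows; y += cell) {
        for (int x = 0; x < gray.cols; x += cell) {
            cv::Rect roi(x, y, std::min(cell, gray.cols - x), std::min(cell, gray.rows - y));
            std::vector<cv::KeyPoint> cellKps;
            cv::FAST(gray(roi), cellKps, /*threshold=*/20, /*nonmaxSuppression=*/true);
            // Keep only the strongest responses so that every cell contributes evenly.
            cv::KeyPointsFilter::retainBest(cellKps, perCell);
            for (auto& kp : cellKps) {            // shift back to full-image coordinates
                kp.pt.x += static_cast<float>(x);
                kp.pt.y += static_cast<float>(y);
                keypoints.push_back(kp);
            }
        }
    }
    // Compute the 256-bit ORB descriptors for the homogeneously distributed keypoints.
    cv::ORB::create()->compute(gray, keypoints, descriptors);
    return keypoints;
}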

Fig. 2
figure 2

Block diagram of the UVS system: in purple the system input, in green the different threads, in blue the map representation, in yellow the innovations to ORB-SLAM (Color figure online)

If tracking is lost while frames are collected, a constant velocity motion model is applied to predict the current camera pose. As soon as a new initialization is available, the estimation process restarts from the last camera pose. Following the same criteria as in ORB-SLAM ((Mur-Artal et al., 2015), section V, subsections B and D), several attempts to track the local map are performed, and evaluations are made to decide whether it is convenient to insert a new keyframe.

When a new keyframe is inserted, new map points are created considering neighboring keyframes in the covisibility graph. The new map points are created by considering all possible keypoint matches between keyframe pairs. Points that pass the chirality, epipolar, parallax, scale consistency, and reprojection tests are considered high-quality points. These points are inserted into the map and are subject to the local bundle adjustment (BA) procedure.

At the end of tracking, the station-keeping detection procedure runs on an asynchronous thread. The goal is to identify situations where the robot is held stationary by a dynamic positioning system, so that the camera is constantly looking at the same scene. In such a situation the creation of new map points and keyframes is not necessary, and the computational resources can be allocated to map and pose optimization.

Two more procedures run concurrently: the loop closing thread and the pruning thread. The loop closing thread is in charge of detecting loop closure candidates and eventually performing loop fusion and essential graph optimization. The pruning thread manages the memory resources, first pruning map points that are farther away, then eventually entire keyframes with all their observed map points.

3 Three-view initialization

Correct map initialization is a crucial step in feature-based VO/VSLAM algorithms, as tracking starts from the initial map. In ORB-SLAM, initialization is performed by analyzing two views. The first relative pose is estimated with a homography or a fundamental matrix; a choice between these two models is made using a score based on symmetric transfer errors (Hartley & Zisserman, 2003). The homography is eventually needed to estimate the relative camera pose because, for coplanar points, the estimation of the fundamental matrix yields a set of possible solutions (Hartley & Zisserman, 2003). We see three problems with this solution:

  1. This procedure selects the homography model on not perfectly planar surfaces, which leads to an approximate solution with an error of unknown magnitude, considering the lack of scale information.

  2. Visual SLAM, in general, is a three-view problem: estimating the relative scale requires scene overlap in three frames.

  3. In a typical scenario, the underwater VSLAM system considers a camera pointing downwards, towards the seabed. This means that the main motion is a sideways motion, for which the estimation of a fundamental matrix with the 6-, 7-, or 8-point algorithms is not recommended (Nistér, 2004).

For this reason, we propose a three-view initialization procedure with a unique model for both planar and non-planar scenes, based on the estimation of the essential matrix exploiting the 5-point algorithm from (Nistér, 2004) (it would however be preferable to use the iterative version of the 5-point algorithm (Lui & Drummond, 2013), due to the potential ill-conditioning of the high-degree polynomials involved in the original formulation). By calculating the essential matrix using more than two views, and knowing the intrinsic parameters, a unique solution can be obtained in every possible case (cf. Table 1 of (Nistér, 2004), which shows the degrees of ambiguity in the face of planar degeneracy for pose estimation).

For each frame, the keypoints and descriptors are extracted, and only frames where at least 100 keypoints can be detected are retained, as in ORB-SLAM (Mur-Artal et al., 2015).

A combination of random sample consensus (RANSAC) and chirality tests is used to find the unique essential matrices and the correctly triangulated corresponding keypoints. The optimal triangulation method ((Hartley & Zisserman, 2003), Algorithm 12.1) is employed to estimate the initial map. A further test (Longuet-Higgins, 1986) on the sum of the two-view parallaxes is performed to ensure the quality of the solution, as a high parallax limits the uncertainty on the initial map and poses. To limit the computational complexity, the first two views that satisfy the original criteria of a fundamental matrix-based initialization in ORB-SLAM (Mur-Artal et al., 2015) are found. A map is created, and a third view is searched for in the pool of available frames to guarantee uniqueness, improve the 3D point estimates, and reduce the chances that a new initialization will be needed soon again. The third view has to be subsequent to both the other two views. Immediately after initialization, a full BA is performed, refining poses and 3D points. A simplified sketch of the two-view core of this procedure is given below.
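The following minimal sketch of the two-view core is written with OpenCV; the RANSAC parameters and the minimum inlier count are illustrative, and the third-view search, the parallax test, and the full BA described above are omitted.

// Sketch of the two-view core of the initialization (OpenCV). K is assumed to be
// a CV_64F intrinsic matrix; thresholds are illustrative. The third-view search,
// the parallax test and the full BA described in the text are omitted.
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>

bool initializeTwoView(const std::vector<cv::Point2f>& pts1,
                       const std::vector<cv::Point2f>& pts2,
                       const cv::Mat& K,
                       cv::Mat& R, cv::Mat& t, cv::Mat& points4D) {
    // 5-point algorithm inside RANSAC: a single model for planar and non-planar scenes.
    cv::Mat inliers;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, inliers);
    if (E.empty()) return false;

    // recoverPose performs the chirality test: only points in front of both cameras count.
    int good = cv::recoverPose(E, pts1, pts2, K, R, t, inliers);
    if (good < 100) return false;                 // require enough well-triangulated points

    // Triangulate the correspondences to obtain the initial map (first camera at the origin).
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);
    return true;
}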

4 Complete keypoint set matching

When a new keyframe is inserted into the system by the mapping thread, new 3D points are triangulated using keyframes present in the covisibility graph. A simple and effective way to match keypoints is a brute force approach, where the descriptor associated with each keypoint is compared with all other descriptors, utilizing the Hamming distance. Brute force matching guarantees that every pair of descriptors is compared, providing an exact solution to the matching problem, at the cost of roughly \(\mathcal {O}(n^2)\) comparison operations (assuming n keypoints in each of the two frames). In ORB-SLAM brute force matching also happens, but it is restricted to features that belong to the same node of the vocabulary tree, speeding up the search. While this restriction provides a considerable speed-up compared to brute force, it limits the number of new keypoint candidates, leading to a sparser map and therefore reduced tracking robustness. Vocabularies are also sensitive to training, eventually contributing to an overall uneven SLAM performance across different datasets and sequences.

The goal is to avoid using the vocabulary but still be able to perform fast descriptor matching. A crucial fact is that the actual speed of binary descriptor matching is very sensitive to how it is performed: naïve methods can be a hundred times slower than deeply optimized methods, which use proper data structures or exploit specific instruction sets. Specifically, utilizing 64-bit integers, unrolled loops, and the \(x86\_64\) instruction \(\_popcnt64()\), it is possible to obtain a 3X speedup compared to the brute force binary descriptor matcher available in OpenCV (Bradski, 2000).
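As an illustration, a minimal sketch of the core distance computation is given below, assuming 256-bit (32-byte) ORB descriptors stored as four consecutive 64-bit words; the exact unrolling, memory layout, and threading of the released matcher (Avetisyan, 2020) may differ.

// Sketch of the popcount-based Hamming distance for 256-bit ORB descriptors,
// viewed as four 64-bit words (compile with popcnt support, e.g. -mpopcnt).
#include <cstdint>
#include <cstddef>
#include <nmmintrin.h>   // _mm_popcnt_u64

inline uint32_t hamming256(const uint64_t* a, const uint64_t* b) {
    // Four XOR + popcount operations cover the whole 256-bit descriptor.
    return static_cast<uint32_t>(_mm_popcnt_u64(a[0] ^ b[0]) +
                                 _mm_popcnt_u64(a[1] ^ b[1]) +
                                 _mm_popcnt_u64(a[2] ^ b[2]) +
                                 _mm_popcnt_u64(a[3] ^ b[3]));
}

// Exhaustive matching: for each descriptor in A, find its nearest neighbour in B.
void bruteForceMatch(const uint64_t* A, size_t nA, const uint64_t* B, size_t nB,
                     uint32_t* bestIdx, uint32_t* bestDist) {
    for (size_t i = 0; i < nA; ++i) {
        bestDist[i] = 257;                        // larger than any possible distance
        bestIdx[i] = 0;
        for (size_t j = 0; j < nB; ++j) {
            const uint32_t d = hamming256(A + 4 * i, B + 4 * j);
            if (d < bestDist[i]) { bestDist[i] = d; bestIdx[i] = static_cast<uint32_t>(j); }
        }
    }
}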

The OpenCV matcher is itself already roughly a 60X speedup compared to a naïve implementation; all together, our implementation is 180X faster than a naïve comparison. Compared with the matching time of the distributed bag-of-word 2 (DBoW2) (Gálvez-López & Tardos, 2012) based matching in ORB-SLAM, it is just 30% slower on average (10 milliseconds against 7.6 milliseconds). A public implementation of this matcher is available (Avetisyan, 2020). As the matching procedure involves only a pair of keyframes at a time, this process can be easily parallelized by launching a thread for each pair of keyframes.

Fig. 3
figure 3

Three different SLAM estimates generated on sequence 00 of the KITTI dataset are presented. On the left is the map generated by ORB-SLAM. In the center is the map generated by UVS, where a series of frames have been skipped to force the system to lose tracking. The sections where this skipping has been performed are surrounded by red bounding boxes and enumerated. On the right is the map generated by ORB-SLAM when the same frame-skipping operation is performed. Due to the relative scale difference between the left/central and the right figure, corresponding sections are surrounded by a green bounding box (Color figure online)

5 Partial front/back-end synchronization

In ORB-SLAM, the back-end and the front-end run asynchronously. This allows the system to be as responsive as possible to new frames. Unfortunately, the two are dependent on each other when it comes to estimating the next camera pose: if the back-end does not create enough map points in time, tracking will be lost, regardless of the quality of the visual information in the frames. ORB-SLAM is therefore strictly dependent on the hardware configuration. In UVS, the front-end and back-end are partially synchronized: UVS accepts new frames only when a new keyframe insertion, and thus the creation of new map points, has been completed. If frames arrive before the triangulation is completed, they are placed in the FIFO queue. With this strategy the dependency on the hardware configuration is not eliminated, as a fast frame rate can fill the FIFO queue, so the SLAM estimate could lag behind the current pose of the robot. The dependency on the hardware configuration is instead shifted from real-time performance towards tracking robustness and loop closure quantity and quality. A simplified sketch of this coupling is given below.
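In the sketch, the Frame type and the tracking/mapping calls are placeholders, not the actual UVS interfaces.

// Sketch of the partial front-/back-end synchronization: frames are buffered in a
// FIFO queue and the tracker pops a new frame only after the previous keyframe
// insertion and triangulation have completed, so no frame is ever dropped.
#include <condition_variable>
#include <mutex>
#include <queue>

struct Frame { /* image, ORB keypoints, descriptors */ };

class FrameQueue {
public:
    void push(Frame f) {                          // called by the acquisition/extraction thread
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(f));
        cv_.notify_one();
    }
    Frame pop() {                                 // called by the tracking thread
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Frame f = std::move(q_.front());
        q_.pop();
        return f;
    }
private:
    std::queue<Frame> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Tracking loop (placeholder calls): processing may lag behind real time, but
// tracking robustness is preserved because every frame is eventually processed.
// for (;;) {
//     Frame f = queue.pop();                     // blocks until a frame is available
//     trackLocalMap(f);
//     if (needNewKeyFrame(f))
//         insertKeyFrameAndTriangulate(f);       // no new pop() until this returns
// }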

6 Motion model

ORB-SLAM uses a constant velocity motion model, see Eq. (1). This motion model is used only for tracking the local map given the current frame. The motion model is implemented as a differential frame-to-frame camera pose, by means of a homogeneous transformation matrix:

$$\begin{aligned} \begin{aligned} V_c^w&= X_c^w P_w^c \\ X_c^w&= V_c^w P_c^w \end{aligned} \end{aligned}$$
(1)

Here, \(P \in SE(3)\) represents the previous pose, \(X \in SE(3)\) the current pose, and \(V \in SE(3)\) the differential pose between the two, where SE stands for the special Euclidean group. While this functionality is preserved in our system, the motion model is also used for a different purpose. When tracking is lost, such as in the case of a complete temporary occlusion (and a few keyframes in addition to the initial ones have been created), ORB-SLAM enters a re-localization mode. In this mode, no re-initialization is possible, with the consequence that even in the presence of frames with which initialization, mapping, and tracking would be possible, no action is taken and all the potential maps and camera poses are lost (see Fig. 3). Our solution to this problem is to continuously update the current pose of the camera using the previously described motion model, while a new initialization is attempted. If a new initialization is successful, the previous map and graph information is retained, and the system restarts tracking and mapping. This process does not block the attempts at relocalization; instead, relocalization attempts continue, and if relocalization is successful, the motion model predictions are discarded. If the camera motion contradicts the motion model, the new map estimate obviously starts with a substantially wrong pose. It is worth mentioning that in case of a loop closure, a global map can be created and a valid SLAM estimate can still be achieved, thanks to the loop alignment procedure and the essential graph optimization. In concrete situations of camera-only underwater navigation of autonomous underwater vehicles, this approach is likely to be very effective, as the motion is mostly unidirectional, with mild accelerations, making a constant velocity motion model a valid model.
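As an illustration, a minimal sketch of Eq. (1) using Eigen is given below; the structure and member names are ours and are not taken from the ORB-SLAM or UVS code.

// Sketch of the constant velocity motion model of Eq. (1) with Eigen. P is the
// previous camera pose and X the current one (both elements of SE(3)); V is the
// differential pose, reused to predict the pose whenever tracking is lost.
#include <Eigen/Geometry>

struct MotionModel {
    Eigen::Isometry3d V = Eigen::Isometry3d::Identity();   // frame-to-frame motion

    // Update the velocity from two consecutive, successfully tracked poses:
    // V_c^w = X_c^w * (P_c^w)^{-1}
    void update(const Eigen::Isometry3d& P, const Eigen::Isometry3d& X) {
        V = X * P.inverse();
    }

    // Predict the next pose when no visual information is available:
    // X_c^w = V_c^w * P_c^w   (applied repeatedly while tracking is lost)
    Eigen::Isometry3d predict(const Eigen::Isometry3d& P) const {
        return V * P;
    }
};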

7 Loop closure

Loop closure in ORB-SLAM is composed of three main parts: loop detection, loop validation/loop correction, and optimization. The loop detection procedure ensures that a loop is found in several consecutive keyframes. This procedure produces loop candidates which are on average seventy times more numerous than the number of corrected loops, both in in-air and underwater datasets. The loop validation proceeds to compute a similarity transform with the keyframes involved in the eventual loop closure. Using the found transformation, a geometric validation of the loop is performed, and if the validation is successful, the loop is accepted and a loop correction can be initiated. In ORB-SLAM, loop detection and loop validation both rely on the bag-of-words (BoW) model, but in two different ways: loop detection uses BoW vectors to compute a similarity score between keyframes, while loop validation uses them to speed up feature matching. Speeding up the matching process by using DBoW2 significantly decreases the number of valid matches that can be obtained, and this can cause the loss of a perfectly valid loop closure. As loop validation runs in the loop closure thread, the small extra computational effort required to perform a complete keypoint set matching (using the method described in Sect. 4) does not directly impact the tracking and mapping threads.

One could expect that re-training the BoW vocabulary would lead to higher quality and quantity of loop corrections, but our experiments have only confirmed this hypothesis to a minor degree. A 1 million-word vocabulary trained with underwater images (Scott Reef 25 and Tasmania O’Hara 7 (for Field Robotics, 2009)) shows no improvement in loop detection and correction on the Aqualoc sequences compared to the vocabulary supplied with ORB-SLAM. A vocabulary trained with the same settings on the Aqualoc dataset does show improvements compared to the one supplied with ORB-SLAM; even so, it produces only 60% of the correct matches produced by the UVS system.

8 Station-keeping detection

A typical operation in marine robotics is station-keeping. During this operation, the robot tries to keep its position stationary relative to an object or to the environment. It is popular to perform this operation using vision sensors (Marks et al., 1994; Cufí et al., 2002; Kuo et al., 2017). As the scene is almost static, many visual SLAM operations, like keyframe insertion and new map point triangulation, are not required. Ideally, it would be beneficial for SLAM systems to receive a message from the control system about the initialization and termination of the station-keeping operation, keeping in mind that there could be a delay between when station-keeping is initiated and when the visual scene actually becomes static. Furthermore, it is not always practical to interface the control system with the SLAM system. Therefore, we propose to detect an ongoing station-keeping operation (a situation where tracking is successful and no relative motion can be estimated from images for a certain time period), and use this information to allocate CPU time to perform a global BA, improving the poses and the map.

We propose an approach to detect station-keeping in absence of absolute scale information, looking at three indicators:

  • the ratio r of commonly observed map points Obs between an initial frame \(f_1\) and the following frames, see Eq. (2)

  • the Mahalanobis distance \(d_{\vec {\theta }}\) over the attitude derivatives \(\frac{\textrm{d}{\vec {\theta }}}{\textrm{d}t}\), see Eq. (3)

  • the average angle \(\alpha _{avg}\) between velocity vectors \(\vec {v}\), see Eq. (4)

The n frames \(f_i, i = 1,...,n\), taken into consideration define the station-keeping detection window. The window size n is user-defined and depends on the desired sensitivity. As a general indication, it is advisable to set n to 15 times the frame rate, so that false detections are avoided. Station-keeping is detected when these indicators satisfy a series of conditions, see Eq. (5); a sketch of the complete test is given at the end of this section. A detailed description of the indicators follows: the ratio r of commonly observed map points is given by

$$\begin{aligned} r = \frac{ \textrm{card}(Obs_{f_1} \cap Obs_{f_2} \cap ... \cap Obs_{f_n})}{ \textrm{card}(Obs_{f_1})},\ \ \ \ \ r \le 1, \end{aligned}$$
(2)

where \(Obs_{f_i}\) represents the set of map points (as a set of map point IDs) observed by frame \(f_i, i=1,...,n\), and \(\textrm{card}(Obs_{f_1})\) represents the number of map points observed by frame \(f_1\). As a consequence, the numerator of Eq. (2) counts the intersection set, consisting of those map points with distinct IDs which can be identified in all the station-keeping window frames. In an ideal station-keeping situation, all the frames in consideration would observe the same map points, and the ratio would be equal to one. The Mahalanobis distance \(d_{\vec {\theta }}\) of the zero vector under the estimated Gaussian distribution over the attitude derivatives, calculated as frame-to-frame attitude differentials and expressed as Euler angles, is given by

$$\begin{aligned} \begin{aligned} d_{\vec {\theta }}&= \sqrt{ (\vec {0}-\vec {\mu }_{MLE})^T {{\Sigma }^{-1}_{MLE}} (\vec {0}-\vec {\mu }_{MLE}) }, \end{aligned} \end{aligned}$$
(3)

It should be equal to 0, as we expect a robot that is performing station-keeping to be stationary, without performing any rotational motion. The Gaussian distribution \(\mathcal {N}(\vec {\mu }, \Sigma )\) is calculated using a maximum likelihood estimator (MLE):

$$\begin{aligned} \begin{aligned} \vec {\mu }_{MLE}&= \mathop {{{\,\mathrm{\arg \max }\,}}}\limits _{\vec {\mu }}\ \mathcal {N}\left( \left[ \frac{\textrm{d}{\vec {\theta }}}{\textrm{d}t}w_a\right] \ |\ \vec {\mu },\ \Sigma \right) , \\ \Sigma _{MLE}&= \mathop {{{\,\mathrm{\arg \max }\,}}}\limits _\Sigma \ \mathcal {N}\left( \left[ \frac{\textrm{d}{\vec {\theta }}}{\textrm{d}t}w_a\right] \ |\ \vec {\mu },\ \Sigma \right) , \end{aligned} \end{aligned}$$

where \(w_a = \frac{1}{3}\) represents a normalization factor, expressed in degrees. A smaller \(w_a\) implies that more rotation is tolerated when it comes to detecting station-keeping. The last indicator is the average angle between velocity vectors, inside the station-keeping detection window:

$$\begin{aligned} \begin{aligned} \alpha _{avg}&= \frac{1}{n-1}\sum ^{n-1}_{i=1}{\arccos { \left( \frac{ \vec {v}_1 \cdot \vec {v}_{i+1} }{\left\Vert \vec {v}_1\right\Vert \left\Vert \vec {v}_{i+1}\right\Vert } \right) } } \end{aligned} \end{aligned}$$
(4)

where \(\vec {v}_i\), \(i=1,...,n\), is the velocity vector associated with frame \(f_i\).

During the robot’s forward or backward movement, all of these velocity vectors would point roughly in the same direction and, as a consequence, \(\alpha _{avg}\) should be close to zero. When the robot is turning, \(\alpha _{avg}\) should be higher than a few degrees; note that the attitude is also changing. In an ideal station-keeping situation, \(\alpha _{avg}\) would have a high value, as a stationary robot would in theory produce uniformly distributed velocity vectors, while attitude changes would be small. Considering \(\alpha _{avg}\) together with r, the ratio of commonly observed map points, and \(d_{\vec {\theta }}\), an indicator of the attitude derivatives, is important: a robot moving slowly through a large landscape could produce small attitude changes while several frames still observe the same map points. In such a case, the evaluation of \(\alpha _{avg}\) allows distinguishing a moving robot from a stationary one.

Finally, the three conditions which have to be satisfied for the station-keeping to be detected are:

$$\begin{aligned} {\left\{ \begin{array}{ll} r> 0.9 \\ d_{\vec {\theta }} < 1\\ \alpha _{avg} > 45^\circ \end{array}\right. } \end{aligned}$$
(5)

When station-keeping is detected, no new keyframes are created and a global BA starts. When the number of tracked points in the last frame drops below 50, the global BA is aborted, and tracking and mapping continue as normal. Furthermore, once station-keeping ends, n frames have to pass (where n is the station-keeping window size) before station-keeping can be detected again.
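The following is a condensed sketch of the complete test of Eqs. (2)-(5); the per-frame data structure is a placeholder, and the maximum likelihood fit is written out as a plain sample mean and covariance for brevity.

// Sketch of the station-keeping test of Eqs. (2)-(5) over a detection window.
#include <algorithm>
#include <cmath>
#include <iterator>
#include <set>
#include <vector>
#include <Eigen/Dense>

struct FrameInfo {
    std::set<long> obs;          // IDs of the observed map points
    Eigen::Vector3d dTheta;      // frame-to-frame attitude derivative (Euler angles)
    Eigen::Vector3d v;           // velocity vector
};

bool stationKeeping(const std::vector<FrameInfo>& w, double wa = 1.0 / 3.0) {
    const size_t n = w.size();
    // Eq. (2): ratio of map points commonly observed by all frames of the window.
    std::set<long> common = w[0].obs;
    for (size_t i = 1; i < n; ++i) {
        std::set<long> tmp;
        std::set_intersection(common.begin(), common.end(),
                              w[i].obs.begin(), w[i].obs.end(),
                              std::inserter(tmp, tmp.begin()));
        common.swap(tmp);
    }
    const double r = static_cast<double>(common.size()) / w[0].obs.size();

    // Eq. (3): Mahalanobis distance of zero rotation under the Gaussian fitted
    // to the (w_a-scaled) attitude derivatives.
    Eigen::Vector3d mu = Eigen::Vector3d::Zero();
    for (const auto& f : w) mu += wa * f.dTheta;
    mu /= static_cast<double>(n);
    Eigen::Matrix3d S = Eigen::Matrix3d::Zero();
    for (const auto& f : w) {
        const Eigen::Vector3d d = wa * f.dTheta - mu;
        S += d * d.transpose();
    }
    S /= static_cast<double>(n);
    const Eigen::Vector3d Sinv_mu = S.inverse() * mu;
    const double dTheta = std::sqrt(mu.dot(Sinv_mu));

    // Eq. (4): average angle between the first velocity vector and the others.
    double alpha = 0.0;
    for (size_t i = 1; i < n; ++i)
        alpha += std::acos(w[0].v.dot(w[i].v) / (w[0].v.norm() * w[i].v.norm()));
    alpha = alpha / (n - 1) * 180.0 / M_PI;       // degrees

    // Eq. (5): all three conditions must hold.
    return r > 0.9 && dTheta < 1.0 && alpha > 45.0;
}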

9 Pruning

As an underwater robot explores an open environment, the data generated by a visual SLAM system grows unbounded. All this data naturally resides in the system memory, which is limited and also needed by the operating system and other programs. It follows that a pruning procedure is obligatory in a system that claims to be able to perform life-long operations. Pruning not only bounds the system memory requirements but also bounds the computational complexity of global operations, like loop correction and global bundle adjustment. The pruning strategy presented in this paper introduces a new data structure: the partially pruned keyframe. A partially pruned keyframe is a keyframe that observes only a fraction of the map points that it observed before the partial pruning operation. Since loop closure is a central operation in visual SLAM, the pruning strategy described here tries to keep re-localization and loop correction possible on partially pruned keyframes. Partially pruned keyframes lower the memory footprint of the UVS system while keeping vital information for loop closure and path planning. The map points are pruned in such a way that the remaining points are uniformly distributed; in this way, valid information is kept for the region where these map points are present, for future use.

In addition, uniformly sampled surface points maximize the probability of a correct terrain reconstruction, within the limits of the Nyquist-Shannon sampling theorem.

The partial pruning procedure retains 33% of the points originally observed by a keyframe. When a keyframe gets partially pruned, a Boolean flag is set and the covisibility graph weights are updated accordingly. As soon as the first pruning is initiated, the loop detection thresholds are lowered by 66%, and the loop verification procedure uses the per-keyframe Boolean flag to take into account the lower number of observed map points.

The utilized system memory is checked with a system call when a new keyframe is inserted. No pruning is performed until the utilized memory reaches 90% of the set threshold. All pruning operations start from the keyframe which is farthest away from the active keyframe and proceed following the creation order through the essential graph. Once all the keyframes that do not belong to the local map have been partially pruned, keyframes are pruned entirely and the covisibility graph is updated. A sketch of this policy is given below.
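In the sketch, the keyframe structure, the memory query, and the farthest-first keyframe ordering are placeholders for the actual UVS data structures.

// Sketch of the memory-triggered pruning policy. The retention of roughly every
// third map point follows the 33% figure given in the text.
#include <cstddef>
#include <vector>

struct KeyFrame {
    std::vector<long> mapPointIds;   // map points observed by this keyframe
    bool partiallyPruned = false;
};

// Keep roughly one point out of three so that the retained points stay
// uniformly distributed over the keyframe.
void partialPrune(KeyFrame& kf) {
    std::vector<long> kept;
    for (std::size_t i = 0; i < kf.mapPointIds.size(); ++i)
        if (i % 3 == 0) kept.push_back(kf.mapPointIds[i]);
    kf.mapPointIds.swap(kept);
    kf.partiallyPruned = true;       // loop validation will relax its thresholds
}

// farthestFirst: keyframes ordered away from the active keyframe, following
// the creation order through the essential graph.
void pruneIfNeeded(std::vector<KeyFrame*>& farthestFirst,
                   double usedMemoryMB, double thresholdMB) {
    if (usedMemoryMB < 0.9 * thresholdMB) return;      // prune only above 90% of the target
    for (KeyFrame* kf : farthestFirst)
        if (!kf->partiallyPruned) { partialPrune(*kf); return; }
    // All non-local keyframes are already partially pruned: a full keyframe
    // removal (and a covisibility graph update) would follow here.
}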

10 Results

In this section, we present a comprehensive analysis comparing our modifications of the monocular ORB-SLAM(2) to the original work, using in-air and underwater datasets. The analysis of underwater datasets also includes the state-of-the-art direct VSLAM LDSO (Gao et al., 2018), which adds loop closure capabilities to the original DSO (Engel et al., 2018). In addition, we present a comprehensive analysis of each stated contribution, to highlight the strengths and weaknesses of the UVS system. Regarding benchmarking on public datasets, we analyze several sequences from the KITTI dataset (Geiger et al., 2012), the TUM-RGB-D dataset (Sturm et al., 2012), the Aqualoc dataset (Ferrera et al., 2018), and the RTMVO dataset (Ferrera et al., 2019). Two additional private datasets were used to further verify and highlight the pros and cons of the contributions proposed in this paper. Every single experiment was performed 5 times, and the median value is reported. Reporting median values over several experiments is necessary, as both ORB-SLAM and UVS contain several stochastic mechanisms, so that results cannot be exactly repeated. Examples of such stochastic mechanisms are the BRIEF descriptor extraction and the non-convex optimization.

Here is a list and a brief description of the restricted datasets:

  1. Herkules Relict Front-Camera: A 30 fps, 1080p video from the front camera used to manually guide an ROV. The sequence starts by looking at the empty sea and then reaches the relict of a fishing ship. As the ROV is positioned so that the front camera faces a side of the ship (Leonardi et al., 2017), the ROV activates dynamic positioning. This sequence is used to demonstrate the ability of the UVS system to detect station-keeping using only visual information.

  2. Down-looking camera on the NTNU light autonomous underwater vehicle (LAUV) “Fridtjof” from OceanScan-MST, recorded at Kjerringholmen North, Norway: The sequence is captured at approximately 4 fps at a resolution of 1376x1032 pixels. In this sequence, a LAUV slowly dives with a small pitch angle towards the seabed and then keeps a constant distance to it at a constant velocity for a certain amount of time. At the start of the sequence, the seabed is mostly composed of rocks, and it slowly becomes sandy.

    This sequence is used to demonstrate the ability of the UVS system to perform VSLAM under low visibility and low frame-to-frame scene overlap (on average, every 10 frames the system is presented with a completely new scene). Natural and artificial illumination of the scene is very poor, as the seabed is located at a depth between 40 and 60 m (Norwegian waters), for which the illumination system of the LAUV is not powerful enough. A per-frame “ground truth” is provided, thanks to the LAUV navigation system.

10.1 Overall performance analysis

In this subsection, we analyze several key performance indicators. The results are presented separately for in-air and underwater datasets, see Table 1 and Table 2, respectively. All tests have been performed on an i7 5960X @ 3.0 GHz platform with 32 GB of DDR4 at 2133 MHz. Regarding the trajectories, the following operations are performed before calculating the median keyframe RMSE (a sketch of the alignment step is given after the list):

  1. KITTI: 7 degrees of freedom (DoF) alignment with the method of (Umeyama, 1991).

  2. TUM RGB-D: Unfortunately, the ground-truth data given is not as convenient as that of KITTI, as it does not exactly match the frames in terms of timestamps. To compare the estimated trajectories, each keyframe is compared to the ground-truth element with the closest timestamp. The trajectories are then aligned with the 7 DoF method of (Umeyama, 1991).

  3. Aqualoc: Ground truth is provided for every 5th frame. The solution proposed here is to linearly interpolate these poses to obtain per-frame ground truth information. After that, the trajectories are aligned with the 7 DoF method of (Umeyama, 1991).

  4. RTMVO: Same setting as the Aqualoc dataset.
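As an illustration of the alignment step referenced above, the following sketch uses Eigen's umeyama function; the 3xN trajectory layout and the RMSE helper are our own assumptions about the evaluation code, not the exact scripts used to produce the tables.

// Sketch of the 7 DoF (similarity) trajectory alignment with Eigen::umeyama
// (Umeyama, 1991). Trajectories are assumed to be stored as 3xN matrices of
// corresponding keyframe and ground-truth positions.
#include <cmath>
#include <Eigen/Dense>
#include <Eigen/Geometry>   // Eigen::umeyama

// Returns the 4x4 similarity transform (rotation, translation, scale) mapping
// the estimated trajectory onto the ground truth in the least-squares sense.
Eigen::Matrix4d alignSim3(const Eigen::Matrix3Xd& estimated,
                          const Eigen::Matrix3Xd& groundTruth) {
    return Eigen::umeyama(estimated, groundTruth, /*with_scaling=*/true);
}

// RMSE computed on the aligned keyframe positions.
double keyframeRmse(const Eigen::Matrix3Xd& estimated,
                    const Eigen::Matrix3Xd& groundTruth) {
    const Eigen::Matrix4d T = alignSim3(estimated, groundTruth);
    const Eigen::Matrix3d sR = T.topLeftCorner<3, 3>();     // scaled rotation
    const Eigen::Vector3d trans = T.topRightCorner<3, 1>();
    const Eigen::Matrix3Xd aligned = (sR * estimated).colwise() + trans;
    return std::sqrt((aligned - groundTruth).colwise().squaredNorm().mean());
}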

Together with the raw data, we summarize the data in terms of the gain/loss percentage moving from ORB-SLAM to UVS. This summary is computed over the sequences where both ORB-SLAM and UVS initialize within the first 5% of the images of a sequence. In all the tables, positive changes are highlighted in bold italics, neutral changes are in bold, and negative changes are highlighted in italics.

For each table column, the reasoning is as follows:

  1. The decrease in the median keyframe RMSE is positive, as the UVS system generates a trajectory closer to the ground truth.

  2. The increase in the number of map points is positive, as it can potentially produce a higher number of loop closures and positively impact robustness.

  3. The decrease/increase in the number of detected loop closures is neutral: it is arguable whether a higher number of detected loop closures leads to a higher number of corrected loop closures. Table 1 shows that the number of detected loop closures is still over two orders of magnitude higher than the number of corrected loop closures, and UVS always corrects an equal or higher number of loops. Thus, the additional loops detected by ORB-SLAM are loops that do not pass the geometric verification.

  4. The decrease in the number of keyframes would be positive in itself; however, the UVS system is in general able to initialize earlier than ORB-SLAM and therefore creates on average a higher number of keyframes (the reader should compare the map point/keyframe metric, which shows how the brute force matching increases the number of 3D points and thus the tracking robustness). For this reason, this metric is marked as neutral.

  5. The increase in the map point/keyframe metric is positive, as it indicates that the keyframe insertion procedure is more effective at creating map points, lowering the chances of losing track in the incoming frame.

Note that all numbers listed depend directly on when the initialization occurs. If one algorithm initializes at the start of a sequence and another initializes in the middle of the sequence, the resulting numbers cannot be used for a direct comparison. For this reason, results where initialization occurs more than 5% of the total number of frames into the sequence are marked in red between parentheses in the median RMSE column. An example is the fr1_xyz sequence, where ORB-SLAM initializes roughly in the middle of the sequence.

The in-air results are presented in Table 1. UVS performs better than ORB-SLAM in terms of the median RMSE and does exceptionally well on the KITTI 09 sequence, where it is able to close the loop consistently. In the KITTI dataset, the overall number of map points created by UVS is slightly lower, since UVS creates a much lower number of keyframes (-18%).

An explanation of these results is that the keypoints matched using the brute force approach are more numerous and of higher quality, so fewer keyframes need to be created.

Regarding the TUM-RGBD dataset, we first have to inform the reader that the number of keypoints extracted is 1000 for both systems, instead of the usual 2000, following the original ORB-SLAM configuration file for this dataset. On this indoor dataset, the DBoW2-based matching in ORB-SLAM appears to struggle, especially compared to its performance on the KITTI dataset. This can be observed in the map point/keyframe metric, which is 12% higher utilizing UVS, a 3X improvement with respect to the results on the KITTI dataset. The noticeably lower number of keypoints translates into a more difficult initialization, which occurs in four of the sequences (cf. Table 1). A high correlation can also be observed between the number of map points and the median RMSE.

The underwater results in Table 2 show that UVS outperforms ORB-SLAM in almost every sequence, and all the summarized key indicators are positive. On average, UVS performs more than 2X better in terms of the median RMSE compared to ORB-SLAM. Especially good performance can be observed on the RTMVO dataset, where UVS is 3.5X closer to the ground truth than ORB-SLAM. The number of map points is also always higher, even when the change in the overall number of keyframes is negative, as in the Aqualoc dataset. In several Aqualoc sequences, UVS is able to close more loops than ORB-SLAM. In the second sequence of the RTMVO dataset, ORB-SLAM loses tracking and re-localizes, most likely caused by the high density of small fish swimming in front of the camera. LDSO has been run without enforcing any real-time constraints and without using the “fast” option, with automatic “crop” enabled, as suggested by the author. LDSO is not able to run on most of the Aqualoc sequences and performs poorly on the other sequences listed. LDSO is able to run on all the sequences of the RTMVO dataset, like UVS, but produces by far the poorest results in terms of the median RMSE. LDSO produces a much denser representation compared to ORB-SLAM, but it is not able to run in real time. A direct method like the one described in DSO/LDSO shows the potential to be utilized as a densifier, which would be beneficial for underwater robotic navigation and visual obstacle avoidance.

10.2 Three-view initialization

Similarly to ORB-SLAM, we analyze the initialization behavior for the sequence fr3_nostructure_texture_far.

Thanks to the three-view initialization, a unique solution is provided even for planar scenes. In this sequence, the UVS system always initializes correctly, around frame 19.

10.3 Motion model

We propose the following experiments to analyze the efficiency and the limitations of the motion model:

  • Case 1: The motion model correctly predicts the motion of the camera.

  • Case 2: The motion model wrongly predicts the scale of the movement.

  • Case 3: The motion model prediction is incorrect in terms of scale, translation, and rotation.

All the experiments were conducted on the KITTI 00 sequence, as it provides a simple and familiar way to visualize the VSLAM estimate. The results are presented in Table 3.

Regarding Case 1, we replace the original frames of the sequence (see the listing below) with fully black frames; we call this operation obfuscation. This is performed for three different parts of the sequence:

  • 20 frames starting from frame 8

  • 20 frames starting from frame 130

  • 20 frames starting from frame 252

These three sub-sequences have been selected because the motion of the camera in them can be approximated by a linear uniform motion, so the motion model predicts it correctly.

Regarding Case 2, we perform an operation similar to the one presented in Case 1, with the only difference that instead of obfuscating the original frames, we skip them; we call this operation disruption. By doing so, we simulate a situation where the tracking is lost and the motion model underestimates the scale of the movement several times during the initial part of the sequence, thereby putting the entire SLAM estimate at risk. Nevertheless, the median RMSE is only marginally higher (cf. Table 3). While the UVS system is able to recover and produce maps and poses of nearly identical quality to those generated by ORB-SLAM on the original sequence, ORB-SLAM produces less than 5% of the map and poses compared to UVS. A visual representation of the estimates is shown in Fig. 3.

Regarding Case 3, we perform the disruption of 120 frames starting from frame 8, which removes the first right turn from the sequence. The right turn also involves a change in the speed of the car, which results in a severely mispredicted position, orientation, and scale. This disruption is too severe for the system to recover from: as soon as a loop closure occurs that involves the missing turn, the essential graph optimization causes parts of the map to grow in scale, destroying the entire estimate.

A similar test is set up on sequence 5 of the Aqualoc dataset. In this sequence, we take into consideration the frames from frame 900 to frame 940. Similarly to the frames in the KITTI dataset, in these frames the camera movement resembles a rectilinear uniform motion, so the motion model predictions are close to the ground truth.

Table 1 Key performance indicators of UVS and ORB-SLAM on in-air sequences from the KITTI (outdoor) and the TUM-RGBD (indoor) dataset
Table 2 Key performance indicators of UVS and ORB-SLAM on underwater sequences from the Aqualoc and the RTMVO dataset

Similarly to the cases previously described, several cases are analyzed:

  • Case 4: The frames are obfuscated.

  • Case 5: The frames are disrupted.

  • Case 6: In addition to the first set of frames, another set of frames is considered for obfuscation, the frames between 2900 and 2940.

  • Case 7: In addition to the first set of frames, another set of frames is considered for disruption, the frames between 2900 and 2940.

In Case 4 and Case 5, the loop closure enables a valid map and trajectory, without any tangible effects on the accuracy. Cases 6 and 7 show a limitation of this system: the tracking is lost twice, resulting in two disconnected maps. The consequence is that, if a loop closure occurs, only one of the two maps benefits from it. A visual representation of these cases can be observed in Fig. 4.

Fig. 4
figure 4

Several UVS estimates of the Aqualoc 05 dataset: On the top is the XY view of the normal estimate of UVS, in the middle is the XY view (Case 4 and 5), on the bottom is the XZ view (Case 6 and 7)

Table 3 Results from different kinds of obfuscation and disruption of the KITTI sequence 00, showing the median RMSE

10.4 Keypoint matching

In the overall performance analysis, we observed how complete keypoint set matching allows the production of denser maps and therefore lowers the chances of tracking loss. A positive correlation between the number of map points and the overall accuracy is found. In Table 4 we observe that the matching using a brute force approach is only about 20% slower in terms of CPU time than the DBoW2-based matching used in ORB-SLAM.

Table 4 A comparison between UVS (Full) and ORB-SLAM (DBoW2) on the procedure to match keypoint descriptors, in terms of running time and amount of valid matches

10.5 Partial synchronization

Partial synchronization was not required for the in-air and underwater datasets. Its effectiveness comes into play when frames arrive in the system at a higher rate than the one at which tracking and keyframe insertion can occur.

Put simply, this situation happens when the frame rate is too high. An analysis of how the increase in fps affects the system is presented in Table 5. Even though the UVS system uses a keypoint matching procedure that is on average 30% slower, it clearly outperforms ORB-SLAM when the frame rate is increased, both in terms of the median RMSE and in terms of the ability to complete a sequence without losing the camera localization. When using a thread for each set of features to match, the frame rate can be increased even further, without a significant decrease in accuracy.

Table 5 This table shows how the partial synchronization in UVS allows it to run at a higher framerate than ORB-SLAM on the same hardware platform, with a relatively small cost in terms of accuracy

10.6 Station-keeping detection

Table 6 shows the in-air performance of the station-keeping detection. The proposed set of conditions (see Eq. 5) is able to detect when the car stops at crossroads. A plot of the evolution of the three station-keeping indicators in the KITTI 00 sequence is presented in Fig. 5.

Unfortunately, neither the Aqualoc nor the RTMVO dataset contains any sequence that resembles a station-keeping situation, so we resort to our previously described “Herkules Relict Front-Camera” sequence. In this sequence, station-keeping is detected, as the ROV is kept stationary by an operator while the camera is looking at the ship. In addition, the station-keeping is interrupted multiple times. The proposed station-keeping detection is able to detect the absence of motion and a persistent stationary state. At the same time, it allows the system to continue its operation as soon as the scene starts to evolve, without loss of tracking.

Table 6 Scale-agnostic detection of a pseudo-stationary situation in the KITTI dataset
Fig. 5
figure 5

Station-keeping indicators on the sequence 00 of the KITTI dataset, obtained with a station-keeping detection with \(n=10\). Station-keeping is detected when the car briefly stops at a crossroad. \(d_\theta =10\) to improve the visualization

10.7 Pruning

To evaluate the effectiveness of the pruning, we performed two different experiments. The first experiment aims to evaluate the impact of partially pruned keyframes on loop closing; to achieve this goal, full keyframe pruning is never performed, but partial pruning is always active (memory target of 0 megabytes). The second experiment aims to simulate a realistic situation, where pruning starts in the middle of the sequence.

To analyze the effects of the pruning, we present a table with the main indicators, Table 7, using one in-air and one underwater sequence. In addition, an exceptional case is presented, where instead of the previously described 66% map point pruning, the percentage is raised to 90%, see Fig. 6. While the number of map points decreases accordingly, no significant disturbance can be observed in the median keyframe RMSE, especially in the underwater dataset, where, thanks to the high scene overlap between frames, no decrease in accuracy is observed. Due to the adjustments to the loop validation of partially pruned keyframes, no loops are missed.

Table 7 Effects of map point pruning with different settings on the median RMSE, the global amount of map points and corrected loop closures
Fig. 6
figure 6

Left: Original SLAM estimate produced by the ORB-SLAM system. Center: UVS estimate, where a pruning of 66% of the map points is performed after half of the sequence has been processed by the system. No significant difference can be observed. Right: SLAM estimate produced by the UVS system, when partial keyframe pruning is set to 90% and the memory target is set to 0 megabytes. The loop closures occur in the same quantity and at the same position compared to ORB-SLAM, even with 66% fewer map points. However, the top-right part of the map is incorrectly rotated: sparsifying an already sparse representation can lower the quality of the final estimates. Note that this is a particular case, as many runs with continuous 90% pruning did not contain any trajectory alteration

Fig. 7
figure 7

Visual representation of the sequence Kjerringholmen North. A sub-sequence is selected, representing the typical, mostly rectilinear path of the LAUV

10.8 Kjerringholmen North dataset

The Kjerringholmen North dataset differs from the previously presented ones: it is recorded by an autonomous underwater vehicle instead of a manually controlled ROV, as in the Aqualoc and RTMVO sequences. As previously mentioned, most of the trajectories of the LAUV are rectilinear, see Fig. 7. While ORB-SLAM does not initialize at all, LDSO initializes but fails in the state estimate, see Table 8. In underwater imaging applications, it is common to use contrast limited adaptive histogram equalization (CLAHE) to improve the quality of underwater images, especially in low-light conditions.

The Kjerringholmen North dataset contains images recorded under poor light conditions. Therefore, this sequence represents a good example with which to analyze different CLAHE settings.

The CLAHE used in Table 8 is the one provided by OpenCV; in the table, the first element of each setting represents the clip limit and the second one the tile size. The table provides evidence of how a correct CLAHE setting could further improve performance.
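For reference, a minimal sketch of this pre-processing step with OpenCV's CLAHE implementation is given below; the default clip limit and tile size in the sketch are illustrative, not the specific values listed in Table 8.

// Sketch of CLAHE pre-processing with OpenCV: the first parameter is the clip
// limit and the second the tile grid size, matching the (clip limit, tile size)
// pairs reported in Table 8.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat applyClahe(const cv::Mat& gray, double clipLimit = 2.0, int tile = 8) {
    cv::Ptr<cv::CLAHE> clahe = cv::createCLAHE(clipLimit, cv::Size(tile, tile));
    cv::Mat equalized;
    clahe->apply(gray, equalized);   // contrast-limited adaptive histogram equalization
    return equalized;
}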

Table 8 Indicators of SLAM estimates for the Kjerringholmen North dataset

11 Conclusion and further work

11.1 Conclusion

We presented a series of modifications to the monocular ORB-SLAM system in order to achieve better performance in underwater environments in terms of robustness, loss of tracking, and lifelong operation. As this series of modifications and extensions is quite extensive, we defined a new name for the system: Underwater Visual SLAM (UVS). Several tests have been performed to show how these modifications impact the performance of ORB-SLAM, utilizing publicly available datasets, both in-air and underwater. Our system outperforms both ORB-SLAM and LDSO in underwater environments, both in terms of the median RMSE and in terms of the ability to close loops. Compared to ORB-SLAM and LDSO, UVS is able to operate in poorer light conditions and with lower scene overlap between frames, and it also tolerates loss of tracking, all without the need for any image pre-processing. Several generic issues with ORB-SLAM have been addressed, leading to a system that performs better even in-air.

11.2 Further work

Systems that are dependent on hardware resources, like every VSLAM system available today, could benefit from a benchmark dataset that aims to determine the maximum framerate for a particular hardware configuration.

In addition, integer overflow must be addressed. New map points and new keyframes use an integer-based unique ID. During long VSLAM executions, this integer will eventually overflow, so a proper strategy must be implemented to avoid this problem.

While the integration of ORB-SLAM with global navigation satellite system (GNSS) and IMU has been already investigated (Mur-Artal & Tardós, 2017b; Kiss-Illés et al., 2019), it would be interesting to investigate sensor fusion for ORB-SLAM regarding underwater-specific sensors, like a Doppler velocity log (DVL).

Most underwater cameras are integrated behind a flat port enclosure, which substantially affects the underwater 3D structure estimate and the relative camera motion estimate. A pinhole camera model correction map generator for underwater imaging, for cameras with flat port enclosures, has recently been presented (Łuczyński et al., 2017). Preliminary experiments regarding the correction model show that the water refraction index and the distance of the focal point of the camera from the flat port are very difficult to estimate. The model can be effectively calibrated by using a checkerboard placed underwater and performing a grid search on these parameters, minimizing the re-projection error.