1 Introduction

Simultaneous Localization and Mapping (SLAM) is a fundamental task in robotics that allows an agent to build a map of an unknown environment while, at the same time, tracks its position within this map (Cadena et al. 2016). SLAM systems typically rely on Loop Closure Detection (LCD) methods to identify when the robot has returned to a previously visited place (Stewart et al. 2002). As it is well known, the accumulated drift can be greatly reduced using loop closure information. Additionally, LCD is also used for robot relocalization after a tracking failure (Williams et al. 2011). Cameras have been fundamentally used in the last decade, among other sensors that can be adopted for LCD, due to the richness of imagery as well as the low cost of these devices nowadays. These methods are globally referred to as appearance-based LCD (Angeli et al. 2008; Milford and Wyeth 2012; Cummins and Newman 2011; Galvez-López and Tardos 2012; Khan and Wollherr 2015; Garcia-Fidalgo and Ortiz 2018; Bampis et al. 2018).

The performance of appearance-based LCD methods is mainly determined by: (1) the image description technique adopted; and (2) the ability of the approach to efficiently index previous images (Garcia-Fidalgo and Ortiz 2015). Concerning image description, hand-crafted point features have extensively been used for LCD tasks. Among these solutions, binary descriptors (Calonder et al. 2010; Rublee et al. 2011) have progressively emerged as an alternative to real-valued descriptors (Lowe 2004; Bay et al. 2006) given their demonstrated advantages in computational terms. More recently, the aforementioned techniques have successfully been replaced by methods based on Convolutional Neural Networks (CNN), which have proven to be more robust to illumination and viewpoint changes than hand-crafted features (DeTone et al. 2018). CNN-based methods have typically been used as holistic descriptors. Other approaches (DeTone et al. 2018; Dusmanu et al. 2019; Revaud et al. 2019) have been recently shown to be able to detect and describe simultaneously interest points in an image with adequate performance; none of them, however, produce binary descriptors. On the other side, while hand-crafted point features have shown impressive results working on well-textured scenarios (Mur-Artal and Tardós 2017), their performance decreases when dealing with low-textured scenarios, where the number of feature points detected results to be rather low (Pumarola et al. 2017). Precisely because of their nature, low-textured, human-made environments usually exhibit structural regularities that can be described using line features. All this has taken us to the combination of both visual perspectives, lines and point features, in order to enhance loop closing performance and widen the range of environments that can be addressed (Company-Corcoles et al. 2020). By way of illustration, Fig. 1 shows the benefits of combining both kinds of features in a low-textured scenario.

Fig. 1
figure 1

Lines can support point features in low-textured scenarios: (left) query image, (right) loop-closing candidate image selected by our approach. Points and line segments are labelled as circles and lines, respectively. Matchings found are indicated by the same color in both images

In reference to image indexing, arguably the most used scheme is the combination of the Bag of Words model (BoW) [50], Nister and Stewenius (2006) with an inverted file. In this model, descriptors are quantized into visual words, according to the available visual vocabulary. Images are then described by a histogram of occurrences of each visual word in the image. Depending on how this visual vocabulary is generated, the available solutions can be classified as either off-line or on-line (Garcia-Fidalgo and Ortiz 2015). The off-line approach needs a training phase to build the visual vocabulary, which can be particularly time-consuming, and besides makes the resulting solution highly dependent on the diversity of the training set (Nister and Stewenius 2006). As an alternative, the on-line approach builds the visual vocabulary incrementally, as the robot navigates, which helps to solve the aforementioned issues. A fusion strategy is additionally required to obtain a consolidated list of loop closure candidates when different sorts of features are involved, maybe by means of several BoW instances, one for each kind of feature (Zuo et al. 2017; Gomez-Ojeda et al. 2019; Han et al. 2021).

Standard BoW schemes do not account for the spatial distribution of features in the image, which tends to reduce their accuracy under severe perceptual aliasing conditions (Galvez-López and Tardos 2012; Garcia-Fidalgo and Ortiz 2018). Due to this reason, a final spatial verification step is usually performed to check consistency at the geometric level for the resulting correspondences between the query and candidate images. In this regard, the most popular approach consists in using RANdom SAmple Consensus (RANSAC) (Fischler and Bolles 1981) to check whether the image features obey a specific motion model between images. Typically, this loop closing validation step degrades in performance when dealing with either non-rigid image transformations or a high number of outliers. To tackle this problem, some recent approaches check the consistency between images for local neighborhood structures around the correspondences found (Lowry and Andreasson 2018; Ma et al. 2019; Zheng and Doermann 2006; Bian et al. 2017).

Under this context, we introduce an appearance-based loop closure detection algorithm, named LInes and POints for Low-Textured scenarios’ Loop Closure Detection (LiPo-LCD\(^{++}\)). In our approach, binarized CNN-based point descriptors and binary line descriptors are indexed using an incremental Bag-of-Binary-Words scheme and combined through a query-adaptive late fusion strategy in order to obtain an integrated list of loop closure candidates. After choosing an image as a final loop candidate, the loop closure is further validated performing a spatial verification procedure which naturally integrates point and line features. Our approach, as it is shown in the experimental results section, outperforms other solutions for generic environments, exhibiting a remarkable performance level for low-textured scenarios.

A preliminary version of LiPo-LCD\(^{++}\) was introduced in Company-Corcoles et al. (2020). In this paper we extend this work with the following contributions:

  • A binarization procedure to transform a CNN-based real-valued local descriptor (DeTone et al. 2018) into a binary descriptor. This binarization procedure allows us to combine the advantages of CNN-based feature descriptors with the benefits of a binary visual vocabulary.

  • A query-adaptive late fusion approach to merge loop candidates retrieved using lines and points independently. This fusion strategy permits us to adapt our system dynamically to the environment by weighing automatically the candidates according to the images contents in terms of lines and point features, and, thus, increasing its performance for a wider range of scenarios.

  • A novel spatial verification method which combines appearance and geometric information of feature points and lines. This method leads to the detection of a larger number of inliers when dealing with lines and attains higher accuracy in general. Moreover, we propose a fusion of feature points and lines involving local neighborhood structures. This allows us to increase the efficiency of usual spatial verification stages for, particularly, low-textured environments.

Additionally, we report on an extensive set of experiments to validate the performance, adaptability and effectiveness of LiPo-LCD\(^{++}\). As it is shown in the experimental results section, our solution compares favourably with other state-of-the-art methods in the field.

The rest of the paper is organized as follows: Sect. 2 overviews the most important works in the field; Sect. 3 describes the proposed approach, while Sect. 4 discusses on a set of experimental results to evaluate LiPo-LCD\(^{++}\) performance; finally, Sect. 5 concludes the paper and suggests some future research lines.

2 Related work

Appearance-based loop closure approaches can be categorized according to their image description method (Garcia-Fidalgo and Ortiz 2015). In this regard, some works (Sünderhauf and Protzel 2011; Milford and Wyeth 2012; Arroyo et al. 2014) have opted for global descriptors for image description. These descriptors are usually very fast to compute, though, typically, they are more sensitive to viewpoint and illumination changes than local features. Due to these reasons, point local descriptors, either real-valued (Cummins and Newman 2008; Angeli et al. 2008; Cummins and Newman 2011; Tsintotas et al. 2019) or binary (Galvez-López and Tardos 2012; Mur-Artal and Tardós 2014; Khan and Wollherr 2015; Garcia-Fidalgo and Ortiz 2018), have been widely used in the literature during last decades. More recently, approaches based on CNNs have emerged as an alternative, motivated by their demonstrated robustness to visual appearance changes (Chen et al. 2014; Sünderhauf et al. 2015; Chen et al. 2017; Kenshimov et al. 2017; Lopez-Antequera et al. 2017; Yue et al. 2019). These solutions typically employ CNNs to extract a global descriptor of the image, what makes more difficult the implementation of a spatial verification stage.

The BoW model [50], Nister and Stewenius (2006), used along with an inverted file, is the most used indexing scheme for appearance-based LCD (Garcia-Fidalgo and Ortiz 2015; Lowry et al. 2016). As discussed earlier, depending on how the visual vocabulary is generated, LCD approaches are classified as either off-line or on-line methods. Off-line methods (Cummins and Newman 2011; Galvez-López and Tardos 2012; Mur-Artal and Tardós 2014; Bampis et al. 2018) generate the vocabulary during a training phase. Conversely, on-line methods (Angeli et al. 2008; Labbé and Michaud 2013; Khan and Wollherr 2015; Garcia-Fidalgo and Ortiz 2018; Tsintotas et al. 2018; Tsintotas et al. 2019) build the visual vocabulary as images are received.

Sequence-based matching is a well-known technique that has proven to be useful in LCD approaches (Milford and Wyeth 2012). Recently, Garg and Milford (Garg and Milford 2021) introduced the concept of temporal matching in a hierarchical scheme. In their solution, a learning-based sequence descriptor is first employed to select a number of place candidates. A learnt single image descriptor is next used to decide the final match. Unlike these works, our approach employs two types of features and involves temporal information in the decision after selecting a set of candidates. Additionally, we do not rely on global descriptors.

Fig. 2
figure 2

General overview of the proposed approach

Within the context of LCD, binarizing CNN features has been adopted by multiple solutions. For instance, Arroyo et al. (2016) applies min-max normalization to the vector resulting from the concatenation of several convolutional layers. The final descriptor is next obtained by randomly selecting a specific set of features and thresholding on each component. More recently, Garg and Milford (Garg and Milford 2020) describe a coarse quantization-based hashing scheme, which employs PCA for dimensionality reduction. In Neubert et al. (2019), the authors propose a sparsified binary adaptation of Locality-Sensitive Hashing (LSH) using random projections, while other works opt for a CNN-based approach to directly generate a binary descriptor (Lin et al. 2016; Lin et al. 2019). The main difference between our approach and these solutions is that they result into a global binary descriptor for the whole image. Furthermore, unlike our solution, some of the aforementioned works (Arroyo et al. 2016; Garg and Milford 2020; Neubert et al. 2019) require a dimensionality reduction stage.

Some SLAM approaches (Pumarola et al. 2017; Zhang et al. 2019) have benefited from the combination of points and lines for image description. However, during the LCD stage, they only rely on point features, discarding line information that can be useful for some environments, unlike our approach (Company-Corcoles et al. 2020), and others that adopt a dual scheme for LCD (Gomez-Ojeda et al. 2019; Zuo et al. 2017; Han et al. 2021). Our approach resembles the idea of combining lines and points described in these works, though it differs in multiple aspects, such as (1) the point features that are employed, (2) how the visual vocabularies are built, (3) how image scores are calculated from the visual vocabularies and later combined, and (4) the spatial verification process. All these points are revised in the following:

  • As for the generation of the vocabulary, all the above-mentioned works rely on an off-line method which, as commented previously, makes them highly dependent on the diversity of the training set used. This dependence does not appear in our method thanks to the on-line vocabulary acquisition approach.

  • Moreover, all of them make use of ORB (Rublee et al. 2011) to detect and describe point features. Conversely, we use a CNN-based method which is more tolerant to appearance changes.

  • Different weighing schemes have been proposed to merge lists of candidates. For example, Zuo et al. (2017) use static weights for each visual vocabulary, which are empirically adapted for each dataset. Since this does not consider the importance of each type of feature for every specific query image, Han et al. (2021); Gomez-Ojeda et al. (2019) propose dynamic weighing methods which can be adapted to each query image. More precisely, Han et al. (2021) computes weights from the entropy of the features found in the query image. Contrarily, Gomez-Ojeda et al. (2019) calculates the dynamic weights according to the fraction of features found, along with their distribution in the query image. Furthermore, dynamic weighing is combined with static weighing. Despite their good performance, these approaches evaluate the importance of each visual feature during feature extraction, which might not correspond to the actual importance, derived from the scores resulting from image candidates retrieval, as it is done in LiPo-LCD\(^{++}\).

  • Finally, regarding the final spatial verification step, Zuo et al. (2017); Gomez-Ojeda et al. (2019) require a previous pose estimation, as well as depth estimates for the features involved. Others, such as Han et al. (2021), adopt the most widely used spatial verification approach, based on a RANSAC-based estimator of the fundamental matrix. We instead adopt a strategy based on analyzing the preservation of the topological relationships between neighbouring features. Unlike these approaches, we only use the appearance to validate loop closures. Furthermore, the proposed procedure can tolerate a high number of outliers and non-rigid camera motions.

3 Proposed approach

Figure 2 illustrates LiPo-LCD\(^{++}\). As can be observed, two sets of features, i.e. lines and points, are found in every new image that is processed. Each of these features is described by means of a binary descriptor, which is then used to: (1) obtain a list of the most similar images according to this visual cue; and (2) update the corresponding visual vocabulary. Next, the two lists are merged using a scored-voting procedure, giving rise to a combined list of loop-closing candidates. Subsequently, a temporal filter, based on the concept of dynamic islands (Garcia-Fidalgo and Ortiz 2018), groups those images close in time and prevent adjacent frames from competing among them as loop candidates. To validate the existence of a loop closure, the representative image of the best island is selected as the final loop candidate image and it is spatially assessed against the query image. We deal next with the details of every step.

3.1 Image description

As stated previously, LiPo-LCD\(^{++}\) uses points and lines to describe images. The rationale behind this is that the combination of multiple and complementary descriptions improves the performance and robustness of LCD methods (Hausler and Milford 2020). By way of illustration, textureless scenes may be more distinctively described using lines than only points, while other environments that lack structure can be better described by means of point features. To this end, in our solution, the image \(I_t\) at time t is described by \(\phi (I_t)=\left\{ P_{t}, L_{t}\right\} \), being \(P_{t}\) a set of local keypoint descriptors and \(L_{t}\) a set of line-segment descriptors.

3.1.1 Point description

Fig. 3
figure 3

Feature points found using: (left) ORB, 46; (center) SIFT, 61; and (right) SuperPoint, 87

Given the robustness shown by CNN-based methods, we use the SuperPoint approach (DeTone et al. 2018) to obtain a set of keypoints from the image under consideration. SuperPoint is a learning-based feature detection and description framework that is specially effective under illumination and viewpoint changes. Furthermore, compared to similar solutions such as R2D2 (Revaud et al. 2019) and D2-Net (Dusmanu et al. 2019), SuperPoint detects and describes features in an affordable time. In LiPo-LCD\(^{++}\), we use the default model provided by the authors, which was trained using a wide range of scenarios. This is to enhance the performance of our approach in a larger range of environments, in contrast to other CNN-based approaches, which are typically trained using a specific scenario. Figure 3 shows an example of how SuperPoint detects, in an indoor scenario, a higher number of features than other hand-crafted features like ORB (Rublee et al. 2011) or SIFT (Lowe 2004). Besides, we have observed that SuperPoint tends to find keypoints in a more uniform way throughout the image than other detectors.

SuperPoint, similarly to other recent CNN-based approaches for keypoint detection, produces real-valued descriptors (DeTone et al. 2018; Dusmanu et al. 2019; Revaud et al. 2019). In this work, we are interested in binary descriptors to take advantage of their well-known computational benefits. To this end, our proposal inspires on other binary description methods (Calonder et al. 2010; Rublee et al. 2011), which perform pairwise comparisons between pixels within an image patch. In our case, we compare the values of different components of the real-valued descriptor. Formally, we arrange a D-dimensional real-valued descriptor \(d \in \mathbb {R}^\text {D}\) as a concatenation of 8-component M subvectors: \(d = [d_0, \ldots , d_{M-1}]\), with \(d_j = [d_{j,0},\ldots ,\) \(d_{j,7}], j \in \{0,M-1\}\). In this way, each 8-component vector leads to an 8-bit string for each pair, which can be stored using a single byte. Given two subvectors \(d_x\) and \(d_y\) from d, their corresponding 8-bit string \(\beta \) is hence defined as

$$\begin{aligned} \beta (d_x, d_y) = \sum _{0\,\le \,i\,\le \,7} 2^{i} \tau \left( d_{x,i}, d_{y,i}\right) \,, \end{aligned}$$

where \(\tau \) is a comparison test given by:

$$\begin{aligned} \tau \left( d_{x,i}, d_{y,i}\right) = \left\{ \begin{array}{ll} 0 &{} \text {if}\ d_{x,i} < d_{y,i} \\ 1 &{} \text {otherwise} \end{array}\right. . \end{aligned}$$

The final binary descriptor \(d_b\) for d results from the 8-bit strings of a chosen set of N pairs \((\wp _{j,1},\wp _{j,2}) \in \{0,\ldots ,M-1\}\times \{0,\ldots ,M-1\}\) from the M subvectors of d, as follows:

$$\begin{aligned} d_b = \bigoplus _{j = 1}^{N} \beta (d_{\wp _{j,1}}, d_{\wp _{j,2}})\,, \end{aligned}$$

where \(\oplus \) stands for the concatenation of binary descriptors. The binarization procedure is illustrated in Fig. 4.

Fig. 4
figure 4

Overview of the descriptor binarization procedure. The different colours denote the same position into the 8-component vector/8-bit string. Darker and lighter colours correspond to, respectively, real and binary values. In the drawing, \(\beta _j = \beta (d_{\wp _{j,1}}, d_{\wp _{j,2}})\)

SuperPoint generates a 256-dimensional descriptor, i.e. \(D = 256\), resulting into \(M = 32\) subvectors \(d_x\), so that the set of \(\wp _{j}\) pairs leads to a binary descriptor \(d_b\) of \(N \times 8\) bits. In this work, we assess three criteria for selecting these pairs, since the chosen pairs affect the performance of the LCD approach. For each case, we generate \(N = 32\) and \(N = 64\) pairs, which result into, respectively, 256- and 512-bit binary descriptors. We briefly review each selection method in the following.

The first proposal applies a dimensionality reduction procedure to the original real-valued descriptor and, then, combines all-against-all possible pairs from the resulting subvectors. To this end, we make use of Gaussian Random Projections (Dasgupta 2013) to transform the original space into a simpler K-dimensional space. The conversion is performed by means of a projection matrix \(R \in \mathbb {R}^{D \times k}\). Each entry is independently sampled from a standard normal distribution N(0,1). In our experiments, we project SuperPoint descriptors onto 72 and 96 dimensions, which results into, respectively, 9 and 12 subvectors \(d_x\) per descriptor. We choose all the possible pairs on each case, respectively and , discarding four combinations in the first case –(0,1), (0,2), (6,8) and (7,8)– to reach the 32 pairs of the final binary descriptor, and discarding two combinations –(0,1) and (10,11)– to end with 64 pairs in the second case. There is no particular reason for discarding these pairs, except for the fact of involving the first and last components of the descriptor in both cases.

As a second proposal, we consider one of the approaches assessed by the BRIEF descriptor authors in Calonder et al. (2010), which choose randomly the N pairs. In the seminal work, this random criterion seems to provide better results than the other pair selection schemes considered. We thus generate two random patterns of 32 and 64 pairs from all the possible combinations of M subvectors.

As a third alternative, we propose a novel but simple method to produce the corresponding N pairs. To generate a 256-bit descriptor, each subvector \(d_k\) is paired with its neighbours \(\{d_{k - 1}, d_{k + 1}\}\). Similarly, the 512-bit descriptor is computed using the two nearest neighbours of \(d_k\), i.e. \(\{d_{k - 2}, d_{k - 1}, d_{k + 1}, d_{k + 2}\}\). For the 256-bit case, in practice, one has to consider pairs \((d_{k}, d_{k + 1})\), \(k = 0\,..\,30\), together with pair \((d_{31},d_0)\),Footnote 1 for a total of 32 pairs. For the 512-bit case, one has to consider the previous 32 pairs and pairs \((d_{k}, d_{k + 2})\), \(k = 0\,..\,29\), together with \((d_{30},d_0)\) and \((d_{31},d_1)\), for a total of 64 pairs.

As is shown in Sect. 4.2, this last approach is the one leading to the best performance. To finish, we denote the final set of binarized descriptors as \(P_t\).

3.1.2 Line description

Lines are detected in the image using the Line Segment Detector (LSD) (Grompone von Gioi et al. 2010). LSD is a linear-time high-precision line segment detector with subpixel accuracy without parameter tuning. However, LSD tends to break long line segments into shorter segments or even duplicated segments. The existence of these replicas often affects the matching procedure and increases the storage requirements. To minimize the incidence of these misbehaviours, we perform a line merging step after segment detection. Formally, each line segment k is represented by its start point \(\mathbf {s}_{k}\) and its end point \(\mathbf {e}_{k}\) in homogeneous coordinates. The direction vector l for segment k is thus given by:

$$\begin{aligned} l_{k}=\frac{\mathbf {s}_{k}-\mathbf {e}_{k}}{\left\| \mathbf {s}_{k}-\mathbf {e}_{k}\right\| }\,. \end{aligned}$$

For segment merging, we consider every possible pair of lines \((l_i, l_j)\). Lines are merged if the minimum Euclidean distance for the four possible combinations of end-points is smaller than a threshold and the angle between line-segments \(\theta _{i j}\), computed as

$$\begin{aligned} \theta _{i j} = \arctan {\left( \frac{\left\| \overrightarrow{l_{i}} \times \overrightarrow{l_{j}}\right\| }{\overrightarrow{l_{i}} \cdot \overrightarrow{l_{j}}}\right) }\,, \end{aligned}$$

is close to 0 or \(\pi \) radians.

Similarly to LiPo-LCD (Company-Corcoles et al. 2020), the lines found are described using the binary form of the Line Band Descriptor (LBD) (Zhang and Koch 2013), conforming the final set of binary line descriptors \(L_t\).

3.2 Searching for loop closure candidates

Similarly to our earlier work (Company-Corcoles et al. 2020), we make use of a hierarchical tree structure to efficiently retrieve loop closure candidates, what allows us to manage an increasing number of visual words, i.e. online operation. To this end, we employ OBIndex2 (Garcia-Fidalgo and Ortiz 2018) as an incremental BoW scheme for binary descriptors. In combination with an inverted index, it allows for fast image retrieval.

More precisely, LiPo-LCD\(^{++}\) maintains two instances of OBIndex2, one for point features and another one for line features. Each instance builds an incremental visual vocabulary jointly with an inverted index of images. For a given query image \(I_t\), a search is performed on each index to retrieve the most similar images with regard to each visual perspective. To this end, we search each visual feature descriptor within its corresponding dictionary to find the closest visual word. Next, using the inverted index, a list of images where the visual word has been observed is retrieved, and, for each one, a TF-IDF weight is added to the final score of the image. Candidate images are then sorted according to this final score. For a further explanation, the reader is referred to Garcia-Fidalgo and Ortiz (2018).

As a result, two lists are obtained: (1) the m most similar images using point features \(C_p^t = \{I_{p_{0}}^t, \ldots , I_{p_{m - 1}}^t\}\) and (2) the n most similar images using line features \(C_l^t = \{I_{l_{0}}^t, \ldots , I_{l_{n - 1}}^t\}\). Each list is sorted by the associated scores, respectively \(s_p^t(I_{t}, I_{j}^t)\) and \(s_l^t(I_{t}, I_{j}^t)\), which measure the similarity between the query image \(I_t\) and each candidate image \(I_j\). Since the range of these scores varies depending on the distribution of visual words in each vocabulary, we map them onto a common range [0, 1] by applying min-max normalization given by:

$$\begin{aligned} {\tilde{s}_{k}^{t}\left( I_{t}, I_{j}^t\right) = \frac{\displaystyle s_{k}^{t}\left( I_{t}, I_{j}^t\right) - s_k^t\left( I_{t}, I_{min}^t\right) }{s_k^t\left( I_{t}, I_{max}^t\right) -s_k^t\left( I_{t}, I_{min}^t\right) }, \;\;} \end{aligned}$$

where \(s_k^t\left( I_{t}, I_{min}^t\right) \) and \(s_k^t\left( I_{t}, I_{max}^t\right) \) correspond, respectively, to the minimum and the maximum scores of an image candidate list, being \(k \in \{p, l\}\). To keep control of the size of the list of candidates for a specific feature \(C_k^t, k \in \{p,l\}\), images whose normalized score \(\tilde{s}_k^t\) is lower than a predefined threshold are discarded. Finally, the respective visual vocabularies are updated accordingly to the set of point and line binary descriptors resulting from the current image.

3.3 Fusion of the lists of candidates

The next step is to integrate both lists \(C_p^t\) and \(C_l^t\), each one providing loop candidates from a different visual perspective, into a joint list of similar images \(C_{pl}^t\).

3.3.1 Overview

Techniques that combine multimodal information for image retrieval can be generically classified into early or late fusion approaches (Bhowmik et al. 2014). Early fusion refers to the combination of the information at the feature descriptor level, before being processed by a retrieval system. Conversely, late fusion works with the outputs of different retrieval systems. Regarding the latter, a common approach is to aggregate multiple ranked lists, corresponding to different visual features, using a function that provides a global confidence measure. This function can be implemented in several ways: (1) using a ranked-voting procedure that considers the position of each candidate in the list for each retrieval system; or (2) using a score-based method that combines the candidates by weighing them according to the importance of each visual feature in the query image. In a previous work (Company-Corcoles et al. 2020), we adopted a ranked-voting scheme where each candidate was scored inversely proportional to their position in the list. Despite its good performance, this scheme did not take into account the relevance, or even the existence, of each visual feature in the query image, preventing the use of information that could certainly be used to fit the approach with dynamic adaptation to each case.

To this end, in this work, we adopt a query-adaptive late fusion approach which exploits a score-based method to leverage the information coming from both sorts of visual features. Particularly, we propose a late fusion approach based on the work by Zheng et al. (2015), where the importance of a visual feature for a query image can be attributed to the shape of the curve of the resulting scores, sorted in descending order. To be more precise, they argue that this curve is L-shaped for a good feature, indicating that a small portion of the database images attain a high score, while a bad feature distributes the scores along more database images. LiPo-LCD\(^{++}\) accounts for this behaviour essentially by means of the calculation of the Area Under the Curve (AUC) of the scores curve, to compute adaptively the importance of each feature for each query image. The details for this computation can be found next.

3.3.2 Computation of feature relevance

Fig. 5
figure 5

(left) \(\widetilde{f}_p\) and \(\widetilde{f}_l\) curves, in, respectively, green and blue colour. The red segments correspond to the removed candidates for each case. (right) Selected and rejected frames to compute \(A_k\) on each case. Frame numbers are in black for hits and in red for failures, and as green/blue dots on the left plots (Color figure online)

Following the aforementioned ideas, we propose to compute the importance of each visual feature, for a given query image, according to the AUCs of the sorted scores of lists \(C_k^t,\ k\in \{p,l\}\). Formally, we denote \(f_k\) as the curve of the corresponding normalized scores \(\tilde{s}_k^t\), i.e. \(f_k(j) = \tilde{s}_k^t(I_t, I_{j}^t)\) for a specific image candidate j. As a first step, each curve \(f_k\) is analysed to detect and remove flat tails, i.e. detecting when the curve stops decreasing. To this end, we compute a linear approximation, from lower to higher scores, for each pair of consecutive points of the curve. Then, the analysis is finished when the magnitude of the resulting slope is above a threshold, discarding the remaining candidates and leading to a shorter curve \(\widetilde{f}_k\). Assuming that the final number of candidates is \(C_k\), its AUC, denoted by \(A_k\), is calculated by integrating \(\widetilde{f}_k\) using the composite trapezoidal rule using equally-sized partitions over the interval \([0, C_k-1]\):

$$\begin{aligned} A_k \approx \left( \sum _{n=1}^{C_k-2} \tilde{f}_k(n)+\frac{\tilde{f}_k(C_k-1)+\tilde{f}_k(0)}{2}\right) . \end{aligned}$$

Once the AUC for points \(A_p\) and lines \(A_l\) has been calculated, the weights for point features \(w_p\) and line features \(w_l\) are given by:

$$\begin{aligned} w_{p}=\frac{\displaystyle \frac{1}{A_p}}{\displaystyle \frac{1}{A_p} + \frac{1}{A_l}},\ w_{l} = 1 - w_{p}. \end{aligned}$$

We use the inverse of \(A_k\) to score higher smaller areas and so capture better the L-shape.

3.3.3 Combination of scores

Weights computed in the previous step are used to leverage scores of image candidates from the different visual perspectives, resulting into a global confidence measure. Among the different options considered to consolidate these scores in Zheng et al. (2015), we use the sum rule given its better tolerance to false positives. Therefore, we compute a joint similarity measure \(S^t(I_{t},I_j^t)\) between the query image \(I_t\) and a candidate image \(I_j^t\) as:

$$\begin{aligned} S^t(I_{t},I_j^t) = \tilde{s}_p^t(I_{t}, I_{j}^t)\cdot w_p + \tilde{s}_l^t(I_{t}, I_{j}^t) \cdot w_l. \end{aligned}$$

Finally, an integrated image list \(C_{pl}^t\) is obtained by sorting the resulting candidates according to their integrated scores \(S^t(I_{t},I_j^t)\).

Figure 5 illustrates the above-described merging procedure with an example. Curves \(f_p\) and \(f_l\) for, respectively, feature points and feature lines are shown on the left, while the respective images are shown on the right. The images involved in the calculation of \(A_p\) and \(A_l\) correspond to the ends of the blue and green segments of the \(\widetilde{f}_k\) curves (for, respectively, lines and points). Removed image candidates for each case are shown as the ends of the red segments. Black captions below the images on the left denote correct loop closures (according to the ground truth), labelled with a coloured dot on the left curves, whereas captions for non-correct loop closures are shown in red. As a result of the whole process, for this particular case, \(w_p = 0.46\) and \(w_l = 0.54\), indicating that the importance of each feature, for the example query image, is almost the same, as the \(\widetilde{f}_k\) curves already suggest. It can also be observed from this figure that discarded images are mostly incorrect loop closure candidates, while the selected frames to compute \(A_k\) are almost the same in both cases. It can also be noticed that, in this example, points are more affected by perceptual aliasing than lines.

3.4 Dynamic islands computation for loop candidates filtering

After merging the resulting lists \(C_{p}^t\) and \(C_{l}^t\) into \(C_{pl}^t\), we filter \(C_{pl}^t\) to prevent that images coming from the same area compete among them as final loop candidates. To this end, we make use of the concept of dynamic islands (Garcia-Fidalgo and Ortiz 2018). A dynamic island \(\varUpsilon _{n}^{m}\) groups images whose timestamps range from m to n. The size of an island is not fixed, but it depends on the similarities between neighbouring images and the camera velocity, to adapt to the specific image stream. For each query image \(I_t\), a set of islands \(\varGamma _{t}\) is calculated processing images in the list \(C_{pl}^t\) in ascending order by using the following procedure. Every image \(I_{i} \in C_{pl}^{t}\) that lies in the [mn] interval is associated to an existing island \(\varUpsilon _{n}^{m}\) ; in such a case, the time interval of the island is updated to accommodate \(I_t\) and some previous and posterior frames. \(I_t\) is associated to a new island created around time t if \(I_t\) does not fit in any existing island. Note that this procedure relies on a sorted list of images.

After processing all images in \(C_{pl}^{t}\), a global score g is computed for each island as:

$$\begin{aligned} g(\varUpsilon _{n}^{m}) = \frac{\displaystyle \sum _{i=m}^n S^t(I_t,I^t_{i})}{n - m + 1}\,, \end{aligned}$$

which is an indicator of how similar an area of the environment is with regard to the query image, taking into account the two visual perspectives. We also rely on the idea of priority islands, defined as the islands in \(\varGamma _{t}\) that overlap in time with the island selected at time \(t-1\), \(\varUpsilon ^{*}(t - 1)\). The rationale behind these islands is based on the fact that consecutive images should close loops with areas of the environment where previous images also closed a loop. Originally, the authors of Garcia-Fidalgo and Ortiz (2018) selected, as a final island, the priority island with the highest score g, if any. However, they did not consider the spatial arrangement of the features and the fact that perceptual aliasing might produce incorrect island associations, especially in human-made environments. For this reason, as an alternative, LiPo-LCD\(^{++}\) retains an island for the next time instant if the final selected loop candidate satisfies the spatial verification procedure, as explained in Sect. 3.5. Then, when the best island \(\varUpsilon ^{*}(t)\) is determined, the image \(I_c\) with the highest combined weighted score \({S^t}\) of \(\varUpsilon ^{*}(t)\) is selected as its representative and passed to the next spatial verification stage.

3.5 Spatial verification

One of the main drawbacks of BoW schemes is that they ignore the spatial arrangement of visual words in the image. Under severe perceptual aliasing conditions, this might lead to produce incorrect loop associations which can be detected by checking whether some minimal physical constraints on the camera motion are satisfied or not. Therefore, a spatial verification step usually takes place to validate the loop candidate. The most widely used approach in this regard assumes a rigid motion from the camera and verifies if the motion of image features is compatible with epipolar geometry computing the fundamental matrix within a robust estimation framework, e.g. RANSAC-based, and requiring a minimum number of inliers. However, RANSAC performance decreases, either when the number of outliers becomes large, or when the camera motion cannot be modeled by means of the fundamental matrix (Ma et al. 2019). Other approaches (Zheng and Doermann 2006; Bian et al. 2017; Lowry and Andreasson 2018; Ma et al. 2019) tackle these limitations by the evaluation of neighbourhood points. Among them, Grid-based Motion Statistics (GMS) (Bian et al. 2017) and Locality Preserving Matching (LPM) (Ma et al. 2019) perform the matching task in an affordable time. However, unlike GMS, LPM avoids that various motion models, particularly when the number of detected features is low, affect the performance. This is the reason why we adopt LPM as a key component in the spatial verification step of LiPo-LCD\(^{++}\).

LPM is based on the fact that the true matches from a list of candidate matches should maintain the spatial neighborhood relationships among image features, i.e. should preserve the topological structures of the scene. In this work, we modify LPM by incorporating line matches into the procedure.

3.5.1 Feature matching

The first step is to find a set of consistent matches between the query and the candidate images. In our approach, we perform a separate matching procedure for points and lines. On the one hand, matchings between the point descriptor sets \(P_c\), for a candidate image \(I_c\), and \(P_t\), for the query image \(I_t\), are validated using the Nearest Neighbor Distance Ratio (NNDR) test (Lowe 2004). However, in our experiments, this approach has not led to adequate performance for lines when using a restrictive ratio, especially in low-textured environments. Instead, we propose, for lines, to combine appearance and geometrical information into the matching process. Firstly, segments are matched by their appearance. Next, geometric properties of lines are used to further filter the initial set of matches. The details can be found in the following.

Given \(L_c\) and \(L_t\) as the sets of line descriptors for, respectively, a candidate image \(I_c\) and the query image \(I_t\), we first obtain an initial set of appearance-based line matches by applying the NNDR test, but with a more permissive threshold than for points. Next, we discard those matching pair segments whose lengths change significantly from one frame to the other. For this purpose, we calculate the ratio length \(\mu _{i j}\) for every candidate matching pair (\(l_{i}\), \(l_{j}\)), as

$$\begin{aligned} \mu _{i j}=\frac{\max \left( \left\| \mathbf {s}_{i}-\mathbf {e}_{i}\right\| , \left\| \mathbf {s}_{j}-\mathbf {e}_{j}\right\| \right) }{\min \left( \left\| \mathbf {s}_{i}-\mathbf {e}_{i}\right\| , \left\| \mathbf {s}_{j}-\mathbf {e}_{j}\right\| \right) }. ~ \end{aligned}$$

Matching candidates with similar lengths, i.e. \(\mu _{i j} \approx 1\), are accepted.

Table 1 Features of the datasets involved in the experiments

Subsequently, we asses whether the angle \(\theta _{i j}\) between the two line segments, calculated using Eq. 5, is approximately 0 or \(\pi \). To avoid that camera rotations between images affect this step, \(\theta _{i j}\) is compensated using the global rotation between frames, as performed in Zhang and Koch (2013). Surviving line matches are considered for the local neighborhood evaluation.

3.5.2 Local Neighborhood Consistency Assessment

Once the set of putative matches for each feature are generated, we assess their consistency for their local neighborhood. For this purpose, we propose a modification of the LPM algorithm, which we will name as LP-LPM from now on, as it combines points and lines. This combination has shown to be able to enhance the matching performance in low-textured environments, where the performance of LPM usually drops. As it is explained in Ma et al. (2019), the distance between two feature points may vary under viewpoint changes, but local neighborhood structures should be similar. The generation of these local neighborhood structures unfortunately degrades when the number of detected features is low. Combining points and lines however allows for an increasing number of features over the image and leads to a better estimation of such structures.

For a start, point matches are incorporated into the LP-LPM scheme as usual, while line matches are introduced as point correspondences relating their particular endpoints \(s_i\) and \(e_i\) for, respectively, the start and the end of the line segment. First, for every line matching pair (\(l_{i}\), \(l_{j}\)), we determine the resulting point correspondences according to the value of the compensated angle \(\theta _{i j}\). If \(\theta _{i j}\) is approximately 0, the line segments can be considered to have the same direction, and, therefore, we can add the point correspondences (\(s_i\), \(s_j\)) and (\(e_i\), \(e_j\)) to the LP-LPM framework. In case \(\theta _{i j}\) is approximately \(\pi \), the line directions result to be opposite vectors and, hence, pairs (\(s_i\), \(e_j\)) and (\(e_i\), \(s_j\)) are used as correspondences. Next, LP-LPM proceeds generating two sets of inliers. We consider a line match as an inlier if at least one of its matched endpoints has been regarded as an inlier by LP-LPM. Finally, the selected loop candidate \(I_c\) is validated as a loop closure if the total number of inliers, comprising the inliers for both features, is above a threshold. It is discarded otherwise.

4 Experimental results

In this section, we conduct a set of experiments on several publicly available datasets to evaluate the performance of LiPo-LCD\(^{++}\). We also compare our solution with a representative set of state-of-the-art approaches. All experiments have been run on an Intel Core i7-9750H (2.60 GHz) processor with 16 GB RAM. In addition, an Nvidia GeForce GTX 980 GPU has been used to execute SuperPoint,Footnote 2 using a Python implementation on TensorFlow.

4.1 Methodology

We have selected 8 public datasets, which correspond to different environmental conditions: CityCentre (Cummins and Newman 2008) (CC), EuRoC Machine Hall 05 (Burri et al. 2016) (EuR5), KITTI 00 (Geiger et al. 2012) (K00), KITTI 05 (Geiger et al. 2012) (K05), KITTI 06 (Geiger et al. 2012) (K06), Lip6 Indoor (Angeli et al. 2008) (L6I), Lip6 Outdoor (Angeli et al. 2008) (L6O) and the Malaga 2009 Parking 6L (Blanco et al. 2009) (MLG). Table 1 provides a summary of these datasets. For each one, we use the ground truth provided by the original authors except for the KITTI sequences, for which we use the data referred by Arroyo et al. (2014), and the EuR5 and MLG datasets, for which we use the data available from Tsintotas et al. (2019).

Precision-recall (PR) metrics are used to evaluate the global performance of every approach. Given that the aim of an LCD method is to be used in a real SLAM solution, false positives are considered critical. For this reason, we are interested in the maximum recall at 100% precision.

Regarding LiPo-LCD\(^{++}\) parameters, we have used either default values, or values which experimentally have resulted in good performance. OBIndex2 and dynamic islands have been configured as explained in Garcia-Fidalgo and Ortiz (2018), as well as in our previous work (Company-Corcoles et al. 2020). The maximum number of features for SuperPoint has been set to 1500. Line detection and description methods have been executed using the default parameters. The binarization of SuperPoint descriptors, apart from the desired size of the final descriptor, does not require any additional setup. Regarding the query-adaptive late fusion stage, the threshold for the segment slopes has been set to 0.025. We have also limited the maximum value of weights \(w_p\) and \(w_l\) to 0.8. Concerning the geometric spatial verification stage, 0.8 and 0.95 have been used as NNDR values for, respectively, points and lines. Line matches have been discarded if the ratio length \(\mu _{i j}\) is above 2.5, and in case \(\theta _{i j}\) deviates more than \(30^\circ \) from 0 or \(\pi \). LPM runs using the default parameters provided by the authors.

4.2 Effectiveness of the binary descriptors

Table 2 Maximum recall at 100% of precision for the 7 descriptors considered. Winners are indicated in bold face

In this section, we assess the effectiveness of the binarization procedure. We empirically evaluate the three configurations for selecting pairs introduced in Sect. 3.1.1, denoted by, respectively, All (A), Random (R), and Neighbors (N). For each configuration, we consider the two possible versions of 32 and 64 pairs, resulting into binary descriptors of 256 and 512 bits, respectively. Additionally, we also consider a less sophisticated method that quantizes a 256-dimensional SuperPoint descriptor by thresholding each real-value component, hence resulting into a 256-bit descriptor. The threshold for each component is chosen as the average value of that component in the processed dataset. In the following, we denote this method as Q-256.

The L6I and the EuR5 datasets have been chosen for this experiment. L6I is a full indoor scenario which can be considered as a low-textured environment with high perceptual aliasing. Conversely, EuR5 is also a low-textured indoor environment which presents scenes under severe illumination and scale changes. To measure the effectiveness of each configuration, we compute the maximum recall that can be achieved by the whole system maintaining precision at 100%. Results are shown in Table 2. The highest recall has been obtained for the N-512 configuration for both datasets. As it is shown later, the average time difference required per frame using the N-256 and the N-512 configurations is approximately 7% in the EuR5 dataset. Taking into account both performance perspectives, we decide to adopt the configuration leading to best PR.

4.3 LCD performance breakdown

In this section, we evaluate the net effect of the proposed contributions to the final LCD performance. More precisely, we compute what is the maximum recall obtained by LiPo-LCD\(^{++}\) at 100% of precision, but using different alternatives on each of the steps of our pipeline. Results are shown in Table 3. For the Feature Extraction (FE) stage, we consider ORB and our binarized SuperPoint descriptor (b-SP) as point descriptors, along with the LBD descriptor for lines. Regarding the Late Fusion (LF) options, we consider: (1) our previous Borda Count-based system (Company-Corcoles et al. 2020) and (2) the Query-Adaptive Late Fusion (QALF) method proposed in this work. Finally, RANSAC and the proposed LP-LPM are the choices studied for the Spatial Verification (SV) stage. The L6I dataset is selected for this experiment because it only contains indoor scenes and therefore it can be regarded as the most low-textured dataset considered in this work. As it can be observed, every contribution improves a specific part of the pipeline, which results into a slightly increase of the final recall. Overall, the new pipeline exhibits enhanced performance with regard to our previous contributions: 11.16% over the recall reported in Garcia-Fidalgo and Ortiz (2018), where only points were used, and 9.1% that reported in Company-Corcoles et al. (2020), which involved both point and line features.

Table 3 Maximum recall at 100% of precision for L6I. \(\dagger \) and \(\diamond \) respectively refer to the solutions described in Garcia-Fidalgo and Ortiz (2018) and Company-Corcoles et al. (2020)

4.4 General performance

This section focuses on the global performance of LiPo-LCD\(^{++}\). We measure the accuracy of loop closure detection, and we report on the computational times as well as on the evolution of the vocabulary size.

4.4.1 LCD accuracy evaluation

Figure 6 reports on the accuracy of loop closure detection of LiPo-LCD\(^{++}\) for all datasets. As usual, we employ PR curves, which result from modifying the threshold on the total number of inliers required to accept a loop. As can be observed, LiPo-LCD\(^{++}\) exhibits a stable behaviour for all datasets, achieving high recall values while keeping precision at 100%.

Fig. 6
figure 6

PR curves for each dataset. In all plots, P is 1 for R below 0.75 

4.4.2 Computational times

Average computational times are summarized in Table 4. These values have been computed from the K00 dataset, since it is the largest one considered in this work. Feature Extraction (FE) includes line and point detection and description, as well as the proposed binarization procedure for points using the N-512 configuration. On average, point FE requires about 9.58 ms per image, from which only 1.96 ms corresponds to the binarization of point descriptors. These times are remarkably low in contrast to the 457 ms required by, for instance, D2-Net (Dusmanu et al. 2019) and the 542 ms of R2D2 (Revaud et al. 2019), involving only feature detection and the generation of real-valued descriptors. The Vocabulary Update (VU) and Search for Candidates (SC) steps are usually slower for points than lines, due to the higher number of descriptors that have to be handled. Times required to merge candidate lists and select dynamic islands are included in SC because they are almost negligible. Finally, the Spatial Verification (SV) step only makes sense when both points and lines are jointly computed. The average response time per image of the whole pipeline turns out to be 406.52 ms in a multi-thread implementation, where points and lines are each processed within a separate thread.

Table 4 Average response time (ms) per image, calculated for each stage of the pipeline. These times were computed using the K00 dataset

The evolution of computational times as more images are processed is additionally illustrated in Fig. 7. We also use the K00 dataset for this assessment, for the same reason as above. As can be observed, times for VU and SC remain relatively stable over time (remember that our approach implements a dual incremental BoW scheme). SV times grow occasionally due to the amount of inliers required to validate these candidates within the LP-LPM scheme. If required for a specific application, the number of candidates can be reduced.

Fig. 7
figure 7

Evolution of the computational times for each part of the pipeline using the K00 dataset

4.4.3 Evolution of the vocabulary size

As stated previously, LiPo-LCD\(^{++}\) makes use of two vocabularies, one for point features and one for line features, that store visual words of, respectively, 512 and 256 dimensions. Figure 8 shows that (1) globally the number of visual words grows as more images have been processed, and (2) the amount of line features is considerably lower than the number of point features, both (1) and (2) as expected. (The L6I dataset is missing in the plot because of the significantly lower number of images it consists of.)

Fig. 8
figure 8

Evolution of the vocabulary size for each feature. The amounts of point and line features are respectively denoted by continuous and dashed lines

Fig. 9
figure 9

Comparison of vocabulary size and running times for different strategies to reduce the computational requirements of LiPo-LCD\(^{++}\). Default corresponds to the proposed approach using the N-512 descriptor and 1500 point descriptors per image. MSR stands for Map Size Reduction, while the rest of cases refer to the maximum number of point descriptors requested per image

4.5 More on the computational requirements

In this section, we evaluate the effect of two strategies to save computational resources on the performance of LiPo-LCD\(^{++}\), to contemplate the possibility of running it on computationally-restricted hardware (otherwise, one may be interested in achieving the highest performance in terms of recall). More precisely, among other alternatives (Zaffar et al. 2020), we first assess how reducing the map affects the performance of LiPo-LCD\(^{++}\). To this end, we consider a simple method, denoted by Map Size Reduction (MSR), which, using the N-512 configuration as a basis, processes one out of every two images of the input dataset. A second strategy focuses on limiting the vocabulary size. For this purpose, we reduce the number of point features extracted from each image using the relevance measure available for SuperPoints; we consider 1000, 750, 500 and 250 features per image. Regarding the vocabulary for line features, we do not care for its size since it is negligible in contrast with that of point features. The results in terms of vocabulary size and computational times are detailed in, respectively, Fig. 9a and b. The corresponding recall values at 100% of precision for the different datasets can be found in Table 5.

As can be observed in Fig. 9a, the vocabulary size required by MSR is similar to the one resulting when using 500 point descriptors. However, the computational times reported in Fig. 9b demonstrate that MSR leads to similar running times than using 1000 or 1500 point descriptors per image.

Regarding the reduction of the map size reported in Table 5, it can be observed that the recall is dependent on the frame rate/camera speed, e.g. for datasets with higher frame rate or low camera speed, better performance is exhibited. On the other side, the recall tends to decrease when using less point features per image. However, despite this reduction, results for K00, K06 and L6O keep similar for all configurations. Another relevant observation is that using 750 or 1000 descriptors per image achieves similar performace than using the default configuration, i.e. 1500 descriptors, in most cases. Regarding L6O, the number of features that are detected is no more than 700, this is the reason why the recall is the same for the 750, 1000 and 1500-point features cases. As for L6I, no more than 250 features can be found, reason by which it is missing in Table 5.

Table 5 Maximum recall at 100% of precision for different configurations regarding computational complexity. Default corresponds to the proposed approach using the N-512 descriptor and 1500 point descriptors per image. MSR stands for Map Size Reduction, while the rest of cases refer to the maximum number of point descriptors requested per image

4.6 Comparison with other solutions

Table 6 Maximum recall at 100% precision for several off-line approaches (top), on-line approaches (middle) and LiPo-LCD\(^{++}\) (bottom). The best and the second best scores are shown in bold and italic, respectively

In this section, we compare the performance of LiPo-LCD\(^{++}\) against other state-of-the-art solutions. The maximum recall achieved at 100% of precision for each approach is summarized in Table 6. The reported results come from the original works, except for Gomez-Ojeda et al. (2019), which was executed by ourselves using the vocabularies and the default parameters provided by the authors, and Tsintotas et al. (2018); Tsintotas et al. (2019, 2021) for the L6I dataset. Non-available recall values are reported as n.a. Approaches that do not reach 100% precision for a specific dataset are indicated by a dash (−).

As can be observed, the combination of points and lines achieves the highest recall for almost all datasets. This is specially evident for the case of L6I and Eur5, which are regarded as the most low-textured scenarios considered in this work, which actually is the focus of our approach. Furthermore, the incorporation of lines in an LCD method not only improves the performance in low-textured scenarios, but also in the rest of datasets where line segments are noticeable. This can be observed when comparing either LiPo-LCD\(^{++}\) or Company-Corcoles et al. (2020), which also combines point and line features, with the method proposed in Garcia-Fidalgo and Ortiz (2018), where only point features are used. Notice that LiPo-LCD\(^{++}\) improves the performance attained by our previous solution (Company-Corcoles et al. 2020) for these low-textured datasets, and provides a competitive performance for the remaining datasets. Furthermore, LiPo-LCD\(^{++}\) obtains better recall values almost for every dataset compared to the solution by Yue et al. (2019), which describes the images using a non-binarized version of the SuperPoint descriptor. Despite LiPo-LCD\(^{++}\) achieves better recall values than our previous solution in the MLG dataset, the performance attained is rather low with regard to other more successful approaches. We have observed that this can be the result of wrong line detections between consecutive frames, which are due to the bad quality of the images and the corresponding perceptual aliasing that occurs among those lines, leading to a subsequent decrease in the pipeline performance. Notice that our proposal also outperforms (Gomez-Ojeda et al. 2019; Han et al. 2021), which are the solutions most similar to LiPo-LCD\(^{++}\), given that they are the only ones that combine points and lines for loop closure detection.

To finish, in Table 7, we compare the vocabulary size of LiPo-LCD\(^{++}\) with other state-of-the-art solutions in terms of memory consumption. As expected, the requirements for LiPo-LCD\(^{++}\) are higher than the ones for a method that only uses points (Garcia-Fidalgo and Ortiz 2018). This is obviously because of the two visual vocabularies, but also due to the fact that our binary point descriptor is 512-dimensional, which is larger than the ORB descriptor used by Garcia-Fidalgo and Ortiz (2018). Notice also that limiting the number of point features to 1000 - 250 we reduce the gap as for memory requirements with regard to Tsintotas et al. (2021) without compromising much our LCD performance (as reported in Table 5).

Table 7 Memory consumption (Mb) for several state-of-the-art approaches. Lowest values are shown in bold

5 Conclusions and future work

This work has introduced LiPo-LCD\(^{++}\), an appearance-based LCD approach that combines point and line features to increase the performance in low-textured environments. To search for loop candidates, our approach is based on two instances of an incremental BoW scheme specifically devised for binary descriptors, one instance for each feature. Besides, we develop a binarization procedure to incorporate SuperPoint into LiPo-LCD\(^{++}\) as a feature point extractor, given the advantages of CNN-based methods recently reported. A query-adaptive late fusion approach is also adopted to merge lists of candidates obtained from the two visual perspectives. This fusion method is dynamically adapted to the current operating scenario, leveraging the final candidate scores according to the presence or absence of each kind of feature. Finally, we introduce a spatial verification method, which jointly validates the appearance and the geometric consistency of points and lines together. LiPo-LCD\(^{++}\) compares favourably with several state-of-the-art approaches for different public datasets and hence different environmental conditions.

Regarding future work, we plan to incorporate LiPo-LCD\(^{++}\) into a SLAM framework and analyze its performance in low-textured scenarios. We also plan to explore the use of other geometric entities, e.g. planes, for loop closure detection.