1 Introduction

Understanding 3D point clouds has become increasingly important with the rise of robotics technologies such as augmented/virtual/mixed reality and autonomous vehicles. Autonomous driving, for instance, allows vehicles to sense and respond to their environment without human intervention. However, system safety relies heavily on accurate perception and localization of the environment. Simultaneous Localization and Mapping (SLAM) [1] plays a critical role in the perception and planning pipeline of autonomous vehicles by constructing a map of the surrounding environment and localizing the vehicle within it. Visual/LiDAR odometry estimation [2,3,4,5,6] is an essential component of a SLAM system, aiming to estimate the robot's pose from consecutive point clouds. Moreover, large-scale data-driven applications, such as LiDAR semantic segmentation [7, 8] and odometry estimation [6], empower advanced robotics technologies. Similarly, the use of saliency information in 2D computer vision tasks such as image translation [9], object detection [10] and tracking [11, 12], key-point selection [13, 14], and person re-identification [15,16,17] has led to state-of-the-art results by capturing the predominant information in a scene. However, the unstructured, unordered, and density-varying nature of point clouds makes it difficult for conventional point-cloud-based methods to process informative visual features effectively and rapidly in large-scale scenes. To address this challenge and enhance the performance of real-time autonomous vehicles, several works [18,19,20] have explored saliency detection algorithms for point cloud tasks, showing that integrating efficient saliency knowledge can further enhance the performance of 3D point cloud understanding.

Fig. 1

Visualization examples of SIFT [21] key points and saliency maps for two consecutive frames. From left to right: RGB images, SIFT [21] key points and saliency maps from our FordSaliency dataset, and point clouds registered on the images with saliency values. Saliency values are scaled to the range [0, 1]; the colour bar indicates the saliency values of the heat map fused on the colour images

In LiDAR odometry estimation, keypoint selection is often used to facilitate the learning of matching features by the model [22,23,24,25]. SIFT-based approaches, such as LodoNet [22], extract matched keypoint pairs, which are then used to learn point-wise features in PointNet [26]. Alternatively, saliency-based point selection methods, such as that used by SalientDSO [23], demonstrate the potential benefits of incorporating saliency information into visual odometry. As shown in Fig. 1, while SIFT key points [21] and the salient regions of saliency maps can both detect significant and consistent landmarks in a scene (e.g., buildings and traffic signs), saliency maps offer a more continuous and soft indication of attentive probabilities, unlike the sparse and discrete nature of key points. Thus, integrating saliency information has the potential to improve odometry estimation. However, despite significant progress in LiDAR odometry estimation, challenges remain, particularly in crowded environments where moving objects can introduce noise and occlusions [27,28,29,30]. To address this issue, some odometry methods [27,28,29,30] use semantics to mitigate the adverse effects of moving object regions/points in the input data. Static objects provide a stable and consistent reference for geometry-based matching, which is critical for successful pose estimation. For example, early Iterative Closest Point (ICP) [31, 32] based odometry models [33, 34] estimate the transformation iteratively by minimizing matching errors between corresponding points of two scans.

Fig. 2

Overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. The 2D image saliency knowledge of RGB saliency models is transferred to 3D point clouds. The resulting 3D point cloud saliency knowledge is then used for attention-guided 3D point cloud understanding tasks, such as 3D semantic segmentation and LiDAR odometry estimation

In this paper, we focus on improving LiDAR odometry estimation and 3D semantic segmentation by learning robust and discriminative features under saliency information constraints. Specifically, we propose a saliency-guided 3D semantic segmentation method that exploits saliency cues to facilitate robust feature learning. We also propose a saliency-guided LiDAR odometry approach that leverages attention information and semantics to improve performance. Figure 2 illustrates an overview of the proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided LiDAR semantic segmentation and odometry estimation models.

Table 1 Comparison of existing saliency detection datasets on point clouds

Several attempts have been made to find effective solutions for saliency detection on point clouds [13, 18,19,20, 38]. Table 1 compares the existing saliency detection datasets on point clouds. Most challenges of saliency detection on point clouds are yet to be explored. First, previous saliency methods such as [18, 20] operate on mesh data of 3D objects, where scenes are less complicated and contain only a few background points. Second, due to the lack of human-annotated training datasets, a supervised learning scheme for saliency detection on point clouds is infeasible. Similar to the challenges faced in 3D object detection, where approaches have been proposed to reduce dependence on extensive supervision due to the laborious and costly nature of manual labeling [39, 40], it is highly desirable to develop a practical deep learning pipeline for saliency prediction on large-scale point clouds. This study is an extension of our previous work [38] on point cloud saliency prediction and 3D semantic segmentation. In [38], we designed a universal framework and a point cloud saliency dataset (FordSaliency) to transfer saliency distribution knowledge to point clouds. An attention-guided two-stream network was then proposed to improve the accuracy of the LiDAR semantic segmentation task. The first stream is a LiDAR-based saliency network trained on the FordSaliency dataset that guides the segmentation task. The second stream is a segmentation module that predicts the semantics of the input point cloud [38].

In this work, we propose a saliency-guided deep self-supervised odometry model that combines the saliency and semantic predictions of SalLiDAR [38] for LiDAR odometry estimation. In brief, the key contributions of this work can be summarized as follows:

  • We propose a saliency-guided LiDAR odometry estimation model trained in a self-supervised manner. The proposed odometry model consists of three modules: a saliency module, a semantic module, and an odometry module. In [38], we used point cloud saliency to improve semantic segmentation predictions on point clouds. In this work, we not only utilize point cloud saliency but also incorporate point cloud semantic segmentation predictions into the self-supervised odometry module. The motivation is to focus on attentional static regions of the scene during the training of the odometry module, while avoiding dynamic objects that could hinder relative pose estimation between two sequential point cloud scans.

  • To mitigate the adverse effects of dynamic points on the LiDAR odometry model, we binarize the semantic map into dynamic and static points using the semantic labels defined in the SemanticKITTI dataset [41]. The point cloud, along with the binarized semantic map and saliency map, is then fed into the odometry module for feature learning.

  • To prioritize salient static points for point cloud matching, we propose a saliency-guided odometry loss that utilizes the saliency and binarized semantic maps to regularize the odometry module. This helps the module focus more on attentive points and improves the accuracy of point cloud matching. Extensive experiments on benchmark datasets show that the proposed two-stream LiDAR odometry model with saliency and semantic knowledge improves odometry estimation and achieves better performance than existing methods.

2 Related works

2.1 LiDAR odometry estimation

Recently, deep learning-based odometry models [5, 6, 42, 43] have been proposed to predict pose by learning richer features with powerful convolutional modules. In PWCLO-Net [6], Wang et al. propose a deep LiDAR odometry approach based on hierarchical embedding mask optimization, where a warp-refinement module with an attentive cost volume structure refines the estimated pose in a coarse-to-fine manner; the attentive cost volume is used for the association between two point clouds. In DeLORA [42], Nubert et al. present a self-supervised LiDAR odometry model for pose regression without any ground-truth labels. Two consecutive range images converted from raw LiDAR point clouds are fed into the DeLORA [42] model to output a rigid-body transformation; the geometric transformation is then applied to the source LiDAR scan and its normal vectors to obtain the transformed scan and transformed normal vectors. Afterward, a point-wise geometric loss between the transformed scan and the target scan is calculated to guide the model to learn geometry-specific features, thus generating a transformation that matches the transformed and target scans as closely as possible [42]. In this paper, we focus on deep learning-based LiDAR odometry research, which has achieved great progress in recent years [3, 5, 43, 44].

2.2 Saliency detection on point cloud

Saliency detection aims to find the most eye-attracting locations in a visual scene, and can be traced back to the pioneering work of Itti et al. [45]. With rapidly emerging advances and applications of deep learning techniques, saliency detection on color images/videos [46] has made great progress in recent years. There are also several works [13, 14, 18,19,20, 35] on saliency computation for point clouds. For example, Ding et al. [18] propose a 3D mesh saliency calculation method that fuses local distinctness and global rarity features. Tinchev et al. [14] present a key point detector on point clouds based on saliency estimation: they compute the gradient response of a differentiable convolutional network to obtain the saliency map, and then use multiple fully connected layers to combine the saliency feature, the point cloud context feature, and PCA features of point descriptors [14]. Zheng et al. [20] present a saliency computation method using a loss gradient approach that approximates point dropping by differentiably shifting points towards the point cloud centre. However, saliency methods focusing on 3D meshes or indoor scenes are limited in their ability to process large-scale 3D point clouds such as 3D driving data. Moreover, saliency models that extract handcrafted descriptors may miss informative representations for point clouds with varying density and complex backgrounds in outdoor scenarios.

2.3 LiDAR semantic segmentation

LiDAR semantic segmentation [7, 8, 26, 47,48,49] is a crucial 3D computer vision task for autonomous driving, which aims to predict the semantic class of each point in a LiDAR scan. As a pioneering point set-based method, PointNet [26] uses multi-layer perceptrons (MLPs) to learn point-wise features for classification and segmentation. RandLA-Net [47] randomly samples the input point cloud and employs a local feature aggregation module to compensate for the information loss introduced by random sampling. Considering the range property of LiDAR point clouds, Cylinder3D [8] leverages a cylindrical partition for 3D semantic segmentation and introduces an asymmetrical encoder-decoder built on 3D sparse convolutional networks for voxel-based features. PVKD [7] achieves state-of-the-art 3D semantic segmentation performance by applying a point-to-voxel knowledge distillation strategy to the Cylinder3D [8] model. With RPVNet [49], the authors present a multi-modality fusion model that combines range-based, point-based, and voxel-based representations with a gated fusion module for LiDAR semantic segmentation.

3 Proposed framework

3.1 Problem formulation

Given an input point cloud \({\mathcal {P}}\)={\(p_{i} |\) i=1, ..., N, \(p_{i} \in {\mathbb {R}}^{d}\)} consisting of N unordered points, where each point \(p_{i}\) contains d-dimensional features such as point coordinates (xyz), colors (rgb), reflectivity, and normals, the objective of a saliency detection model on the point cloud is to predict the saliency score map \({\mathcal {S}}\)={\(s_{i} |\) i=1, ..., N, \(s_{i} \in [0, 1]\)}, where \(s_{i}\) denotes the saliency score of point \(p_{i}\). After normalizing the saliency prediction, the closer the saliency score \(s_{i}\) is to 1, the more attentive the point \(p_{i}\). The goal of the 3D semantic segmentation task is to predict the semantic class map \({\mathcal {C}}\)={\(c_{i} |\) i=1, ..., N, \(c_{i} \in {\mathbb {R}}\)}, where \(c_{i}\) indicates the semantic category of point \(p_{i}\).

The objective of this work is to establish a self-supervised LiDAR odometry estimation model that is guided by saliency and semantic constraints and can be trained without ground-truth poses. To achieve this goal, the model takes as input two consecutive LiDAR point clouds \({\mathcal {P}}_{t}\) and \({\mathcal {P}}_{t-1}\) at times t and \(t-1\), where each point p contains d-dimensional point-wise features, such as the point coordinates (xyz), the range feature r, the semantic feature c, and the saliency feature s. The odometry model estimates a rotation \({\textbf{q}}\) corresponding to a \(3 \times 3\) rotation matrix \({\textbf{R}} \in SO(3)\) and a \(3 \times 1\) translation vector \({\textbf{t}}\), where \({\textbf{R}}\) and \({\textbf{t}}\) compose the relative rigid transformation \({\hat{T}}_{t-1, t} \in SE(3)\) between point clouds \({\mathcal {P}}_{t}\) and \({\mathcal {P}}_{t-1}\). The point cloud \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {P}}}_{t-1}\) in the coordinate system of \({\mathcal {P}}_{t-1}\) by the transformation \({\hat{T}}_{t-1, t}\):

$$\begin{aligned} \hat{{\mathcal {P}}}_{t-1} = {\hat{T}}_{t-1, t} \odot {\mathcal {P}}_{t} \end{aligned}$$
(1)

where \(\odot\) represents the point-wise matrix multiplication. Afterward, the point-wise matching loss between \(\hat{{\mathcal {P}}}_{t-1}\) and \({\mathcal {P}}_{t-1}\) can be calculated to train the odometry model, thereby forcing the model to predict an optimal transformation \({\hat{T}}_{t-1, t}\). Also, the normal vector \({\mathcal {N}}_{t}\) of \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {N}}}_{t-1}\) in the coordinate system of \({\mathcal {P}}_{t-1}\) by the transformation \({\hat{T}}_{t-1, t}\):

$$\begin{aligned} \hat{{\mathcal {N}}}_{t-1} = rot({\hat{T}}_{t-1, t}) \odot {\mathcal {N}}_{t} \end{aligned}$$
(2)

Therefore, the odometry model can be trained in a self-supervised manner by calculating the point-wise matching loss, and it does not require the odometry ground truth \(T_{t-1, t}\).
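
As a concrete illustration of Eqs. (1) and (2), the following minimal sketch (assuming NumPy arrays and a \(4 \times 4\) homogeneous matrix for \({\hat{T}}_{t-1, t}\); the function name is illustrative, not part of our implementation) applies an estimated transformation to a source scan and its normal vectors:

```python
import numpy as np

def apply_transform(points, normals, T):
    """Apply an estimated SE(3) transform T (4x4) to a source scan,
    following Eqs. (1)-(2): points are rotated and translated, while
    normals are only rotated. `points` and `normals` are (N, 3) arrays."""
    R, t = T[:3, :3], T[:3, 3]        # rot(T) and the translation part
    points_hat = points @ R.T + t     # Eq. (1): P_hat_{t-1} = T ⊙ P_t
    normals_hat = normals @ R.T       # Eq. (2): N_hat_{t-1} = rot(T) ⊙ N_t
    return points_hat, normals_hat
```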

Fig. 3

Illustration of proposed framework of image-to-LiDAR saliency knowledge transfer for attention-guided 3D point cloud understanding

3.2 Framework overview

In Fig. 3, we show the overview of image-to-LiDAR saliency knowledge transfer for 3D point cloud understanding. There are three main sub-tasks: (1) image-to-LiDAR saliency knowledge transfer to generate a pseudo-saliency dataset of point clouds, (2) LiDAR-to-LiDAR pseudo-saliency learning using LiDAR-based deep models, and (3) saliency-guided 3D point cloud understanding by integrating the saliency information.

Fig. 4

Visualization examples of point cloud and corresponding saliency pseudo-ground-truth (GT) map from our FordSaliency dataset

First, we propose a large-scale pseudo-saliency dataset (FordSaliency) for point clouds by assigning the saliency values of RGB images to the corresponding points registered on the images. Figure 4 shows visualization examples of a point cloud and the corresponding saliency pseudo-ground-truth map from our FordSaliency dataset. Then, we train LiDAR-based models on the proposed pseudo-saliency dataset to learn point cloud saliency features. Next, we propose a saliency-guided two-stream network (SalLiDAR) for large-scale point cloud segmentation. The saliency prediction is not only used as an input feature for the semantic module but also adopted in a saliency-guided loss to help the semantic module (1) learn richer features for salient points and (2) reduce the influence of the imbalance of points across semantic classes via saliency constraints.
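
The image-to-LiDAR transfer step can be summarized by the sketch below, which projects LiDAR points into the camera image and samples the image saliency map as pseudo labels. The calibration inputs and the function name are illustrative assumptions; the actual registration follows the calibration provided with the FordCampus [37] data.

```python
import numpy as np

def lidar_pseudo_saliency(points, saliency_img, K, T_cam_lidar):
    """Assign image saliency values to LiDAR points (a sketch of the
    image-to-LiDAR transfer). points: (N, 3) LiDAR coordinates,
    saliency_img: (H, W) image saliency in [0, 1], K: (3, 3) camera
    intrinsics, T_cam_lidar: (4, 4) LiDAR-to-camera extrinsics."""
    H, W = saliency_img.shape
    pts_h = np.c_[points[:, :3], np.ones(len(points))]   # homogeneous LiDAR coords
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                # LiDAR frame -> camera frame
    pseudo = np.zeros(len(points))                        # points outside the image keep 0
    front = cam[:, 2] > 0                                 # only points in front of the camera
    uv = (K @ cam[front].T).T                             # perspective projection
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(front)[inside]
    pseudo[idx] = saliency_img[v[inside], u[inside]]      # sample image saliency
    return pseudo                                         # (N,) pseudo-saliency labels
```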

Finally, a Saliency-guided LiDAR Odometry Network (SalLONet) is proposed, which applies the point cloud saliency and semantic information to improve the performance of the odometry model. The objective of LiDAR odometry estimation is to output the pose by matching two point clouds; in other words, it can be regarded as a registration problem between two LiDAR scans. Therefore, our saliency-guided LiDAR odometry model is motivated by two observations: (1) dynamic points should be suppressed as much as possible, since they may degrade odometry estimation during registration; and (2) static points should be given higher priority so that the model focuses more on salient static points for feature matching. To this end, we utilize our SalLiDAR model to predict the semantics and saliency map of the point cloud for odometry estimation. The semantic and saliency predictions are not only fed into the odometry model as input features but also integrated into a saliency-guided odometry loss to regularize the odometry model.

Fig. 5

Framework of proposed two-stream semantic segmentation model. The saliency prediction network is pre-trained on our FordSaliency dataset

3.3 Proposed method

  1. Learning point cloud saliency: To learn point cloud saliency representations, we adopt existing LiDAR-based semantic segmentation models [8, 26, 47] as backbones for the feature extractor. As shown in Fig. 3b, given a 3D LiDAR point cloud with coordinates and corresponding point-wise features, we first feed it into the feature extractor to obtain the representation of each point. These learned features are then passed through a saliency prediction layer to output the saliency score map of the input point cloud. We considered two types of models for learning the saliency distribution on point clouds: (i) classification-based saliency prediction and (ii) saliency regression. In this work, we use models trained with saliency regression, as it is a commonly used approach and the two methods perform similarly. More details can be found in our previous work [38].

  2. Two-stream segmentation model: As depicted in Fig. 5, we develop a two-stream semantic segmentation model on the point cloud by combining features from the saliency module and the semantic module. We feed an input point cloud into the saliency branch to predict the saliency distribution of the whole scene. Meanwhile, the point cloud is also fed into the semantic branch to extract point features and output the semantic class predictions. To validate the effectiveness of the learned point cloud saliency distribution knowledge, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on the FordSaliency dataset. More details can be found in our previous work [38].

    Fig. 6

    Architecture of the Saliency-guided LiDAR Odometry Network (SalLONet). \({\mathcal {{{C}}}}_{t}\), \({\mathcal {{{C}}}}_{t-1}\), \({\mathcal {{{S}}}}_{t}\), \({\mathcal {{{S}}}}_{t-1}\), \({\mathcal {{{R}}}}_{t}\), and \({\mathcal {{{R}}}}_{t-1}\) represent the predicted LiDAR semantic maps, LiDAR saliency maps, and range images of the two consecutive point clouds \({\mathcal {{P}}}_{t}\) and \({\mathcal {{P}}}_{t-1}\). The transformation \(\hat{\varvec{T}}\) between the two point clouds consists of the estimated translation t and rotation q (i.e., the pose)

  3. LiDAR odometry estimation module: As shown in Fig. 6, the semantic map and saliency map are first predicted by the saliency and semantic modules of SalLiDAR. For odometry estimation, we convert two consecutive LiDAR point clouds, together with their predicted saliency and semantic maps, into range images and concatenate them as the input of the odometry module. The outputs of the odometry module are the feature vectors of the translation \({\textbf {t}}\) and rotation \({\textbf {q}}\) between the two LiDAR point clouds, from which we construct the rigid-body transformation \({\hat{T}}_{t-1, t}\). The source LiDAR scan \({\mathcal {P}}_{t}\) can be transformed into \(\hat{{\mathcal {P}}}_{t-1}\) by \({\hat{T}}_{t-1, t}\) to match the target LiDAR scan \({\mathcal {P}}_{t-1}\). Thus, the odometry module can be supervised by the point-wise matching errors between the transformed scan \(\hat{{\mathcal {P}}}_{t-1}\) and the target scan \({\mathcal {P}}_{t-1}\).

    Fig. 7

    Demonstrations of the semantics and saliency map for LiDAR odometry estimation. The top depicts the binarized semantics based on dynamic (e.g., car) and static points (e.g., building, traffic sign), and the bottom shows the point-wise matching with saliency maps between two consecutive scans

To regularize the odometry module, we exploit the saliency and semantic segmentation information in a saliency-guided odometry loss for model training. Previous odometry studies [5, 6] have shown that dynamic points may degrade the performance of odometry estimation; thus, we exploit the predicted semantic map to suppress the effect of dynamic points. In addition, we utilize the predicted saliency map to increase the priority of static salient points when matching two LiDAR scans. To obtain the saliency and semantic predictions for the LiDAR odometry module, we initialize and freeze the parameters of the saliency branch with the weights pre-trained on our FordSaliency [38] dataset. We also initialize and freeze the parameters of the semantic module with the weights learned from the SemanticKITTI [41] dataset. Since the predicted saliency distribution represents the attention level of each point, we can apply it to constrain the parameter optimization of the odometry module. Similar to the proposed SalLiDAR model, we propose three different ways to integrate the saliency and semantic information for LiDAR odometry estimation:

Table 2 Assignment of dynamic (D\(\rightarrow\)0) and static (S\(\rightarrow\)1) points based on semantic categories defined in SemanticKITTI [41] dataset

SalLONet-I: Saliency-guided odometry loss with saliency prediction and semantic mask for odometry estimation. If we remove the saliency and semantic concatenation module in Fig. 6, we obtain SalLONet-I. Considering that dynamic points (e.g., a moving car or pedestrian) may adversely affect odometry estimation, we first convert the predicted semantic map into a binarized mask indicating static points (e.g., building, road) and dynamic points (e.g., car, person). As shown in Fig. 7 (top) and Table 2, the predicted semantics are binarized into dynamic and static points based on the semantic categories defined in the SemanticKITTI [41] dataset. A point whose semantic class is moving or potentially moving is defined as a dynamic point; for example, a point labeled either car or moving car is treated as dynamic. Then, the binarized semantic mask is applied to suppress the adverse effects of dynamic points and increase the weights of static points in the point-wise matching odometry loss. Additionally, we apply the saliency prediction to the odometry loss during training, thus encouraging the odometry model to focus more on static salient points for matching, as shown in Fig. 7 (bottom).
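
A minimal sketch of this binarization step is given below. The numeric class IDs follow the commonly used 19-class SemanticKITTI training mapping and are stated here only as an illustrative assumption; they should be adapted to the label mapping actually in use.

```python
import numpy as np

# Dynamic (moving or potentially moving) classes per Table 2. The IDs below
# assume the 19-class SemanticKITTI training mapping (1=car, 2=bicycle,
# 3=motorcycle, 4=truck, 5=other-vehicle, 6=person, 7=bicyclist, 8=motorcyclist).
DYNAMIC_IDS = {1, 2, 3, 4, 5, 6, 7, 8}

def binarize_semantics(sem_labels):
    """Map per-point semantic predictions to a static/dynamic mask:
    dynamic points -> 0 (suppressed), static points -> 1 (kept)."""
    sem_labels = np.asarray(sem_labels)
    return np.where(np.isin(sem_labels, list(DYNAMIC_IDS)), 0.0, 1.0)
```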

Following the study [42], we use geometric losses to optimize the odometry estimation module by calculating the point-wise matching errors:

$$\begin{aligned} {\mathcal {L}}^{odom} = \lambda \cdot {\mathcal {L}}_{p2n} + {\mathcal {L}}_{n2n} \end{aligned}$$
(3)

where \(\lambda\) is a constant that balances the two losses, and \({\mathcal {L}}_{p2n}\) and \({\mathcal {L}}_{n2n}\) are the point-to-plane loss and plane-to-plane loss, respectively, which can be represented as follows:

$$\begin{aligned} {\mathcal {L}}_{p2n} = \frac{1}{N} \sum _{i=1}^{N} | (\hat{p}_{i} - {p}_{i}) \cdot {n}_{i} |_{2}^{2} \end{aligned}$$
(4)
$$\begin{aligned} {\mathcal {L}}_{n2n} = \frac{1}{N} \sum _{i=1}^{N} \Vert \hat{n}_{i} - {n}_{i} \Vert _{2}^{2} \end{aligned}$$
(5)

where \(\hat{p}_{i}\) and \({p}_{i}\) are the coordinates of the corresponding points in \(\hat{{\mathcal {P}}}_{t-1}\) and \({\mathcal {P}}_{t-1}\), respectively, and \(\hat{n}_{i}\) and \({n}_{i}\) are the corresponding normal vectors of \(\hat{{\mathcal {N}}}_{t-1}\) and \({\mathcal {N}}_{t-1}\), respectively.
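
The sketch below (in PyTorch; the function name is illustrative, and the point correspondences between the transformed and target scans are assumed to have been established beforehand, e.g., by a nearest-neighbour search) computes the losses of Eqs. (3)-(5):

```python
import torch

def geometric_odometry_loss(p_hat, p, n_hat, n, lam=1.0):
    """Geometric matching losses of Eqs. (3)-(5) for corresponding points.
    p_hat, p: (N, 3) transformed/target coordinates; n_hat, n: (N, 3)
    transformed/target normals; lam is the balancing constant lambda."""
    l_p2n = (((p_hat - p) * n).sum(dim=1) ** 2).mean()   # point-to-plane, Eq. (4)
    l_n2n = ((n_hat - n) ** 2).sum(dim=1).mean()         # plane-to-plane, Eq. (5)
    return lam * l_p2n + l_n2n                           # combined loss, Eq. (3)
```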

Fig. 8

Visualization of saliency and semantic maps for odometry estimation. The results include (a) the semantic prediction and (b) the saliency prediction of the proposed SalLiDAR model; (c) the binarized semantic map (i.e., dynamic and static points); and (d) the weighted map from saliency and binarized semantics used for the loss weighting of the proposed SalLONet by Eq. (7)

To guide the odometry model to focus more on salient static points for matching, we apply the saliency and binarized semantic maps to the geometric-based loss. Thus, the saliency-guided odometry loss can be represented as:

$$\begin{aligned} \hat{{\mathcal {L}}}^{odom} = \frac{1}{N} \sum _{i=1}^{N} {l}^{odom}_{i} * {\mathcal {W}}_{i} \end{aligned}$$
(6)
$$\begin{aligned} {\mathcal {W}}_{i} = \exp {(s_{i}^{t}*{s}_{i}^{t-1}*{c}_{i}^{t}*{c}_{i}^{t-1})} \end{aligned}$$
(7)

where i is the point index; \(\hat{{\mathcal {L}}}^{odom}\) is the weighted odometry loss; \({l}^{odom}_{i}\) denotes the point-wise odometry matching loss of point \(p_{i}\); and \({\mathcal {W}}_{i}\) is the corresponding point-wise weight. \(s_{i}^{t}\) and \({s}_{i}^{t-1}\) denote the saliency predictions of point \(p_{i}\) from scans t and \(t-1\); \(c_{i}^{t}\) and \({c}_{i}^{t-1}\) denote the binarized semantic predictions of point \(p_{i}\) from scans t and \(t-1\); and \(*\) represents element-wise multiplication. Figure 8 illustrates a visualization of the saliency and semantic maps used for attention-guided LiDAR odometry estimation. We can observe that the weights of dynamic points (e.g., car) are suppressed, while static points (e.g., building) are highlighted by the binarized semantics and the LiDAR saliency map.
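
A minimal PyTorch sketch of this weighting (the function name is illustrative) is shown below. Note that dynamic or non-salient points receive a weight of \(\exp (0)=1\), while salient static points are up-weighted towards e, so the static points dominate the matching loss:

```python
import torch

def saliency_guided_loss(l_odom, s_t, s_prev, c_t, c_prev):
    """Weight the per-point odometry loss by saliency and binarized
    semantics as in Eqs. (6)-(7). All inputs are (N,) tensors: l_odom is
    the point-wise matching loss, s_* are saliency scores in [0, 1], and
    c_* are static/dynamic masks (1 = static, 0 = dynamic)."""
    w = torch.exp(s_t * s_prev * c_t * c_prev)   # Eq. (7): per-point weight in [1, e]
    return (l_odom * w).mean()                   # Eq. (6): weighted odometry loss
```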

SalLONet-II: Saliency prediction and semantic mask as descriptors for odometry estimation. If we remove the point-wise saliency-guided loss module in Fig. 6, we obtain SalLONet-II. We append the normalized saliency and binarized semantic predictions to the point cloud coordinates as input features for the odometry model. We believe that prior saliency knowledge and high-level semantic information can aid feature learning and localization in odometry estimation.
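
As an illustration of how these descriptors enter the network, the sketch below stacks them as extra channels of the projected range image; the exact channel layout and the function name are assumptions for illustration.

```python
import numpy as np

def build_odometry_input(range_img, normals_img, sal_img, sem_bin_img):
    """Stack the per-pixel descriptors of one projected scan into a
    multi-channel input for the odometry module (a sketch of the
    SalLONet-II/III input). Shapes: range_img (H, W), normals_img (H, W, 3),
    sal_img (H, W) in [0, 1], sem_bin_img (H, W) with 1 = static, 0 = dynamic."""
    return np.concatenate(
        [range_img[..., None], normals_img, sal_img[..., None], sem_bin_img[..., None]],
        axis=-1,
    )  # (H, W, 6); the two consecutive scans are then concatenated as in Fig. 6
```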

SalLONet-III: Saliency distribution and semantic prediction as descriptors and attentive loss guidance for odometry estimation. Figure 6 shows the SalLONet-III model. In this model, the saliency and semantic maps are not only utilized as additional input features for the odometry module but also used in the saliency-guided odometry loss for optimization during training.

4 Experimental analysis

4.1 Implementation details

For point cloud saliency detection, we employ the PointNet [26], RandLA-Net [47], and Cylinder3D [8] models as feature extractors. For 3D semantic segmentation, we use RandLA-Net [47] and Cylinder3D [8] as baselines. For LiDAR odometry, we adopt the DeLORA [42] model as the baseline. Following the DeLORA [42] model, range images and normal features are converted from two consecutive raw LiDAR point clouds as input features for odometry estimation. We adopt our SalLiDAR model [38] to generate saliency and semantic predictions for the LiDAR odometry module. For a fair comparison, all odometry networks of the baseline and our proposed methods are randomly initialized and trained on the KITTI [50] odometry dataset from scratch with 5-fold validation. The initial learning rate is 1e-5, and all odometry models are trained with self-supervised learning for 100 epochs.
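
For reference, a minimal sketch of such a self-supervised training loop is shown below. Only the learning rate and epoch count come from the text above; the optimizer choice (Adam) and the model/dataloader interfaces are assumptions for illustration.

```python
import torch

def train_odometry(model, loader, epochs=100, lr=1e-5, device="cuda"):
    """Sketch of a self-supervised odometry training loop: each batch holds
    two consecutive scans plus their saliency/semantic maps, and the model
    is assumed to return the (saliency-guided) odometry matching loss."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(batch)          # no ground-truth poses are used
            optim.zero_grad()
            loss.backward()
            optim.step()
```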

4.2 Datasets and experimental setup

4.2.1 LiDAR FordSaliency dataset

Based on the data of the FordCampus [37] dataset, we build a point cloud saliency dataset (FordSaliency) for training LiDAR-based saliency models. We use dataset1 and dataset2 of FordSaliency as the validation set and training set, respectively. More details can be found in [38].

4.2.2 SemanticKITTI dataset

SemanticKITTI [41] dataset is a well-known large-scale dataset for point cloud semantic segmentation. This dataset consists of 22 Velodyne driving-scene sequences, which are split into a training set (sequences 00–07 and 09–10), a validation set (sequence 08), and a testing set (sequences 11–21).

4.2.3 KITTI odometry dataset

We conduct all the odometry experiments on KITTI [50] odometry dataset, which provides LiDAR point clouds captured from the Velodyne LiDAR sensor. By following the odometry works [5, 42], the dataset is divided into a training set (Sequences 00–08) and a validation set (Sequences 09–10).

4.3 Evaluation metrics

For point cloud saliency detection, we use popular saliency metrics, including the Correlation Coefficient (CC), Similarity (SIM), and Kullback–Leibler Divergence (KLD), to evaluate the performance of the point cloud saliency model. For the performance evaluation of LiDAR semantic segmentation, we adopt the mean Intersection-over-Union (mIoU) as the evaluation metric, following the previous studies [8, 47]. For LiDAR odometry estimation, the average translational (\([\%]\)) and average rotational (\([\frac{deg}{100\,m}]\)) RMSE (Root Mean Square Error) are adopted to evaluate the performance of LiDAR odometry models.
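
A sketch of the three saliency metrics, following their standard definitions on per-point saliency vectors (normalization details are assumptions of this sketch), is given below:

```python
import numpy as np

def saliency_metrics(pred, gt, eps=1e-8):
    """Compute CC, SIM, and KLD between a predicted and a (pseudo)
    ground-truth saliency vector, both of shape (N,)."""
    # Correlation Coefficient between prediction and ground truth
    cc = np.corrcoef(pred, gt)[0, 1]
    # Similarity: histogram intersection of the normalized distributions
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    sim = np.minimum(p, g).sum()
    # Kullback-Leibler Divergence of the prediction from the ground truth
    kld = np.sum(g * np.log(g / (p + eps) + eps))
    return cc, sim, kld
```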

4.4 Results and performance analysis

4.4.1 LiDAR saliency results on FordSaliency dataset

We compare the performance of LiDAR-based saliency models with different feature extractors on our FordSaliency dataset. Figure 9 shows the visualization results of SalLiDAR models with different backbones on the FordSaliency validation set, and Table 3 reports their quantitative performance. From Fig. 9 and Table 3, we observe that, although the saliency annotations are pseudo-labels, all these LiDAR-based models are able to learn discriminative point cloud saliency representations for saliency distribution prediction. Moreover, the model with the Cylinder3D backbone predicts a better saliency distribution than the models with the other backbones. The models with the RandLA-Net and PointNet backbones can learn correlation and similarity features from the point cloud saliency annotations, as evidenced by the CC, SIM, and KLD values in Table 3. However, the model with the Cylinder3D backbone achieves higher CC and SIM and lower KLD. This suggests that a model with a voxel-based partition (e.g., 3D cylinder) could learn more powerful saliency representations than point-based models (Fig. 10).

Table 3 Results of SalLiDAR models with different backbones on FordSaliency dataset
Fig. 9

Point cloud saliency prediction results of the SalLiDAR model with different backbones on the FordSaliency dataset. Saliency values are scaled to the range [0, 1]; the colour bar indicates the saliency values corresponding to the saliency heat map fused on the colour images

Fig. 10

Visualization comparison of the baseline and proposed LiDAR segmentation models on SemanticKITTI [41]. From the first to the last column: the semantic ground truth, the semantic predictions of the baseline models, the semantic results of the proposed models, and the saliency predictions of the proposed models, respectively

Table 4 Performance comparison of proposed models and existing LiDAR segmentation methods on SemanticKITTI [41] test set

4.4.2 Semantic segmentation results on SemanticKITTI dataset

We report the LiDAR semantic segmentation performance on the test set of SemanticKITTI in Table 4. Note that all the test performance results in Table 4 are taken from the literature and the benchmark leaderboard of the SemanticKITTI [41] dataset. Comparing Table 3 and Table 4, the mIoU results on the test sequences indicate improved generalization ability on the larger set of evaluation samples. Compared to the baselines, all the models with SalLiDAR obtain better mIoU results. The proposed method also improves the segmentation performance on specific classes, such as car, truck, and parking, since the integration of our predicted saliency distribution makes the model attentive to these categories. Furthermore, the Cylinder3D model with SalLiDAR achieves better segmentation results than RandLA-Net with SalLiDAR, which indicates that a semantic segmentation model with better saliency predictions can provide more attentive information or features to improve performance. In particular, these experimental results demonstrate that the performance of LiDAR semantic segmentation models can be improved by the proposed saliency distribution integration and point-wise attention-guided loss. These comparison results validate the effectiveness of the pre-trained point cloud saliency models, even though they are trained on the FordSaliency dataset with pseudo-annotations.

Fig. 11

Trajectory comparison of the proposed SalLONet odometry models against the baseline odometry model [42] on Sequence 09 (top) and Sequence 10 (bottom) of the KITTI [50] odometry dataset

Fig. 12

Detailed trajectory results of proposed SalLONet odometry models against the baseline odometry model [42] on Sequence 09 (top) and Sequence 10 (bottom) of KITTI [50] odometry dataset

Table 5 Comparison of translational (\([\%]\)) and rotational (\([\frac{deg}{100\,m}]\)) errors on validation set of KITTI [50] odometry dataset

4.4.3 Odometry results on KITTI dataset

Figures 11 and 12 show the trajectory results of the proposed SalLONet models on Sequences 09–10 of the KITTI [50] odometry dataset. We can observe that the SalLONet models with saliency and semantic information produce trajectories closer to the ground truth than the baseline model. In Table 5, we present the quantitative results of the proposed approaches and six existing odometry methods. DeepLO [59] and Velas et al. [60] are supervised LiDAR odometry models, i.e., the ground-truth poses of the training set are used to train them. DeLORA [42] is an unsupervised LiDAR odometry model and does not require pose labels for training. Following the study [42], three unsupervised visual odometry estimation methods [61,62,63] are also included for comparison. As shown in Table 5, the three proposed SalLONet models improve on the baseline model, as evidenced by lower translational and rotational errors on Sequences 09–10 of the KITTI [50] odometry dataset. Among the unsupervised methods, SalLONet-III achieves the best results with the lowest errors on both validation sequences. In particular, its translational error on Sequence 10 (\(t_{\text {rel}}=4.940\)) even outperforms the supervised DeepLO [59] method (\(t_{\text {rel}}=5.020\)). In summary, these experimental results show that saliency and semantic information are effective for improving odometry estimation, which implicitly indicates the effectiveness of image-to-LiDAR saliency knowledge transfer.

Table 6 Ablation study: comparison of translational (\([\%]\)) and rotational (\([\frac{deg}{100\,m}]\)) errors on the validation set of the KITTI [50] odometry dataset

4.5 Ablation studies

We investigate the effectiveness of saliency and semantics for LiDAR odometry estimation. From Table 5, we observe that the SalLONet-III model achieves the best results by leveraging both semantic and saliency cues; thus, we conduct the ablation study on SalLONet-III to verify the influence of the saliency and semantic maps on LiDAR odometry estimation. In the proposed SalLONet-III method, we leverage saliency and semantic predictions for LiDAR odometry estimation simultaneously. We first validate the model with saliency information only, and we also train the model with semantic information only. The results of the ablation study are shown in Table 6. The experimental evaluation shows that the SalLONet model with both saliency and semantic information achieves superior performance on the KITTI [50] odometry dataset. In addition, the SalLONet model with saliency only outperforms the baseline model, which demonstrates that saliency information is effective for improving LiDAR odometry estimation. The SalLONet model with semantics only obtains competitive results compared with the baseline model. However, the full SalLONet model benefits from both saliency and semantic information, achieving the lowest translational and rotational errors on Sequences 09–10 of the KITTI [50] odometry dataset.

4.6 Limitations

The human visual system takes advantage of visual attention mechanisms to explore and analyse scenes. Inspired by this concept, we conducted an exploratory study to investigate how attention (i.e., saliency maps) can be integrated into vision tasks for point clouds. Therefore, this work primarily focuses on the accuracy improvement in point cloud vision tasks achieved by incorporating additional cues, such as saliency knowledge for semantic segmentation and saliency and semantics for odometry, rather than on the computational efficiency of the approach.

However, the SalLONet-III version requires saliency and segmentation prediction networks to run prior to odometry, using their outputs along with the point cloud data as inputs. This increases computational complexity depending on the selected models for saliency and segmentation predictions, which is a limitation despite the improvement in odometry accuracy compared to the baseline odometry model. It is worth noting that SalLONet-I does not require saliency and segmentation prediction as inputs to the odometry network; instead, it employs these cues only during training as loss weighting parameters. Thus, during inference, SalLONet-I operates similarly to the baseline model without additional complexity or computational cost, while still outperforming the baseline model.

Another limitation is the difficulty of evaluating and validating the FordSaliency dataset and the models trained on it for saliency prediction, because the point cloud saliency ground-truth maps in the proposed FordSaliency dataset are pseudo-saliency values obtained from state-of-the-art image saliency models. Moreover, due to the sparse nature of point cloud data, only limited image regions are projected or represented, so some salient and non-salient information available in the image domain is missing. Therefore, aside from the saliency evaluation metrics comparing predictions with pseudo-ground-truth values, the most feasible way to verify the effectiveness of point cloud saliency predictions is to test them on saliency-guided vision tasks, as proposed in this work. We hope this work opens new research directions for exploring these limitations and inspires new ideas to solve or minimize them.

5 Conclusion

This article has presented our research on establishing LiDAR-based saliency detection models with image-to-LiDAR transfer learning to improve the performance of 3D point cloud understanding tasks. We propose a Saliency-guided LiDAR Odometry Network (SalLONet) that combines the saliency and semantic information of point clouds. First, the saliency and semantic maps generated by the proposed two-stream semantic model are fed into the odometry module as feature representations of the input consecutive point clouds. Second, the saliency and semantic predictions are applied to the odometry loss. To alleviate the effect of dynamic points on pose regression, we binarize the semantic prediction into dynamic and static points based on the semantic class. The binarized semantics are then used to suppress the dynamic points by point-wise multiplication in the loss weighting. To further encourage the odometry module to learn discriminative features, the saliency map is leveraged to increase the loss weights of salient static points for matching two point clouds. Extensive experimental results on the KITTI [50] odometry dataset have demonstrated the strong performance of the proposed odometry model with saliency and semantic information, which simultaneously considers the influence of dynamic and static salient points on pose estimation.