1 Introduction

Robust perception of the dynamic environment is a fundamental task for many real-world applications such as autonomous driving, robot navigation, augmented reality, and human-robot interaction systems. The goal of scene flow estimation is to estimate 3D displacement vectors between two consecutive scenes, representing all observed points in the scene as a dense or semi-dense 3D motion field. Scene flow can therefore serve as an upstream step for challenging high-level computer vision tasks such as object tracking, odometry, and action recognition. With prior knowledge of the camera’s intrinsic parameters, the 3D scene flow can be projected onto the image plane to obtain its 2D counterpart in pixel coordinates, which is called optical flow.
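
To make this relationship concrete, the following NumPy sketch projects a 3D point before and after adding its scene flow vector through a pinhole intrinsic matrix \(K\) and takes the difference of the two pixel positions; the intrinsic values and the point are placeholders chosen only for illustration.

```python
import numpy as np

def scene_flow_to_optical_flow(p, s, K):
    """Project 3D points p and their scene flow s into 2D pixel motion (optical flow).

    p, s : (N, 3) arrays of 3D points and 3D flow vectors in camera coordinates.
    K    : (3, 3) pinhole intrinsic matrix (assumed to be known).
    """
    def project(x):
        uvw = x @ K.T                      # homogeneous pixel coordinates
        return uvw[:, :2] / uvw[:, 2:3]    # perspective division
    return project(p + s) - project(p)     # (N, 2) optical flow in pixels

# Toy usage with placeholder intrinsics and a single point 10 m in front of the camera
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 180.0],
              [0.0, 0.0, 1.0]])
p = np.array([[2.0, 1.0, 10.0]])
s = np.array([[0.5, 0.0, -1.0]])           # moving right and toward the camera
print(scene_flow_to_optical_flow(p, s, K))
```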

Many approaches propose pixel-wise scene flow estimation using stereo image sequences by combining geometry reconstruction with optical flow estimation to obtain dense scene flow (Franke et al., 2005; Huguet and Devernay, 2007; Wedel et al., 2008; Ilg et al., 2018; Chen et al., 2020; Schuster et al., 2020, 2018; Saxena et al., 2019). Despite significant advances in such approaches, the overall accuracy of the resulting scene flow is highly dependent on the image quality, which can be poor under adverse lighting conditions. Compared to stereo systems, LiDAR sensors can accurately capture 3D geometry in the form of 3D point clouds and are less sensitive to lighting conditions. Therefore, there is an increasing emphasis on estimating scene flow directly from 3D point clouds.

Handling point clouds and finding correspondences in 3D space is more challenging due to the irregularity, sparsity of points, and varying point density of the scene. To tackle these challenges, several techniques develop deep neural architectures to estimate scene flow from point clouds. Some of these methods project the point cloud onto a permutohedral lattice (Gu et al., 2019) and then use bilateral convolutions (Jampani et al., 2016). Others organize 3D point clouds into voxels (Gojcic et al., 2021; Li et al., 2022) and use sparse convolutions (Choy et al., 2019) to facilitate scene flow prediction. However, these regular representations can introduce discretization artifacts and information loss that negatively affect the accuracy of the network.

With the advent of point-based networks on 3D point clouds (Wu et al., 2019; Qi et al., 2017b), many works estimate scene flow directly from raw point clouds using the multi-layer perceptron (MLP) as in Liu et al. (2019), Wei et al. (2021), Wu et al. (2020), Kittenplon et al. (2021), Wang et al. (2021a), Gu et al. (2022), Wang et al. (2022a, 2022b), Cheng and Ko (2022) without the need for regular or intermediate representations. All of these techniques build a flow embedding module at coarse resolutions and then either use hierarchical refinement modules along with upsampling (Cheng and Ko, 2022; Liu et al., 2019; Wu et al., 2020; Wang et al., 2021a, 2022b, a) or use gated recurrent units (GRUs) (Cho et al., 2014) with iterative flow updating for the refinement process (Wei et al., 2021; Kittenplon et al., 2021; Gu et al., 2022). Despite their ability to capture both near and far matches, GRU-based methods (Wei et al., 2021; Kittenplon et al., 2021; Gu et al., 2022) are less efficient in terms of runtime due to iterative flow updates along with expensive flow embedding layers.

Fig. 1 Our RMS-FlowNet++ shows an accurate scene flow (Acc3DR) with a low runtime. The accuracy is tested on \(\textrm{KITTI}_{\textrm{s}}\) (Menze and Geiger, 2015) with 8192 points as input and the runtime is analyzed for all methods equally on a Geforce GTX 1080 Ti

Following hierarchical schemes, Wang et al. (2022b) propose a double attentive flow embedding along with explicit learning of the residual scene flow. Their extension in Wang et al. (2022a) further improves the results by jointly learning backward constraints in the all-to-all flow embedding layer to capture distant matches. However, both methods (Wang et al., 2022a, 2022b) reduce the scene flow resolution to a quarter of the input points, and they show that obtaining high accuracy of the scene flow at the full input resolution requires a further refinement module. This can be computationally expensive, while using a simple interpolation method instead can degrade the overall accuracy. In addition, the use of an all-to-all flow embedding layer in Wang et al. (2022a) increases the size of the correlation volume, which in turn increases the computational load of further operations. More recently, Bi-PointFlowNet (Cheng and Ko, 2022) proposes to learn bidirectional correlations from coarse to fine, searching for correspondences in both directions, and uses a flow embedding layer at full input resolution.

All of the above point-based methods use Farthest-Point-Sampling (FPS) and rely on a K-Nearest-Neighbor (KNN) search with a large set of correspondences during the flow embedding, which both increases the computational and memory requirements and limits the ability to handle large point clouds.

Fig. 2 The challenges of Random-Sampling (RS) (right) compared to Farthest-Point-Sampling (FPS) (left): Both techniques sample two consecutive scenes \(P^t\) (blue) and \(Q^{t+1}\) (green) into red and pink samples, respectively. Areas of low density are often not sufficiently covered by Random-Sampling (RS), resulting in dissimilar patterns. The patterns of the corresponding objects are much more similar when FPS is used, making it easier to match the points (Color figure online)

To tackle these challenges, we present our RMS-FlowNet++ – a hierarchical point-based learning approach that requires smaller correspondence sets compared to the state-of-the-art methods and outperforms them when using FPS on the KITTI (Menze and Geiger, 2015) data set. In addition, our model allows the use of RS and is therefore more efficient, has a smaller memory footprint, and shows comparable results at a lower runtime compared to the other state-of-the-art methods (cf. Fig. 1).

The advantage of RS combined with the smaller correspondence set results in our model being the only one that can robustly estimate scene flow on a very large set of points, as shown in Sect. 4.5. However, using RS for scene flow estimation is challenging for two main reasons: (1) RS reflects the spatial distribution of the input point cloud, which is problematic if that distribution is far from uniform; this is a disadvantage compared to FPS. (2) Corresponding (rigid) areas between consecutive point clouds are sampled differently by RS, while FPS yields more similar patterns. Both issues are illustrated in Fig. 2.

To overcome these problems, we propose a novel Patch-to-Dilated-Patch flow embedding consisting of four layers with lateral connections (cf. Fig. 5) to incorporate a larger receptive field during matching without increasing the physical set of correspondences. Overall, our fully supervised architecture consists of a hierarchical feature extraction, an optimized flow embedding, and scene flow prediction at multiple scales. A preliminary version of our network design was published as RMS-FlowNet (Battrawy et al., 2022); here, we improve the overall design, which leads to more accurate results with higher efficiency. Our contribution can be summarized as follows:

  • We propose RMS-FlowNet++ – an end-to-end scene flow estimation network that operates on dense point clouds with high accuracy.

  • Our network consists of a hierarchical scene flow estimation with a novel flow embedding module (called Patch-to-Dilated-Patch) which is suitable for the combination with Random-Sampling.

  • Compared to our previous work in RMS-FlowNet (Battrawy et al., 2022), we significantly reduce the size of the correspondence set, and omit some layers in the feature extraction module to increase the overall efficiency. Furthermore, we show that a feature-based search can increase the overall accuracy without sacrificing the efficiency.

  • We explore the advantages of RS over FPS on high-density point clouds and its ability to generalize during inference.

  • We provide an extensive benchmark showing our strong results in terms of accuracy, generalization, and runtime compared to previous methods.

  • Finally, we investigate the robustness of our network to occlusions and evaluate it for points acquired at longer distances (> 35 m).

2 Related Work

3D scene flow was first introduced by Vedula et al. (1999), developed using image-based (e.g., RGB-D) setups (Hadfield and Bowden, 2011; Hornacek et al., 2014; Quiroga et al., 2014; Jaimez et al., 2015a, b; Sun et al., 2015), and then further developed in advanced deep learning networks (Shao et al., 2018; Teed and Deng, 2021). Since RGB-D sensors can only perceive depth at short distances, there have been many works that estimate scene flow from stereo images by jointly estimating disparity and optical flow (Ilg et al., 2018; Jiang et al., 2019; Ma et al., 2019; Chen et al., 2020). However, two-view geometry has inherent limitations in self-driving cars, such as inaccuracies in disparity estimation in distant regions. It can also suffer from poor lighting conditions, such as in dark tunnels. Our work focuses on learning scene flow directly from point clouds, without relying on RGB images.

Fig. 3 We describe the generic pipeline of recent scene flow estimation methods. Like our previous work Battrawy et al. (2022), our RMS-FlowNet++ estimates scene flow directly from raw point clouds and extracts features based on RandLA-Net (Hu et al., 2020). Compared to recent scene flow methods, our novel Patch-to-Dilated-Patch allows the use of RS along with hierarchical or coarse-to-fine refinement

Scene Flow from Point Clouds: With the recent advent of LiDAR sensors, which provide highly accurate 3D geometry of the environment for autonomous driving and robot navigation, it becomes increasingly important to estimate scene flow directly from point clouds in 3D world space. In this context, there is some work Dewan et al. (2016), Ushani et al. (2017) that formulates the task of scene flow estimation from point clouds as an energy optimization problem without taking advantage of deep learning. Advances in deep learning on 3D point clouds (Qi et al., 2017a, b; Su et al., 2018; Wu et al., 2019) make neural networks more attractive and accurate for 3D scene flow estimation than the traditional methods (Ushani and Eustice, 2018; Wang et al., 2018; Liu et al., 2019; Behl et al., 2019; Gu et al., 2019; Wu et al., 2020; Puy et al., 2020; Wei et al., 2021; Kittenplon et al., 2021; Li et al., 2021; Wang et al., 2021a; Gojcic et al., 2021; Gu et al., 2022; Cheng and Ko, 2022). These recent methods mostly follow a general scene flow estimation pipeline as shown in Fig. 3, but differ in how they represent point clouds, extract features, design the cost volume, or apply the refinement strategy. For example, with the breakthrough architecture of PointNet++ (Qi et al., 2017b), many works estimate scene flow directly from raw point clouds in an end-to-end fashion (Liu et al., 2019; Puy et al., 2020; Wei et al., 2021; Kittenplon et al., 2021; Wang et al., 2021b, a; Gu et al., 2022; Wang et al., 2022b, a; Dong et al., 2022). Based on PointNet++, FlowNet3D (Liu et al., 2019) is the first work to introduce a novel flow embedding layer. However, its accuracy is limited because there is only a single flow embedding layer and the correlation in local regions relies on the nearest spatial neighbor search, which may fail for long-range motion (i.e., distant matches). In an attempt to overcome the limitations of FlowNet3D, many approaches introduce hierarchical scene flow estimation, iterative unrolling methods, or work under rigidity assumptions.

Hierarchical Scene Flow: HPLFlowNet (Gu et al., 2019) introduces multi-scale correlation layers by projecting the points into a permutohedral lattice as in SplatNet (Su et al., 2018) and applying bilateral convolutional layers (BCL) (Kiefel et al., 2014; Jampani et al., 2016). Despite the efficiency of HPLFlowNet (Gu et al., 2019) on high-density point clouds, the accuracy of the network is prone to unavoidable errors due to the splatting and slicing process. PointPWC-Net (Wu et al., 2020) avoids the grid representation in Gu et al. (2019) and improves the scene flow accuracy on raw point clouds based on PointConv (Wu et al., 2019) by regressing multi-scale flows from coarse to fine. Following the hierarchical point-based designs, HALFlow (Wang et al., 2021a) uses the point feature learning of PointNet++ (Qi et al., 2017b), but proposes a hierarchical attention learning flow embedding with double attentions, leading to better results than PointPWC-Net (Wu et al., 2020). Further improvements are proposed in Wang et al. (2022b, 2022a) to develop the flow embedding of Wang et al. (2021a) through explicit prediction of residual flow (Wang et al., 2022b) and backward reliability validation (Wang et al., 2022a). All previous methods take advantage of FPS for downsampling to provide accurate scene flow estimation, but at the cost of efficiency, especially for dense points. Compared to these methods, our network solves the challenge of using RS to work with high-density points.

Iterative Unrolling for Scene Flow: Besides hierarchical flow embedding schemes, a new trend started in FLOT (Puy et al., 2020), inspired by Li et al. (2019), Monga et al. (2021), Teed and Deng (2020), to iteratively refine the scene flow by unrolling a fixed number of iterations to globally optimize an optimal transport map (Titouan et al., 2019). PV-RAFT (Wei et al., 2021), FlowStep3D (Kittenplon et al., 2021), and RCP (Gu et al., 2022) extend unrolling techniques from optimization problems to learning-based models by using gated recurrent units (GRUs) (Cho et al., 2014) and capturing both local and global correlations. We find that iterative unrolling with a fixed number of iterations and repeated use of flow re-embedding works well at low input resolution, but is inefficient compared to hierarchical designs.

Rigidity Assumption for Scene Flow: Axiomatic concepts of explicit rigidity assumptions with ego-motion estimation are explored in Gojcic et al. (2021), Dong et al. (2022). A plug-in refinement module is proposed by HCRF-Flow (Li et al., 2021), which uses high-order conditional random fields (CRFs) to refine the scene flow by applying the rigidity condition at the region level. Our RMS-FlowNet++ is free of any rigidity constraint, so it can work with non-rigid bodies, such as pedestrians.

Flow Embedding on Point Clouds: The irregular data structure of point clouds makes it difficult to build cost volumes as with image-based solutions (Ilg et al., 2017; Sun et al., 2018; Teed and Deng, 2020). Therefore, previous works such as Gu et al. (2019), Kittenplon et al. (2021), Liu et al. (2019), Wu et al. (2020) design complicated flow embedding layers to aggregate the matching costs from consecutive point clouds.

Patch-to-Point Correlation: FlowNet3D (Liu et al., 2019) introduces the flow embedding in a patch-to-point manner, which means that the set of neighboring correspondences in the target point cloud set are grouped into the source one based on the Euclidean space. Then, the correlations are learned using multi-layer perceptron (MLP) followed by max-pooling to aggregate the features of the correspondence set.

Patch-to-Patch Correlation: To incorporate a large field of correlations leading to better accuracy, HPLFlowNet (Gu et al., 2019) proposes a multi-scale patch-to-patch design that takes advantage of the regular representation using the permutohedral lattice (Kiefel et al., 2014; Jampani et al., 2016). Apart from regular representations, PointPWC-Net (Wu et al., 2020) uses a patch-to-point flow embedding layer to aggregate the features of the correspondence set in the adjacent frames based on the point-wise continuous convolution in PointConv (Wu et al., 2019). A point-to-patch embedding is then applied to aggregate the set of neighbor correspondences in the source itself. Instead of using the backbone of PointConv (Wu et al., 2019), HALFlow (Wang et al., 2021a) uses a two-stage attention mechanism to softly weight the neighboring correspondence features and allocate more attention to the regions with correct correspondences. With two-stage attention, hierarchical and explicit learning of the residual scene flow is proposed by Wang et al. (2022b) to reduce the inconsistencies between the correlations and to handle fast-moving objects. Bi-PointFlowNet (Cheng and Ko, 2022) uses the patch-to-patch mechanism in a bidirectional manner across all multi-scale layers, which requires intensive computation of forward-backward KNNs and additional refinement at full input resolution. Compared to Cheng and Ko (2022), our network finds reliable correlations under bidirectional constraints using the cosine similarity matrix at the coarse scale, and then operates unidirectionally at the upper scales, requiring only a small number of correlations defined by the KNN search. This also makes our solution more efficient without sacrificing accuracy, since it eliminates the need for refinement at full input resolution.

All-to-All Correlations: Several approaches compute a global cosine similarity based on latent features and then learn soft correlations by iteratively refining an optimal transport problem using the non-parametric Sinkhorn algorithm, as in FLOT (Puy et al., 2020). WSLR (Gojcic et al., 2021) also uses Sinkhorn, but refines the scene flow at the object level based on the rigidity assumption and in combination with ego-motion estimation. Apart from object-level refinement, FlowStep3D (Kittenplon et al., 2021) computes an initial global correlation matrix and then uses gated recurrent units (GRUs) for local region refinement to iteratively align point clouds. Another GRU-based method is proposed by PV-RAFT (Wei et al., 2021), but its flow embedding design combines point-based and voxel-based features to preserve fine-grained information while encoding large correspondence sets at the same time. WM3D (Wang et al., 2022a) designs all-to-all correlations, supported by reliability validation, but only at the coarse scale resolution, where each point in the source uses all points in the target for correlation and, conversely, each point in the target obtains correlations with all points in the source. Then, a two-stage attentive flow embedding as in Wang et al. (2021a, 2022b) is used to aggregate reliable correspondences among this large set of candidates. All previous methods significantly increase the number of correlation candidates, which makes the refinement operations computationally intensive, especially when the input point clouds contain a large number of points. For this reason, some approaches, such as Wang et al. (2021a, 2022a, 2022b), limit the scene flow estimation to a quarter of the input resolution to avoid further computation. Compared to Wang et al. (2021a, 2022a, 2022b), we design four stages in our flow embedding, ending with two stages of attention for the final correlation refinement in a large receptive field. This also allows us to obtain the scene flow at exactly the same resolution as the input points.

RMS-FlowNet (Battrawy et al., 2022): Our preliminary network proposed a trade-off between efficiency and accuracy by replacing the FPS sampling technique with the computationally cheap RS technique. In RMS-FlowNet, we proposed a novel flow embedding design, called Patch-to-Dilated-Patch, with three embedding steps to solve the challenges of using RS. In addition, we significantly optimized the KNN-based correspondence search by using the Nanoflann framework (Blanco and Rai, 2014), which further increases the efficiency. Compared to RMS-FlowNet (Battrawy et al., 2022), we change the architectural design in our RMS-FlowNet++ to speed up the feature extraction module by eliminating the upsampling part (i.e., decoder) (cf. Fig. 4) and the dense layers at the full input resolution in the encoder part. In terms of accuracy, we add another search step in the flow embedding based on the feature space, and we also improve the pairwise correspondence search at the coarse scale under a bidirectional constraint. With these improvements, our RMS-FlowNet++ becomes much more accurate while still showing high efficiency.

3 Network Design

Our RMS-FlowNet++ estimates scene flow as translational vectors from consecutive frames of point clouds (e.g., from LiDAR or RGB-D sensors), with no assumptions about object rigidity or direct estimation of sensor motion within the environment (i.e., no direct estimation of ego-motion).

Fig. 4 Our network design consists of feature extraction, flow embedding, warping layers, and scene flow heads, similar to our previous work RMS-FlowNet (Battrawy et al., 2022). Compared to the feature extraction module in RMS-FlowNet, which consists of fully connected layers (FC) at full input resolution, encoder and decoder modules (a), we omit (FC) and the decoder in our RMS-FlowNet++ (b)

Given Cartesian 3D point cloud frames \({P^{t}=\{p^t_{i}\in \mathbb {R}^3\}}^N_{i=1}\) and \({Q^{t+1}=\{q^{t+1}_{j}\in \mathbb {R}^3\}}^M_{j=1}\) at timestamps t and \(t+1\), respectively, our goal is to estimate point-wise 3D flow vectors \({S^t =\{s^t_{i}\in \mathbb {R}^3\}}^N_{i=1}\) for each point within the reference frame \(P^{t}\) (i.e., \(s^t_{i}\) is the motion vector for \(p^t_{i}\)). The sizes (N, M) of the two frames do not have to be identical, and the two frames need not contain exact point-to-point correspondences. Our network is designed to estimate scene flow at multi-scale levels through hierarchical feature extraction using a novel design of flow embedding, called Patch-to-Dilated-Patch, with warping layers and scene flow estimation heads.

The components of each module are described in detail in the following sections.

3.1 Feature Extraction Module

The feature extraction module consists of two pyramid networks with shared parameters for the hierarchical extraction of two feature sets from \(P^t\) and \(Q^{t+1}\). Unlike our previous work in RMS-FlowNet (Battrawy et al., 2022), the design of this module includes only the encoder parts, while no decoder and no transposed convolutions are required to upsample the extracted features to the full resolution, as shown in Fig. 4.

The encoder part computes a hierarchy of features at four scales \(\{l_k\}^3_{k=0}\) from fine-to-coarse resolution, where \(l_0\) is the full resolution of \(P^{t}\) and \(Q^{t+1}\) and the resolutions of the downsampled scales are fixed to \({\{\{l_k\}}^3_{k=1}\mid l_1 = 2048, l_2 = 512, l_3 = 128\}\) during training, but are kept adaptive at higher point densities (cf. Sect. 4.5). Each scale is essentially composed of two layers, where Local-Feature-Aggregation (LFA) is applied to aggregate the features at the \(l_{k}\) scale, followed by Downsampling (DS) to aggregate the features from the \(l_{k}\) level to \(l_{k+1}\), resulting in a decrease in resolution. Inspired by RandLA-Net (Hu et al., 2020), which focuses only on semantic segmentation, we use the feature aggregation layer of LFA, which consists of three neural units: (1) local spatial encoding to encode the geometric and relative position features, (2) attentive pooling to aggregate the set of neighbor features, and (3) a dilated residual block.

To apply LFA, we search for the number of nearest neighbors (\(K_p\)) at all scales using a KNN search in Euclidean space and aggregate the features with two attentive pooling layers designed as in Hu et al. (2020), where the attentive pooling unit is based on the mechanism of self-attention (Yang et al., 2020; Zhang and Xiao, 2019). DS samples the points to the defined resolution of layer \(l_{k+1}\) and aggregates the nearest neighbors (\(K_p\)) from the higher resolution \(l_{k}\) by max-pooling. During training and evaluation with RS, \(K_{p}\) is set to 20 in all layers and changed to 16 during evaluation with FPS.

The feature extraction module outputs two feature sets over all scales \(\{F_{k}^{t}\in \mathbb {R}^{c_k}\}^3_{k=0}\) and \(\{F_{k}^{t+1}\in \mathbb {R}^{c_k}\}^3_{k=0}\) for \(\{P_{k}^{t}\in \mathbb {R}^{3}\}^3_{k=0}\) and \(\{Q_{k}^{t+1}\in \mathbb {R}^{3}\}^3_{k=0}\), respectively. Here, \(c_k\) is the feature dimension, which is fixed as \({\{\{c\}}^3_{k=0}\mid c_0 = 32, c_1 = 128, c_2 = 256, c_3 = 512\}\). The feature extraction module of our RMS-FlowNet++ is shown in Fig. 4b compared to our preliminary design in Fig. 4a. The design of LFA and DS allows the use of RS but still requires well-designed flow embedding to ensure robust scene flow.
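
The following PyTorch-style sketch illustrates the sampling logic described above: Random-Sampling to the next resolution, a KNN lookup back into the finer scale, and max-pooling of the grouped features. It is a minimal stand-in for the RandLA-Net-based LFA/DS layers; helper names and shapes are chosen only for illustration.

```python
import torch

def knn_indices(query, support, k):
    """Indices of the k nearest neighbors of each query point in support (Euclidean)."""
    dist = torch.cdist(query, support)                 # (B, Nq, Ns)
    return dist.topk(k, dim=-1, largest=False).indices

def random_sample(points, feats, n_out):
    """Random-Sampling (RS): keep n_out randomly chosen points and their features."""
    idx = torch.randperm(points.shape[1], device=points.device)[:n_out]
    return points[:, idx], feats[:, idx]

def downsample(points, feats, n_out, k=20):
    """DS layer sketch: sample to the next resolution and max-pool the features
    of the k nearest neighbors from the finer scale (k=20 with RS, 16 with FPS)."""
    sub_points, _ = random_sample(points, feats, n_out)
    idx = knn_indices(sub_points, points, k)                             # (B, n_out, k)
    gathered = torch.stack([feats[b, idx[b]] for b in range(points.shape[0])])
    return sub_points, gathered.max(dim=2).values                        # (B, n_out, C)

# Toy usage: one batch, 8192 points with 32-dim features, sampled to 2048 points
pts, f = torch.rand(1, 8192, 3), torch.rand(1, 8192, 32)
p1, f1 = downsample(pts, f, 2048)
print(p1.shape, f1.shape)   # torch.Size([1, 2048, 3]) torch.Size([1, 2048, 32])
```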

3.2 Flow Embedding

A flow embedding module across consecutive frames is the key component for correlating the adjacent frames of point clouds, where finding reliable correlations is extremely important for encoding 3D motion. In this context, previous state-of-the-art methods combine: (1) grouping of correspondences from \(Q^{t+1}\), (2) robust aggregation of correspondence features into \(P^{t}\), and (3) refinement of the flow embedding.

Fig. 5 Our novel Flow-Embedding (FE) module consists of four main steps and yields the scene flow feature \(sf_i^{t}\): Two maximum embedding layers based on both Euclidean and feature space followed by two attentive embedding layers. Lateral connections are also used: A Concatenation (Concat.) between the first two embeddings and a residual connection (ResConn.)

All point-wise learning-based methods take advantage of FPS (as explained in Fig. 2) to sample the consecutive frames and rely on finding large correspondence sets (\(\geqslant 32\) matches) in the sampled \(Q^{t+1}\) based on KNN, as in Wang et al. (2021a), Wu et al. (2020), Cheng and Ko (2022), or even more in the all-to-all correlations of Wang et al. (2022a). While FPS can generate similar patterns across consecutive scenes to facilitate obtaining strong match pairs, finding large correspondence sets can increase the likelihood of correlating distant matches (i.e., for large displacements) (Wang et al., 2022b, 2022a). In addition, some works add refinement at the fine resolution of the input points (Wu et al., 2020; Cheng and Ko, 2022). Taken together, this can increase accuracy, but it reduces the overall efficiency of these methods and limits their ability to handle high point densities. Furthermore, considering large correspondence sets greatly increases the possibility of aggregating unreliable correlations, leading to inaccurate estimates.

To address these issues, we develop a special flow embedding module that has two advantages over current point-based methods: First, a smaller correspondence set is used without the need for flow embedding at full resolution, and second, the use of RS becomes possible. As a result, we speed up our model and make it amenable to RS, as shown by our results in Sect. 4, which allows higher point densities with low memory requirements (cf. Fig. 7). We must recall that the use of RS is more challenging than FPS (cf. Fig. 2) for two reasons: First, regions with low local point density are underrepresented when using RS. Second, the sampling patterns for corresponding regions are less correlated across frames.

Our novel and efficient flow embedding, called Patch-to-Dilated-Patch, aggregates large correspondence sets without increasing the physical number of the nearest neighbors. This is basically the same design as in our previous work RMS-FlowNet (Battrawy et al., 2022), but we apply some changes that are outlined in Table 4. In this context, we search for correspondences not only in Euclidean space, but also in feature space, and we add another embedding step.

Matches Search: Grouping strong correspondences is the first step in any flow embedding. Many state-of-the-art methods search for the set of matches based on Euclidean space, but apply soft weights in different ways. Since the grouping of correlations based on Euclidean space may not be sufficient to capture distant matches, we use the feature space to find reliable matches at the coarse scale (last down-sampled layer) \(l_{3}\).

Point-to-Point Bidirectional Map: For this reason, we compute a simple cosine similarity matrix based on the feature space to find match pairs under a bidirectional constraint, which yields a point-to-point (i.e., one-to-one) correlation map. A point \(q^{t+1}_{j}\) in \(Q^{t+1}\) is accepted as a true match of \(p^t_{i}\) in \(P^{t}\) only if the highest similarity score holds in both directions; otherwise, the search for matches is done in Euclidean space. Finding robust matches at the coarse scale leads to a high-quality initial estimate of the scene flow at scale \(l_{3}\). This also approximates the distant matches at the upper scales using the warping layer, so that \(p^t_{i}\) is close to its match in \(Q^{t+1}\).
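
A minimal sketch of this mutual best-match test on the coarse features is given below; tensor shapes are illustrative, and the fallback to the Euclidean search is only indicated by the returned mask.

```python
import torch
import torch.nn.functional as F

def bidirectional_matches(feat_p, feat_q):
    """Mutual best matches between coarse features of P^t (N, C) and Q^{t+1} (M, C).

    Returns, for each point in P^t, the index of its best match in Q^{t+1} and a mask
    that is True only where the highest cosine similarity holds in both directions.
    Points with mask == False would fall back to the Euclidean KNN search.
    """
    sim = F.normalize(feat_p, dim=-1) @ F.normalize(feat_q, dim=-1).T   # (N, M) cosine similarity
    best_q_for_p = sim.argmax(dim=1)          # P -> Q direction
    best_p_for_q = sim.argmax(dim=0)          # Q -> P direction
    mutual = best_p_for_q[best_q_for_p] == torch.arange(feat_p.shape[0], device=feat_p.device)
    return best_q_for_p, mutual
```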

Patch-to-Point Search: For the upper scales \(\{l\}^2_{k=1}\), computing the cosine similarity is not worthwhile, since it is difficult to obtain distinctive features in a one-to-one manner at high point densities. Instead, we search for the nearest matches of \(p^t_{i}\) within \(Q^{t+1}\) based on the Euclidean space, denoted by \(\mathcal {N}_Q(p^t_{i})\).

Graph Representation: After finding the likelihood correspondences, we construct the correlations in a graph form \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), where \(\mathcal {V}\) and \(\mathcal {E}\) are the vertices and edges, respectively. Then, we apply multi-layer perceptron (MLP):

$$\begin{aligned} h^{\mathcal {E}}_{\Theta }(v_i) = MLP(\{[v_i, v_j-v_i] \mid (i,j) \in \mathcal {E}\}) \end{aligned}$$
(1)

where [., .] denotes concatenation, \(v_i\) is a central vertex feature, and \(v_j-v_i\) denotes the edge features. This representation is compared in Table 4 with the original form in our preliminary work RMS-FlowNet (Battrawy et al., 2022), which omits the \(v_{i}\) part in Eq. (1) and keeps only the edge features.
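
A compact PyTorch sketch of Eq. (1) is shown below; the MLP width is a placeholder, and the neighbor features are assumed to be gathered beforehand (e.g., with the KNN indices reused from the feature extraction module).

```python
import torch
import torch.nn as nn

class EdgeEmbedding(nn.Module):
    """h_Theta(v_i) from Eq. (1): an MLP over [v_i, v_j - v_i] for each neighbor j."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, v_i, v_neighbors):
        # v_i: (N, C) central features; v_neighbors: (N, K, C) grouped neighbor features
        center = v_i.unsqueeze(1).expand_as(v_neighbors)
        edge = torch.cat([center, v_neighbors - center], dim=-1)   # (N, K, 2C)
        return self.mlp(edge)                                      # (N, K, out_dim)
```

Max-pooling this output over the K neighbors then yields the Patch-to-Point embedding of Eq. (2).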

Flow Embedding Steps: Having found the number of possible matches (i.e., \(\mathcal {N}_Q(p^t_{i})\)) within the representation in the form described above, we apply the flow embedding aggregation steps at each scale \(l_k\) for every \(\textrm{i}^{\textrm{th}}\) element within the downsampled scales \(\{l\}^3_{k=1}\), except the full-resolution scale \(l_{0}\), as follows:

  • \(1^{st}\) Embedding (Patch-to-Point): We first apply max-pooling to the output of Eq. (1) and obtain \(e_{1i}^{t}\) as shown in the following equation:

    $$\begin{aligned} e_{1i}^{t} = \underset{{f_{j}^{t+1}\in \mathcal {N}_Q (p_{i}^{t})}}{MAX}(h^{\mathcal {N}_Q(p_{i}^{t})}_{\Theta }(f_i^{t})) \end{aligned}$$
    (2)
  • \(2^{nd}\) Embedding (Patch-to-Point): Compared to RMS-FlowNet, we add another embedding step to group correspondences that are semantically similar by applying KNN in feature space, inspired by the backbone of DGCNN (Wang et al., 2019). For this purpose, we group the number of nearest neighbors \(\mathcal {N}_E(e_{1i}^{t})\) for each output element of Eq. (2) based on the feature space and apply the graph form, where \(\mathcal {N}_E\) denotes the neighboring features of \(e_{1i}^{t}\). Next, we apply max-pooling to obtain \(\hat{e}_{2i}^{t}\), which is then channel-wise concatenated with \(e_{1i}^{t}\), followed by a multi-layer perceptron (MLP) to obtain \(e_{2i}^{t}\):

    $$\begin{aligned} \hat{e}_{2i}^{t}&= \underset{{e_{1j}^{t}\in \mathcal {N}_E (e_{1i}^{t})}}{MAX}(h^{\mathcal {N}_E (e_{1i}^{t})}_{\Theta }(e_{1i}^{t})), \end{aligned}$$
    (3)
    $$\begin{aligned} e_{2i}^{t}&= MLP([e_{1i}^{t}, \hat{e}_{2i}^{t}]) \end{aligned}$$
    (4)
  • \(3^{rd}\) Embedding (Point-to-Patch): Using channel-wise concatenation, we combine the feature \(f_i^t\) of \(p_i^t\) with the output of Eq. (4) on the coarse scale \(l_3\), and further combine the upsampled scene flow feature \(sf_{i}^{t}\) (computed by Eq. (10)) and the upsampled scene flow \(s_{i}^{t}\) (computed by the scene flow head in Sect. 3.3) on the upper scales as follows:

    $$\begin{aligned} \hat{f}_{i}^{t} = [f_{i}^{t}, e_{2i}^{t}, sf_{i}^{t}, s_{i}^{t}] \end{aligned}$$
    (5)

    Then, we group the nearest features \(\hat{f}_{i}^{t}\) based on the Euclidean search \(\mathcal {N}_P(p_i^t)\) (\(p_i^t\) is the 3D spatial location of \(\hat{f}_{i}^{t}\)), then compute the attention weights \(w_{1i}^{t}\) and sum the weighted features to obtain \(e_{3i}^{t}\):

    $$\begin{aligned} w_{1i}^{t}&= g(\hat{f}_{i}^{t}, {{\varvec{W}}}), \end{aligned}$$
    (6)
    $$\begin{aligned} e_{3i}^{t}&= \sum _{n=1}^{K_{p}} (\hat{f}_{n}^{t}\cdot w_{1n}^{t}) \end{aligned}$$
    (7)

    where g() consists of a shared MLP with trainable weights W followed by softmax. With this attention mechanism, high attention is paid to the well-correlated features, while the less correlated features are suppressed. The attention-based mechanism is generally inspired by Yang et al. (2020), Zhang and Xiao (2019).

  • \(4^{th}\) Embedding (Point-to-Dilated-Patch): It repeats the previous step on the output result \(e_{3i}^{t}\) with new attention weights \(w_{2i}^{t}\) for the nearest features based on Euclidean space. This embedding layer results in an increased receptive field embedding \(e_{4i}^{t}\):

    $$\begin{aligned} w_{2i}^{t}&= g(e_{3i}^{t}, {{\varvec{W}}}), \end{aligned}$$
    (8)
    $$\begin{aligned} e_{4i}^{t}&= \sum _{n=1}^{K_{p}} (e_{3n}^{t}\cdot w_{2n}^{t}) \end{aligned}$$
    (9)

    where g() consists of a shared MLP with trainable weights W followed by softmax. Technically, we aggregate features from a larger range by repeating the aggregation mechanism without physically increasing of the number of the nearest neighbors, inspired by Hu et al. (2020).

Finally, to improve the quality of our flow embedding, we add a residual connection (Res. Conn.), which is an element-wise summation of \(e_{2i}^{t}\) and \(e_{4i}^{t}\), resulting in the scene flow feature \(sf_{i}^{t}\) (cf. Fig. 5):

$$\begin{aligned} sf_{i}^{t} = e_{2i}^{t} + e_{4i}^{t} \end{aligned}$$
(10)

Note that for all of the above flow embedding steps, we need to group a certain number of features (\(K_p\)) and aggregate them either by max-pooling or by attention. \(K_p\) is set to 20 in all layers with RS and changed to 16 during the evaluation with FPS. We found that training with RS generalizes well to FPS without any fine-tuning (cf. Table 7). In addition, we must emphasize that the third and fourth embedding steps do not require a new KNN search because we reuse the predefined neighbors of the feature extraction module. Together, these four steps lead to our novel Patch-to-Dilated-Patch embedding, which is described in Fig. 5. In this way, we are able to obtain a larger receptive field with a small number of nearest neighbors, which is computationally more efficient.
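
For illustration, the two attentive embedding steps (Eqs. (6)–(9)) reduce to a softmax-weighted sum over the \(K_p\) pre-grouped neighbor features; the sketch below is a simplified stand-in with a placeholder scoring MLP.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """One attentive embedding step (Eqs. (6)-(7) or (8)-(9)): g() is a shared MLP
    followed by softmax over the neighbors; the output is the weighted feature sum."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, dim)      # trainable weights W of g()

    def forward(self, grouped):               # grouped: (N, K_p, C) neighbor features
        w = torch.softmax(self.score(grouped), dim=1)
        return (grouped * w).sum(dim=1)       # (N, C)

# The 3rd and 4th embeddings chain two such steps on re-grouped outputs,
# and the final flow feature follows the residual sum of Eq. (10): sf = e2 + e4.
```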

Experiments with the \(1^{st}\), \(3^{rd}\), and \(4^{th}\) flow embedding steps were performed in our previous work (Battrawy et al., 2022), and we explore our feature-based search in the additional \(2^{nd}\) embedding step in Table 4.

3.3 Multi-scale Scene Flow Estimation

Our RMS-FlowNet++ predicts scene flow at multiple scales, inspired by PointPWC-Net (Wu et al., 2020), but we consider significant changes in conjunction with RS to make our prediction more efficient. Our scene flow prediction over all scales consists of two WLs, three FEs, four scene flow estimators, and Upsampling (US) modules, as shown in Fig. 6. Compared to the design of PointPWC-Net (Wu et al., 2020) and Bi-PointFlowNet (Cheng and Ko, 2022), we save one element from each category and do not use FE at full input resolution. As a result, we speed up our model without sacrificing accuracy, as shown in Fig. 1. The multi-scale estimation starts at the coarse resolution by predicting \(S_3^t\) with a scene flow estimation module after an initial FE. The scene flow estimation head takes the resulting scene flow features in Eq. (10) and applies three layers of MLPs with 64, 32, and 3 output channels, respectively. Then, the estimated scene flow and the upcoming features from FE are upsampled to the next higher scale using one nearest neighbor based on KNN search (i.e., \(K_q=1\)). We use the same strategy to upsample the scene flow from the \(l_1\) scale to the full input resolution \(l_0\) without any additional FE.
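
A sketch of the scene flow head (three MLPs with 64, 32, and 3 output channels) and of the one-nearest-neighbor upsampling (\(K_q=1\)) is given below; the input feature width, the activations, and the helper name `upsample_nn` are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

scene_flow_head = nn.Sequential(       # three MLP layers with 64, 32, and 3 channels
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),
)

def upsample_nn(coarse_pts, fine_pts, coarse_values):
    """Upsample per-point values from a coarse scale to a finer scale using the
    single nearest coarse neighbor of each fine point (K_q = 1)."""
    idx = torch.cdist(fine_pts, coarse_pts).argmin(dim=-1)     # (N_fine,)
    return coarse_values[idx]

# Toy usage: flow predicted at l_3 (128 points) lifted to l_2 (512 points)
p3, p2 = torch.rand(128, 3), torch.rand(512, 3)
s3 = scene_flow_head(torch.rand(128, 512))                     # (128, 3) flow at l_3
s2_init = upsample_nn(p3, p2, s3)                              # (512, 3) initial flow at l_2
```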

Fig. 6 Multi-scale scene flow prediction with three Flow-Embedding (FE) modules (each consists of four steps), two Warping-Layers (WLs), four scene flow estimators and Upsampling (US) layers

Our Warping-Layer uses the upsampled scene flow \(S_k^t\) at scale level \(l_k\) to warp \(P_k^{t}\) toward \(Q_k^{t+1}\) to obtain \({\widetilde{P}}_k^{t+1}\). This forward warping process does not require any further KNN search because the predicted scene flow is associated with \(P_k^{t}\). This is more efficient compared to PointPWC-Net Wu et al. (2020) or Bi-PointFlowNet (Cheng and Ko, 2022), which must first associate the scene flow with \(Q_k^{t+1}\) using KNN search in order to warp \(Q_k^{t+1}\) to \(P_k^{t}\) in the backward direction.
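
Since the predicted flow is anchored at \(P_k^{t}\), the forward warping reduces to a single addition, which is exactly why no additional KNN association is needed; a one-line sketch:

```python
def warp_forward(P_k, S_k):
    """Warping-Layer sketch: the flow S_k is defined on P_k, so warping is a plain addition."""
    return P_k + S_k   # \tilde{P}_k^{t+1}, compared against Q_k^{t+1} in the next FE
```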

3.4 Loss Function

The model is fully supervised at multiple scales, similar to PointPWC-Net (Wu et al., 2020). If \(S_k^t\) is the predicted scene flow and the ground truth is \(S_{GT,k}^t\) at scale \(l_k\), then the loss can be written as follows:

$$\begin{aligned} \mathcal {L}(\theta ) = \sum _{k=0}^{3} {\alpha }_k \sum _{i=1}^{l_k} {\Vert s_{ki}^t - s_{GT,ki}^t \Vert }_2, \end{aligned}$$
(11)

where \({\Vert .\Vert }_2\) denotes the \(L_2\)-norm and the weights per level are \(\{\{{\alpha }_k\}^3_{k=0} \mid {\alpha }_0 = 0.02, {\alpha }_1 =0.04, {\alpha }_2 = 0.08, {\alpha }_3 = 0.16\}\).
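
A minimal PyTorch sketch of Eq. (11), assuming the ground truth flow has been subsampled to match the prediction at each scale:

```python
import torch

ALPHAS = {0: 0.02, 1: 0.04, 2: 0.08, 3: 0.16}   # per-level weights from Eq. (11)

def multi_scale_loss(pred_flows, gt_flows):
    """Eq. (11): weighted sum over scales of the per-point L2 flow error.

    pred_flows, gt_flows: dicts mapping scale index k -> (N_k, 3) tensors.
    """
    loss = 0.0
    for k, alpha in ALPHAS.items():
        loss = loss + alpha * torch.norm(pred_flows[k] - gt_flows[k], dim=-1).sum()
    return loss
```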

4 Experiments

We conduct several experiments to evaluate the results of our RMS-FlowNet++ for scene flow estimation. First, we demonstrate its accuracy and efficiency compared to the state-of-the-art methods. Second, we verify our design choice with several analyses.

4.1 Evaluation Metrics

For a fair comparison, we use the same evaluation metrics as in Liu et al. (2019), Wu et al. (2020), Wang et al. (2021a), Gu et al. (2019), Puy et al. (2020), Kittenplon et al. (2021), Wei et al. (2021), Gu et al. (2022), Wang et al. (2022b), Cheng and Ko (2022), Wang et al. (2022a). Let \(S^t\) denote the predicted scene flow and \(S_{GT}^t\) the ground truth scene flow. The evaluation metrics are averaged over all points and computed as follows:

  • EPE3D [m]: The 3D end-point error computed in meters as \({\Vert S^t-S_{GT}^t\Vert }_2\).

  • Acc3DS [%]: The strict 3D accuracy which is the ratio of points whose EPE3D \(< 0.05~m\) or relative error \(< 5\%\).

  • Acc3DR [%]: The relaxed 3D accuracy which is the ratio of points whose EPE3D \(< 0.1~m\) or relative error \(< 10\%\).

If a metric is subscripted with “\(_{\textrm{noc}}\)”, only the non-occluded points are evaluated, otherwise all input points are considered.
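
The following NumPy sketch computes these three metrics for a set of evaluated points; the optional mask implements the \(_{\textrm{noc}}\) variants by restricting the evaluation to non-occluded points.

```python
import numpy as np

def scene_flow_metrics(pred, gt, mask=None):
    """EPE3D [m], Acc3DS [%], Acc3DR [%] as defined above.

    pred, gt : (N, 3) predicted and ground-truth flow vectors. mask optionally selects
    the evaluated points (e.g., non-occluded points for the "_noc" variants).
    """
    if mask is not None:
        pred, gt = pred[mask], gt[mask]
    epe = np.linalg.norm(pred - gt, axis=-1)
    rel = epe / (np.linalg.norm(gt, axis=-1) + 1e-8)     # relative error w.r.t. GT magnitude
    acc3ds = np.mean((epe < 0.05) | (rel < 0.05)) * 100.0
    acc3dr = np.mean((epe < 0.1) | (rel < 0.1)) * 100.0
    return epe.mean(), acc3ds, acc3dr
```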

4.2 Data Sets and Preprocessing

As with state-of-the-art methods, we use the original version \(\textrm{FT3D}_{\textrm{o}}\) and the subset version \(\textrm{FT3D}_{\textrm{s}}\) of the established large-scale synthetic data set FlyingThings3D (FT3D) (Mayer et al., 2016). The subset version \(\textrm{FT3D}_{\textrm{s}}\) differs from the original \(\textrm{FT3D}_{\textrm{o}}\) by excluding some frames from the original and adding more labels. This data set provides the ground truth for scene flow in the form of disparity maps for consecutive frames, disparity changes between consecutive frames, and optical flow components, from which the 3D translation vectors of the ground truth scene flow can be computed. In addition, the \(\textrm{FT3D}_{\textrm{s}}\) subset provides occlusion maps based on disparity as well as on future and past motion.

In contrast to FT3D, the KITTI data set (Menze and Geiger, 2015) is a small data set with optical flow labels that consists of real outdoor scenes for autonomous driving applications and provides sparse disparity maps generated by a LiDAR sensor. The given second disparity map at timestamp \(t+1\) has been aligned with the first frame at timestamp t, allowing the computation of 3D translation vectors for scene flow.

Point Clouds Generation: Since the data sets and labels do not provide a direct point cloud representation (i.e., 3D Cartesian locations), the state-of-the-art methods (Liu et al., 2019; Wu et al., 2020; Wang et al., 2021a; Gu et al., 2019; Puy et al., 2020; Kittenplon et al., 2021; Wei et al., 2021; Gu et al., 2022; Wang et al., 2022b; Cheng and Ko, 2022; Wang et al., 2022a) generate 3D point cloud scenes for their models using the calibration parameters given in the data sets. The generated point clouds are randomly subsampled to be evaluated at a certain resolution (e.g., 8192 points) and shuffled to dissolve correlations between consecutive point clouds. For this, we use the preprocessing strategies of the pioneering works FlowNet3D (Liu et al., 2019) and HPLFlowNet (Gu et al., 2019), the latter of which yields non-occluded and exact correlations between the scenes. We also use other preprocessing strategies to ensure that there are no exact correlations between consecutive point clouds. To do this, we use the given consecutive disparity maps in \(\textrm{FT3D}_{\textrm{s}}\), and for the KITTI data set, we use the de-warped disparity maps of the second frame at timestamp \(t+1\) generated by Battrawy et al. (2019), Rishav et al. (2020).
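
As an illustration of this point cloud generation step, the sketch below back-projects a disparity map into 3D points with a known stereo calibration and then randomly subsamples the result to a fixed resolution; focal length, baseline, and principal point are placeholders.

```python
import numpy as np

def disparity_to_points(disp, f, b, cx, cy):
    """Back-project a disparity map (H, W) into 3D points, skipping invalid pixels."""
    v, u = np.nonzero(disp > 0)            # pixel coordinates with valid disparity
    z = f * b / disp[v, u]                 # depth from stereo geometry
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=-1)    # (N, 3) point cloud

def subsample(points, n=8192, seed=0):
    """Random subsampling and shuffling to a fixed resolution, as described above."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]
```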

All of the above preprocessing mechanisms differ in how the second point cloud \(Q^{t+1}\) is generated, which either results in exact correlations or not, and whether occluded points are considered or not. This is summarized in Table 1 and described in more detail below.

Preprocessing in HPLFlowNet (Gu et al., 2019): This preprocessing considers the complete FlyingThings3D subset (\(\hbox {FT3D}_{\textrm{s}}\)), which consists of 19,640 labeled scenes available in the training set and all 3,824 frames available in the test split for evaluation. Unlike the FlowNet3D preprocessing (Liu et al., 2019), this preprocessing removes all the occluded points using the occlusion maps provided in \(\hbox {FT3D}_{\textrm{s}}\), and the second point cloud frame \(Q^{t+1}\) is generated from the disparity change and optical flow labels. For the KITTI data set, the 142 labeled scenes of the training split available in the raw KITTI data are preprocessed. The second frame of the point cloud \(Q^{t+1}\) is generated from the second disparity map by warping in 3D space, but without occlusion handling. The data generated from KITTI by this preprocessing is referred to as \(\textrm{KITTI}_{\textrm{s}}\). The preprocessing in HPLFlowNet results in an exact correlation between the consecutive point clouds, and occluded points are not taken into account.

Preprocessing in FlowNet3D (Liu et al., 2019): Here, the original version of FlyingThings3D (\(\hbox {FT3D}_{\textrm{o}}\)) is used, with 20,000 images from the training set and 2,000 images from the test set randomly selected for training and evaluation, respectively. During preprocessing, many occluded points are included in the data and an occlusion mask is computed for \(P^{t}\), since there are no predefined occlusion maps in \(\hbox {FT3D}_{\textrm{o}}\). The frames of the point clouds \(P^{t}\) and \(Q^{t+1}\) are generated directly from the consecutive disparity maps and there are no exact correlations between the consecutive scenes. For the KITTI data set, this preprocessing considers 150 frames with occlusions, but does not compute an occlusion mask. The second frame of the point cloud \(Q^{t+1}\) is generated using the second disparity map by a warping process in 3D space with occlusion handling. The data generated from KITTI by this preprocessing is referred to as \(\textrm{KITTI}_{\textrm{o}}\).

Preprocessing with Occlusion Masks: In contrast to the preprocessing in HPLFlowNet (Gu et al., 2019), we generate both point clouds (\(P^{t}\) and \(Q^{t+1}\)) directly from the consecutive disparity maps of the \(\hbox {FT3D}_{\textrm{s}}\), resulting in very low correlations and existing occlusions in the scenes. By using the occlusion maps for disparity change and optical flow of consecutive scenes in forward and backward direction provided by \(\hbox {FT3D}_{\textrm{s}}\), we omit most of the occlusions in consecutive frames, leaving very few occluded points in all frames. These remaining occlusions are due to imperfections in the occlusion masks, and are referred to as partial occlusions. We also generate the same data without filtering any of the occlusions. This version is referred to as large occlusions. The preprocessed data from \(\hbox {FT3D}_{\textrm{s}}\) with partial or large occlusions are referred to as \(\textrm{FT3D}_{\textrm{so}}\).

To generate decorrelated points in KITTI, where the given disparity maps of \(t+1\) are aligned to the reference view at timestamp t, we use the preprocessing mechanism proposed in Battrawy et al. (2019), Rishav et al. (2020). In this preprocessing, the ground truth of the optical flow is used to generate \(Q^{t+1}\) through a pixel-by-pixel de-warping process for each disparity map of \(t+1\) aligned with the reference view, which largely dissolves the correlations between the point cloud scenes. We consider the 142 labeled scenes of the training split available in the raw KITTI data for preprocessing. The de-warped disparity maps can be downloaded directly from the source code of DeepLiDARFlow (Rishav et al., 2020) and are used to compute the point clouds for both frames (i.e., \(P^{t}\) and \(Q^{t+1}\)). Given the occlusion maps in KITTI, we can either omit or include occluded points to create partial or large occlusions. The data generated from KITTI by this preprocessing is referred to as \(\textrm{KITTI}_{\textrm{d}}\).

Table 1 The preprocessing mechanisms of the data sets differ in how the second point cloud \(Q^{t+1}\) is generated and whether occluded points are considered or not
Table 2 When training with non-occluded data on \(\textrm{FT3D}_{\textrm{s}}\), we evaluate 8192 points in non-occluded scenes from \(\textrm{FT3D}_{\textrm{s}}\) (Mayer et al., 2016) and \(\textrm{KITTI}_{\textrm{s}}\) (Menze and Geiger, 2015), and with partial and large occlusions from \(\textrm{FT3D}_{\textrm{so}}\) and \(\textrm{KITTI}_{\textrm{d}}\)

4.3 Implementation, Training and Augmentation

As in related work, we train our model twice: once with non-occluded data from \(\textrm{FT3D}_{\textrm{s}}\), considering all frames in the train split of \(\hbox {FT3D}_{\textrm{s}}\) (Mayer et al., 2016), and a second time with \(\textrm{FT3D}_{\textrm{o}}\), containing 20,000 frames with largely occluded points. During training, the preprocessed data is randomly subsampled to 8192 points, where the order of the points is random and the correlation between consecutive frames is dissolved by the random selection. Following related work, we remove points with depths greater than 35 meters, which retains the majority of the moving objects.

We use the Adam optimizer with default parameters and train the final version of our model with RS for 1260 epochs. The final model generalizes well with both RS and FPS sampling methods, and no further training with FPS is required (cf. Table 7). However, to speed up some experiments, we also train with FPS for 420 epochs, which converges faster than the training with RS. When we report the results of the model trained with FPS, we highlight (\(^*\)) next to FPS to distinguish it from the final model trained with RS.

We apply an exponentially decaying learning rate that is initialized at 0.001 and then decreases at a decaying rate of 0.8 every 20 epochs when training with FPS and every 60 epochs when training with RS.
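
A hedged sketch of this training configuration in PyTorch is shown below; the model is a placeholder, and the step-wise decay reflects our reading of the described schedule (a factor of 0.8 every 60 epochs with RS, every 20 epochs with FPS).

```python
import torch

model = torch.nn.Linear(3, 3)                     # placeholder for the full RMS-FlowNet++ model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Decay the learning rate by 0.8 every 60 epochs when training with RS (every 20 with FPS)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.8)

for epoch in range(1260):
    # ... one training epoch over FT3D_s with 8192 randomly sampled points ...
    scheduler.step()
```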

We add two types of augmentation: First, we add geometric augmentation, i.e., points are randomly rotated by a small angle around the X, Y, and Z axes, and a random translational offset is added, which increases the ability of our model to generalize to KITTI (Menze and Geiger, 2015) without fine-tuning. Second, when training with non-occluded data from \(\hbox {FT3D}_{\textrm{s}}\), each high-resolution frame is randomly sampled to 8192 points differently in each epoch. However, we do not apply this type of augmentation with \(\hbox {FT3D}_{\textrm{o}}\), because the data processed with the established preprocessing strategy of FlowNet3D (Liu et al., 2019) contains only 8192 points. The augmentation increases the ability of our model to generalize to the KITTI data set without fine-tuning (cf. Table 8).
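
A sketch of the geometric augmentation is given below; the rotation and translation ranges are assumptions for illustration only, not the exact values used in our training.

```python
import numpy as np

def augment(p, q, flow, max_angle=0.02, max_shift=0.2, seed=None):
    """Apply one random rigid perturbation to both frames and the ground-truth flow.

    max_angle (rad) and max_shift (m) are illustrative ranges, not the trained values.
    """
    rng = np.random.default_rng(seed)
    ax, ay, az = rng.uniform(-max_angle, max_angle, size=3)
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    R, t = Rz @ Ry @ Rx, rng.uniform(-max_shift, max_shift, size=3)
    p_aug, q_aug = p @ R.T + t, q @ R.T + t
    flow_aug = (p + flow) @ R.T + t - p_aug    # the flow rotates with the rigid transform
    return p_aug, q_aug, flow_aug
```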

Table 3 Occluded points are taken into account during training and inference

4.4 Comparison to State-of-the-Art

To demonstrate the accuracy and generalization of our model, we compare it with state-of-the-art methods in Table 2. The white cells denote the evaluation on the non-occluded \(\textrm{FT3D}_{\textrm{s}}\) as usually done in related work. The results within the light and dark gray cells denote the evaluation with partially and extensively occluded scenes of \(\textrm{FT3D}_{\textrm{so}}\). Our RMS-FlowNet++ allows the use of RS and shows very comparable results to the use of FPS, but with lower runtime (cf. Fig. 1), especially for higher resolution points (cf. Fig. 7).

Evaluation on \(\textbf{FT3D}_{\textbf{s}}\): We test our RMS-FlowNet++ on non-occluded data from \(\hbox {FT3D}_{\textrm{s}}\), as shown in the white cells in Table 2. Processing the entire points with an all-to-all correlation (i.e., global correlation) using an optimal transport solver in FLOT (Puy et al., 2020) shows significantly lower accuracy than the hierarchical mechanism of our RMS-FlowNet++ using RS. This confirms our decision to design our model in a hierarchical way. The RS version of our RMS-FlowNet++ significantly outperforms the regular representative methods as in Gu et al. (2019), Gojcic et al. (2021), Li et al. (2022). This supports our decision to handle the raw points without intermediate representations for scene flow estimation. Furthermore, our RMS-FlowNet++ with RS outperforms GRU-based methods Wei et al. (2021), Kittenplon et al. (2021), Gu et al. (2022) with a lower runtime (cf. Fig. 1).

Compared to hierarchical designs that basically use FPS, our RMS-FlowNet++ with RS outperforms (Liu et al., 2019; Wu et al., 2020; Li et al., 2021; Wang et al., 2021a) on all metrics and is highly competitive with very recent methods (Cheng and Ko, 2022; Wang et al., 2022a, b) at lower runtime as shown in Fig. 1. However, the FPS version of our RMS-FlowNet++ outperforms the methods of Cheng and Ko (2022), Wang et al. (2022b) and shows very comparable results to Wang et al. (2022a) with slight differences.

Moreover, our improvements in RMS-FlowNet++ are significant in both RS and FPS sampling versions, even for a small number of correspondences (i.e., \(K_p\) is set to 20 with RS and 16 with FPS), compared to our preliminary work RMS-FlowNet (Battrawy et al., 2022), which uses a correspondence set of 33 points.

Generalization to \(\textbf{KITTI}_{\textbf{s}}\): We test the generalization ability to the KITTI data set (Menze and Geiger, 2015) without fine-tuning, as shown in Table 2, where the white cells denote the scores on \(\textrm{KITTI}_{\textrm{s}}\). Our RMS-FlowNet++ shows a stronger generalization ability with both sampling techniques, RS and FPS, than all state-of-the-art methods. This is best indicated by the much smaller gap in scores between the synthetic \(\textrm{FT3D}_{\textrm{s}}\) and the realistic \(\textrm{KITTI}_{\textrm{s}}\) results. With both sampling techniques, our RMS-FlowNet++ outperforms all the methods of Liu et al. (2019), Wu et al. (2020), Wang et al. (2021a), Gu et al. (2019), Puy et al. (2020), Kittenplon et al. (2021), Wei et al. (2021), Gu et al. (2022), Wang et al. (2022b). Compared to the competing methods in Cheng and Ko (2022), Wang et al. (2022a), RS with our RMS-FlowNet++ shows comparable results, but with lower runtime (cf. Fig. 1), and FPS outperforms these methods for all metrics.

Robustness to Occlusions: Training with non-occluded points using \(\textrm{FT3D}_{\textrm{s}}\) shows that our RMS-FlowNet++ is able to estimate a reasonable accuracy of scene flow on the test split data when evaluated on \(\textrm{FT3D}_{\textrm{so}}\), as shown in the light and dark gray cells in Table 2. When evaluated on the \(\textrm{FT3D}_{\textrm{so}}\), the scores of all input points are reported, taking into account the partial or large number of occluded points. In the evaluation of the \(\textrm{FT3D}_{\textrm{so}}\) test split with occlusions, the FPS and RS versions of our RMS-FlowNet++ take second and third place, respectively, behind the method in Wang et al. (2022a).

Fig. 7 Analysis of accuracy and runtime on \(\hbox {FT3D}_{\textrm{s}}\) for different numbers of input points compared to state-of-the-art methods

On \(\textrm{KITTI}_{\textrm{d}}\), i.e., including occluded points in the evaluation, the FPS version of our RMS-FlowNet++ shows the best results of all methods in all metrics, and the faster RS version of our method shows results comparable to the competing methods (Cheng and Ko, 2022; Wang et al., 2022a).

Table 3 shows the results on \(\textrm{FT3D}_{\textrm{o}}\) and \(\textrm{KITTI}_{\textrm{o}}\), where occluded points are additionally considered during training. For a fair comparison, we follow the evaluation scheme of the other methods and include the occlusions during inference, evaluating over all input points in the EPE3D metric. In the \(\textrm{EPE3D}_{\textrm{noc}}\), \(\textrm{Acc3DS}_{\textrm{noc}}\), and \(\textrm{Acc3DR}_{\textrm{noc}}\) metrics, we ignore the occluded points when computing the scores, but still include them as input. For both data sets, \(\textrm{FT3D}_{\textrm{o}}\) and \(\textrm{KITTI}_{\textrm{o}}\), our RMS-FlowNet++ with RS ranks right behind the competing methods (Cheng and Ko, 2022; Wang et al., 2022a), but with FPS we rank second on \(\textrm{FT3D}_{\textrm{o}}\) and first on \(\textrm{KITTI}_{\textrm{o}}\).

4.5 Varying Point Densities

We evaluate the two versions of our RMS-FlowNet++, i.e. with RS and FPS, against our earlier work Battrawy et al. (2022) and the competing methods (Cheng and Ko, 2022; Wang et al., 2022a) in terms of accuracy (Acc3DR) and runtime at different input densities. The results of this comparison are shown in Fig. 7. We consider a wide range of densities \({N = \{4096 * 2^i\}^7_{i=0}}\) of \(\textrm{FT3D}_{\textrm{s}}\), and finally all available non-occluded points are evaluated, which corresponds to \(\sim 225\)K points on average. For a fair comparison, all methods are trained with a fixed resolution of 8192 points only, and we do not consider fine-tuning or further training with different point densities. We measure the inference time for all methods equally on a Geforce GTX 1080 Ti, including our RMS-FlowNet++ and RMS-FlowNet (Battrawy et al., 2022) with RS.

For both the RS and FPS versions of our method, to keep the accuracy stable for densities \(>32\)K, we increase the resolutions of the downsampling layers (cf. Sect. 3.1) to \({\{\{l_k\}}^3_{k=1}\mid l_1 = 4096, l_2 = 1024, l_3 = 256\}\) without any further training or fine-tuning. Furthermore, for the RS version of our RMS-FlowNet++ at densities \(>131\)K, we increase the resolutions of the downsampling layers to \({\{\{l_k\}}^3_{k=1}\mid l_1=8192, l_2=2048, l_3=512\}\) without further training or fine-tuning. Increasing the resolution of the downsampling layers is not possible with FPS, as this would exceed the memory limit of the Geforce GTX 1080 Ti. Nevertheless, the accuracy of our method remains stable over a wide range of densities for both the RS and FPS versions (cf. Fig. 7). To maintain the accuracy of WM3D (Wang et al., 2022a) and to evaluate more than 8192 input points, we had to increase the resolutions of its downsampling layers based on the resolutions suggested in Wang et al. (2022a). Yet, for the competing methods WM3D (Wang et al., 2022a) and Bi-PointFlowNet (Cheng and Ko, 2022), the maximum possible densities are limited to 16384 and 32768, respectively, since they exceed the memory limit of the Geforce GTX 1080 Ti at higher densities. For the other state-of-the-art methods FLOT (Puy et al., 2020), PV-RAFT (Wei et al., 2021), PointPWC-Net (Wu et al., 2020), and HPLFlowNet (Gu et al., 2019), the maximum possible densities are limited to 8192, 8192, 32768, and 65536, respectively, for the same reason (not shown in Fig. 7).

Fig. 8

Analysis of accuracy for different depth limits on \(\textrm{FT3D}_{\textrm{s}}\) and \(\textrm{KITTI}_{\textrm{s}}\) compared to state-of-the-art methods

In contrast, our RMS-FlowNet++ allows very high densities with high accuracy without exceeding the memory limit of the Geforce GTX 1080 Ti. Although FPS is computationally expensive, the reduced number of nearest neighbors (\(K_p=16\)) allows operation with \(\sim 225\)K points. Using RS with the increased number of nearest neighbors (\(K_p=20\)) allows our RMS-FlowNet++ to operate 5 to 6 times faster than with FPS, especially at densities \(>65\)K. Consequently, the design of RMS-FlowNet++ permits a much higher maximum density than the other methods in terms of both memory requirements and time consumption. However, the runtime of our RMS-FlowNet++ still increases super-linearly for input densities \(>225\)K due to the KNN search. In addition, the onset of a drop in accuracy at \(\sim 225\)K points in Fig. 7 indicates that the resolution of the downsampling layers may need to be increased further for even higher densities.

We visually present some results on \(\textrm{KITTI}_{\textrm{s}}\) with dense points (\(\sim 50\)K points) and three examples of non-occluded points of \(\textrm{FT3D}_{\textrm{s}}\) (\(\sim 300\)K points) in Fig. 9. To obtain a denser scene in \(\textrm{KITTI}_{\textrm{s}}\), we include distant points up to 210 meters. Our RMS-FlowNet++ shows high accuracy even with this very dense data.

4.6 Varying Depth Ranges

We emphasize that all state-of-the-art methods only consider objects in the near range (\(<35\) meters) during training and evaluation. We consider the same range during training, but in this work, for the first time, we evaluate the accuracy of scene flow for more distant objects using the \(\textrm{FT3D}_{\textrm{s}}\) and \(\textrm{KITTI}_{\textrm{s}}\) data sets (cf. Fig. 8).

Fig. 9

Three examples from the non-occluded versions of \(\textrm{FT3D}_{\textrm{s}}\) and \(\textrm{KITTI}_{\textrm{s}}\) show that our RMS-FlowNet++ allows high point densities with high accuracy using RS. The scene of each example (first and third rows) visualizes \(P^t\) as green color. The error map of each scene (second and fourth rows) shows the end-point error in meters according to the color map shown in the last row (Color figure online)

Fig. 10

Two examples taken from \(\textrm{KITTI}_{\textrm{s}}\) show the impact of our RMS-FlowNet++ compared to the competing method Bi-PointFlowNet (Cheng and Ko, 2022). The scene of each example (first and fourth rows) visualizes \(P^t\) as blue color and the predicted and ground truth scene flow, added to \(P^t\), in green and red color, respectively. The error map of each scene (second, third, fifth and sixth rows) shows the end-point error in meters according to the color map shown in the last row. Our RMS-FlowNet++ shows lower errors (dark blue) over a wide area of the observed scene compared to the competing method (Color figure online)

On \(\hbox {FT3D}_{\textrm{s}}\), the accuracy of our RMS-FlowNet++ with RS and FPS is better than the competing methods for every depth limit. The accuracy of WM3D (Wang et al., 2022a) decreases significantly, and the accuracy of Bi-PointFlowNet (Cheng and Ko, 2022) is \(\sim 8\%\) lower than ours. Surprisingly, RS generalizes slightly better than FPS to increasing depth limits. When RMS-FlowNet++ is trained and evaluated with FPS (marked with \(^*\)), the results are on par with our prior work RMS-FlowNet (Battrawy et al., 2022) and the competing method Bi-PointFlowNet (Cheng and Ko, 2022). When evaluated on \(\textrm{KITTI}_{\textrm{s}}\), our RMS-FlowNet++ shows significantly better results than our prior work RMS-FlowNet (Battrawy et al., 2022). Furthermore, both sampling strategies perform significantly better than WM3D (Wang et al., 2022a). However, when trained with RS and evaluated with FPS, the scene flow accuracy exceeds that of Bi-PointFlowNet (Cheng and Ko, 2022).

It follows that training with RS generalizes better to a wider range of point distributions than training with FPS, to the extent that evaluating with FPS after RS training yields higher scene flow accuracy than training with FPS itself. In other words, FPS places the downsampled points at approximately the same spatial locations in every pass, which reduces the variation seen during training.
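This difference between the two sampling strategies can be made concrete with a small NumPy sketch (an illustration only, not the implementation used in the network): RS draws a fresh random subset on every call, while FPS greedily spreads the selected points over the scene and therefore returns nearly identical locations every time.

import numpy as np

def random_sampling(points, m, rng=None):
    """Random-Sampling (RS): O(N); the selected locations vary between
    calls, which adds variation during training."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(points), size=m, replace=False)
    return points[idx]

def farthest_point_sampling(points, m):
    """Farthest-Point-Sampling (FPS): O(N*m); greedily picks the point that
    is farthest from the already selected set, so the selected locations
    cover approximately the same spatial positions in every call."""
    n = len(points)
    selected = np.zeros(m, dtype=int)        # first index defaults to point 0
    min_dist = np.full(n, np.inf)
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        selected[i] = int(np.argmax(min_dist))
    return points[selected]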

Table 4 We explore our improvements in RMS-FlowNet++ compared to our preliminary work RMS-FlowNet (Battrawy et al., 2022)

Qualitatively, we visualize two scenes of \(\textrm{KITTI}_{\textrm{s}}\) with the corresponding error maps in Fig. 10 and compare our RMS-FlowNet++ with both sampling techniques to the most competitive method (Cheng and Ko, 2022). Without changing the training strategy, we compare the predicted scene flow for the narrowest depth range (\(<35\)m) and for the widest range (\(<210\)m) against Bi-PointFlowNet. Testing within the trained depth range (\(<35\)m), we see that Bi-PointFlowNet has higher errors on flat surfaces than our approach, which produces the best results with FPS. Testing outside the trained depth range (\(>35\)m) shows that the accuracy decreases when distant points (up to 210 m) are included. For Cheng and Ko (2022), even nearby objects (e.g. moving cars) are negatively affected when distant points (\(>35\)m) are included in the scene, whereas our RMS-FlowNet++ remains robust in this case.
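For these depth-range experiments, the point clouds are simply truncated along the depth axis before inference and evaluation. The sketch below assumes the depth is stored in the third coordinate of each point, which may differ depending on the data convention.

import numpy as np

def filter_by_depth(points, flow, max_depth):
    """Keep only points (and their flow vectors) closer than max_depth."""
    mask = points[:, 2] < max_depth
    return points[mask], flow[mask]

# Example: evaluate within the trained range and over the widest range used here.
# pts_near, flow_near = filter_by_depth(points, gt_flow, max_depth=35.0)
# pts_far,  flow_far  = filter_by_depth(points, gt_flow, max_depth=210.0)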

4.7 Ablation Study

To speed up our experiments, we verify our design decisions and the additional components in the FE by training with FPS, which converges faster than RS. We also compare RS with FPS and vary the number of KNNs to test their effect on the results. Finally, we examine the impact of our augmentation on the overall results with RS.

Design Decisions: First, we reduce the number of correlation points (\(K_p\)) from 33 in our preliminary work RMS-FlowNet (Battrawy et al., 2022) to 16 throughout the design (i.e., all scales of LFA and FE), and then verify our improvements in RMS-FlowNet++ as shown in Table 4. The accuracy of our preliminary work RMS-FlowNet decreases when the number of correlation points is reduced to 16, but it is restored by using the graph representation of Eq. (1) in our FE, which adds the \(f_{i}^{t}\) part to the original representation in RMS-FlowNet (Battrawy et al., 2022). Second, we slightly improve the results by omitting the decoder part of the feature extraction, which saves operations and upsampling layers and avoids an additional KNN search. Third, we show the positive effect of adding the \(2^{nd}\) embedding step, which is based on the feature space, to the FE of RMS-FlowNet (Battrawy et al., 2022). Then, we verify the positive effect of our similarity map (i.e., one-to-one map) based on the feature space, which finds pairs of matching points under a bidirectional constraint as explained in Sect. 3.2. Finally, we report the results of the model evaluated with FPS but trained with the RS sampling technique and its correspondence set (\(K_{p}\)). Compared to RMS-FlowNet (Battrawy et al., 2022), the reduction of \(K_{p}\) makes the method more efficient, while the sum of architectural changes improves the results.
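The one-to-one similarity map with its bidirectional constraint can be sketched as a mutual-nearest-neighbor test in feature space; the exact similarity measure follows Sect. 3.2, and cosine similarity is used below purely for illustration.

import numpy as np

def bidirectional_matches(feat_t, feat_t1):
    """Keep a pair (i, j) only if j is the best match of i in frame t+1 AND
    i is the best match of j in frame t (bidirectional constraint)."""
    a = feat_t / np.maximum(np.linalg.norm(feat_t, axis=1, keepdims=True), 1e-8)
    b = feat_t1 / np.maximum(np.linalg.norm(feat_t1, axis=1, keepdims=True), 1e-8)
    sim = a @ b.T                              # (N_t, N_t1) similarity map

    fwd = sim.argmax(axis=1)                   # best match in frame t+1 for each i
    bwd = sim.argmax(axis=0)                   # best match in frame t for each j

    i = np.arange(len(feat_t))
    mutual = bwd[fwd] == i                     # cycle-consistent pairs only
    return np.stack([i[mutual], fwd[mutual]], axis=1)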

Aspects of Attention in FE: We design our Flow-Embedding (FE) with two maximum embedding layers based on both Euclidean and feature space followed by two attentive embedding layers (see Fig. 5). Focusing on our stacked attention layers, we verify three important aspects of our stacked attention design on the \(\hbox {FT3D}_{\textrm{s}}\) data set as follows:

  1. Adding the feature of the reference frame \(f_{i}^{t}\) (as in Eq. (5)) to the input of the attention mechanism.

  2. Adding the residual connection (Res. Conn.) as shown in Fig. 5, i.e. the term \(e_{2i}^{t}\) in Eq. (10).

  3. Encoding the spatial locations into the features and concatenating them to \(\hat{f}_{i}^{t}\) in Eq. (5).

Combining all of the above components yields more accurate results, as verified in Table 5.
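The following PyTorch sketch puts these three ingredients together schematically; it is not the exact layer definition of Eqs. (5) and (10), the residual term is approximated here by a max-pooled embedding, and all channel sizes are placeholders.

import torch
import torch.nn as nn

class AttentiveEmbeddingSketch(nn.Module):
    """Schematic combination of the three aspects listed above."""

    def __init__(self, c_feat, c_pos=3):
        super().__init__()
        self.pos_enc = nn.Linear(c_pos, c_feat)      # (3) encode spatial locations
        self.score = nn.Sequential(
            nn.Linear(3 * c_feat, c_feat),
            nn.Softmax(dim=-2),                      # attention over the K correlated points
        )

    def forward(self, f_ref, f_emb, rel_pos):
        # f_ref:   (N, 1, C) feature of the reference-frame point   -> aspect (1)
        # f_emb:   (N, K, C) embeddings of its K correlated points
        # rel_pos: (N, K, 3) relative spatial locations of those points
        pos = self.pos_enc(rel_pos)                              # (3) positional encoding
        x = torch.cat([f_ref.expand_as(f_emb), f_emb, pos], dim=-1)
        weights = self.score(x)                                  # attention weights
        attended = (weights * f_emb).sum(dim=-2)                 # attentive aggregation
        return attended + f_emb.max(dim=-2).values               # (2) residual connection

# Shape check: AttentiveEmbeddingSketch(64)(torch.rand(1024, 1, 64),
#                                           torch.rand(1024, 16, 64),
#                                           torch.rand(1024, 16, 3))  -> (1024, 64)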

Table 5 We verify the aspects of our stacked attention in flow embedding on \(\hbox {FT3D}_{\textrm{s}}\) data set
Table 6 We evaluate the number of KNNs that can be used for both FPS and RS sampling techniques
Table 7 We evaluate the generalization of each sampling technique to the other on \(\textrm{FT3D}_{\textrm{s}}\) (Mayer et al., 2016) and \(\textrm{KITTI}_{\textrm{s}}\) (Menze and Geiger, 2015)
Table 8 We study the effect of augmentation on \(\textrm{FT3D}_{\textrm{s}}\) (Mayer et al., 2016) and \(\textrm{KITTI}_{\textrm{s}}\) (Menze and Geiger, 2015)

Training with FPS vs. RS: First, we train with FPS using different numbers of \(K_{p}\) and evaluate with the same numbers used in training to determine the appropriate size of the correspondence set, i.e. the \(K_{p}\) that gives the best results, as shown in Table 6. We find that small correspondence sets such as 8 and 12 yield lower accuracy than 16 and 20, which give roughly comparable results and are therefore appropriate choices, although the larger set comes at the cost of a higher number of FLOPs. Based on this, we train with RS using these two values of \(K_{p}\) (i.e., 16 and 20). After evaluation, we find that \(K_{p}=20\) works best with RS, as shown in Table 6. Consequently, we set \(K_p\) to 16 and 20 for FPS and RS, respectively. We then perform cross-evaluations with both sampling techniques to verify which sampling method generalizes better. The results are shown in Table 7: RS generalizes much better than FPS and even improves the results when evaluated with FPS, compared to the original training with FPS.

Impact of Augmentation: We examine the effect of each component of our data augmentation (cf. Sect. 4.3) individually. The results of these experiments are shown in Table 8. As mentioned before, to speed up the tests, we train with FPS and a correspondence set of 16.

When training without any augmentation, the results are good on \(\textrm{FT3D}_{\textrm{s}}\), but generalize poorly to the real-world data of \(\textrm{KITTI}_{\textrm{s}}\). When we randomize the initial sampling in each epoch (i.e., change the spatial locations for each epoch), the results on the \(\textrm{FT3D}_{\textrm{s}}\) data set drop slightly, but the accuracy and end-point error on \(\textrm{KITTI}_{\textrm{s}}\) are significantly improved. We observe a similar behavior when adding only the geometric augmentation, with an even larger positive impact on \(\textrm{KITTI}_{\textrm{s}}\). Both augmentation strategies together improve the overall results on both data sets and provide the best generalization from synthetic to real scenes.
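A compact sketch of both strategies is given below; the concrete parameter ranges of the geometric augmentation are specified in Sect. 4.3, so the rotation and translation ranges used here are illustrative placeholders only, and the function is not the training code of the paper.

import numpy as np

def augment_pair(pc1, pc2, flow, rng=None):
    """(1) Re-draw the input subset so the spatial locations change per epoch;
    (2) apply one random rigid transform to both frames and to the flow."""
    rng = np.random.default_rng() if rng is None else rng

    # (1) Randomized initial sampling at the training resolution (8192 points).
    n = min(8192, len(pc1), len(pc2))
    idx1 = rng.choice(len(pc1), size=n, replace=False)
    idx2 = rng.choice(len(pc2), size=n, replace=False)
    pc1, flow, pc2 = pc1[idx1], flow[idx1], pc2[idx2]

    # (2) Geometric augmentation: same rotation/translation for both frames;
    # the flow vectors are only rotated (the translations cancel out).
    angle = rng.uniform(-np.pi / 36, np.pi / 36)     # placeholder range
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    shift = rng.uniform(-0.5, 0.5, size=(1, 3))      # placeholder range
    return pc1 @ rot.T + shift, pc2 @ rot.T + shift, flow @ rot.T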

Fig. 11

Three examples from the non-occluded versions of \(\textrm{FT3D}_{\textrm{s}}\) show the failure cases of our RMS-FlowNet++ with high point densities using RS. The scene of each example (first row) visualizes \(P^t\) as green color. The error map of each scene (second row) shows the end-point error in meters according to the color map shown in the last row (Color figure online)

4.8 Limitations

In terms of accuracy, there are three major limitations: (1) Any errors at the coarsest level can accumulate in the higher resolution layers, degrading the overall accuracy. (2) Our one-to-one bidirectional matching at the coarsest resolution can lead to mismatches if the scene contains repetitive patterns (e.g., along road pillars or traffic barriers). (3) Areas of homogeneous geometry (e.g., road surfaces or grass along the road) pose a challenge to our model, especially when RS is used (cf. Fig. 10). The error increases significantly when these untextured objects are represented or scanned by high-density points, as shown in Fig. 11. In terms of efficiency, two major limitations remain: (1) To maintain accuracy at higher input densities, it is necessary to increase the resolution rates in the downsampling layers, which increases the runtime and memory requirements (cf. Fig. 7). (2) The KNN search dominates the computational complexity as the input density increases.

5 Conclusion

In this paper, we propose RMS-FlowNet++ – an efficient and fully supervised network for multi-scale scene flow estimation on high-density point clouds. By using Random-Sampling (RS) during feature extraction, we reduce the runtime and memory footprint, enabling efficient processing of point clouds with an unmatched maximum density. The novel Flow-Embedding module (called Patch-to-Dilated-Patch) resolves the prominent challenges of using RS for scene flow estimation. Compared to our preliminary work Battrawy et al. (2022), we reduce the number of operations in our network and improve the accuracy. We demonstrate the advantages of RS over FPS on high-density point clouds and its ability to generalize to FPS during inference. We provide an extensive benchmark, in which our RMS-FlowNet++ achieves the best results in terms of accuracy, generalization, and runtime compared to the previous state of the art. We also investigate the robustness of our network to occlusions and explore its ability to operate on long-range point clouds (i.e., up to 210 meters).

In the future, we would like to improve our model by fusing the 3D information of point clouds with the 2D texture information captured by RGB cameras. We also plan to add ego-motion estimation to our model to avoid inaccuracies in static, homogeneous areas such as the road surface.